I was sent a whole bunch of .jpg files of scanned documents with text that I wanted to extract.
I have Microsoft Office Document Imaging (MODI) installed, so I was keen to use that to perform the OCR (instead of re-typing all the text!). The only problem is that MODI only understands TIFF and MDI formats.
ImageMagick's mogrify command can convert a whole batch of files in one go:
mogrify -format tiff *.jpg
This creates a new TIFF file for each JPEG file. The catch is that MODI doesn’t like the particular flavour of TIFF that gets generated. Fortunately ImageMagick has 1001 options to configure exactly what you want to happen.
After a bit of experimentation, I found that the following extra options generate TIFF files that MODI can read without problems:
mogrify -format tiff -colorspace RGB -compress RLE *.jpg
All good, except that I then discovered that the documents had been scanned at such a low DPI that the OCR wasn’t able to find any text :-(
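In hindsight, it would have been worth checking the resolution before bothering with the conversion. Here is a minimal sketch of how that might look with identify. show_resolution is a hypothetical helper name, and the numbers are only whatever density is recorded in each file's metadata, which a scanner may not have set correctly (or at all).

```shell
# Hypothetical helper: print the recorded horizontal and vertical
# resolution of each image passed in, expressed in pixels per inch.
# %x and %y are the ImageMagick format escapes for x and y resolution.
show_resolution() {
    identify -units PixelsPerInch -format '%f: %x x %y\n' "$@"
}
```

Calling show_resolution *.jpg on the original scans gives a quick idea of whether the OCR stands a chance.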
Something else that sounds interesting is that MODI can be programmed against. Maybe I could automate this even more!