concordia-memories.org

roger.pape · Posted: Thu Feb 25, 2010 8:22 pm Post subject: Searchable Image Files

Most of us have come to rely on a search engine, such as Google, Yahoo, Bing or the like, to find information on the Web. Similarly, using the search capability of various computer programs is almost indispensable for finding something like a name or place in large data files. Essentially all word processing, spreadsheet, database, and other applications have some form of ‘find’ function. This avoids the need to read through page after page of data for the information one is looking for.

You may have noticed that, apart from simple text files, much of the data on this website is posted in Adobe Acrobat pdf file format. The pdf format has become a widely accepted documentation standard. As opposed to proprietary formats, such as Microsoft Word .doc files, the pdf format can be viewed on all types of computers and in the various browsers with a free viewer. In addition to the Acrobat Reader provided by Adobe, there are other free readers available. I happen to like a freeware program called Foxit, available at http://www.foxitsoftware.com/pdf/reader/. This program is much faster and has a few features not found in the Adobe program.

The primary advantage of the pdf file format is that one can combine text and graphics in the same file. (Other programs, such as MS Word, can do this also, but pdf files tend to be smaller because they lack the added overhead found in a .doc file.) One must realize, however, that graphic images in themselves cannot be searched. When printed pages of text are converted to computer files with a scanner, the result is a series of images, not a string of text characters. So image-only pdf files cannot be searched by normal means. The solution is to use an optical character recognition (OCR) process to extract the text from the image. Fortunately, the pdf file standard also provides for an alternate format known as “searchable images”. This format adds a hidden text layer behind the images, with each recognized word aligned to its position in the image. The search function in a pdf reader can then locate character strings in the hidden layer and highlight any matches it finds on the page.
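For anyone curious how a hidden text layer makes an image searchable, here is a minimal sketch in Python. It is only an illustration of the idea, not how Foxit or Acrobat actually store or search a file. The `Word` record, the `find_matches` function, and the sample page with its pixel coordinates are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word in the (hypothetical) hidden text layer."""
    text: str    # the characters the OCR engine recognized
    x: int       # left edge of the word on the page image, in pixels
    y: int       # top edge of the word on the page image, in pixels
    width: int
    height: int


def find_matches(text_layer, query):
    """Return the words whose text contains the query, ignoring case."""
    q = query.lower()
    return [w for w in text_layer if q in w.text.lower()]


# Made-up text layer for one scanned page (assumed values, for illustration)
page = [
    Word("Concordia", 100, 50, 180, 24),
    Word("Memories", 290, 50, 160, 24),
    Word("scanned", 100, 90, 130, 24),
    Word("page", 240, 90, 80, 24),
]

# The reader searches the hidden text, then highlights the matching
# rectangle on top of the visible image.
for w in find_matches(page, "concordia"):
    print(f"highlight box at ({w.x}, {w.y}), size {w.width}x{w.height}")
```

The point is that the search never looks at the picture itself. It looks at the recognized words, and each word carries the location needed to draw the highlight over the image.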

Where feasible, I have tried to run scanned files through an OCR engine and post the resulting files in this searchable image format. (This does not apply to some files on other websites to which I have provided a link.) Automated OCR software is not perfect and may not provide 100% accuracy in its text recognition, particularly if the quality of the print is poor. So, if you search some of the files, you may not find all occurrences of the string you are looking for, but the results are surprisingly good.
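To show why a search can miss an occurrence, here is a small sketch in Python. The sample sentence and the misread word are invented for illustration; real OCR errors vary with the print quality and the software used.

```python
def count_occurrences(text, query):
    """Count exact occurrences of query in text, ignoring case."""
    return text.lower().count(query.lower())


# What is actually printed on the page (made-up sample)
printed_text = "Concordia alumni met at Concordia in June."

# What a hypothetical OCR pass recognized: the second 'i' was read as 'l'
ocr_text = "Concordia alumni met at Concordla in June."

print(count_occurrences(printed_text, "Concordia"))  # 2
print(count_occurrences(ocr_text, "Concordia"))      # 1
```

The word is still plainly readable in the image, but the search only sees the recognized text, so the second occurrence goes unnoticed.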

If you are at all interested, you might try the Foxit reader that I referred to above with a searchable image file. There is an icon in the middle of the toolbar at the top of the window (a page with eyeglasses) that will switch the display between the image and the hidden text layer.