Getting Text and Table Data from Printed Materials (Topic)

From TaDa Wiki


Optical Character Recognition (OCR) is the name for a set of tools that take images of text (from a scan or photograph) and attempt to convert them into computer-readable text that you can easily manipulate. Below are a number of tools for OCRing documents, such as pages you have found in books. However, be aware that OCR is far from foolproof -- whatever tool you use, proofread your results, and be ready to be frustrated. Some fonts are harder to OCR than others, low-quality scans cause recognition errors, and so on.
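Proofreading goes faster if you know where to look first. The sketch below is one possible heuristic, not part of any OCR tool: it flags tokens that mix letters and digits (such as "c0mputer"), which are often misread characters. The function name `flag_suspicious_tokens` and the sample text are made up for illustration, and the heuristic will also flag legitimate codes like "A4", so treat the output as a list of places to check by eye rather than a list of errors.

```python
import re


def flag_suspicious_tokens(text):
    """Return (line_number, token) pairs for tokens mixing letters and digits.

    Assumption: in ordinary prose, a token such as "Tota1" is more likely an
    OCR misread than a real word. Pure numbers and pure words are not flagged.
    """
    flagged = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"\w+", line):
            has_letter = any(ch.isalpha() for ch in token)
            has_digit = any(ch.isdigit() for ch in token)
            if has_letter and has_digit:
                flagged.append((line_number, token))
    return flagged


# Hypothetical OCR output with two misread characters.
sample = "The c0mputer read 12 pages.\nAll 3 tables looked fine.\nTota1 revenue fell."
for line_number, token in flag_suspicious_tokens(sample):
    print(f"line {line_number}: {token}")
# prints:
# line 1: c0mputer
# line 3: Tota1
```

A check like this only catches one class of mistake. It will not notice a real word substituted for another real word, so it does not replace reading the results yourself.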

Tools for Text

There are two main programs for OCRing text:

Tools for Data in Tables

OCRing data formatted in tables requires special tools. Tabula is a nice tool, but there are no foolproof solutions.

OCRing Entire Books

Want to OCR an entire book? There are services for that!

  • 1DollarScan: Fast and cheap (as little as $0.01 a page), but be aware you don't get your books back -- to automate the process they cut the spine off the book so they can feed the pages through a machine one at a time.
  • eRecords USA: More expensive, but they return your books intact.

Hiring People to Transcribe or Clean Data

Because OCR is often imperfect, you may need to hire people to review your OCR results. Or, if you're working with bad prints or a weird font, you may just need to hire people to do the whole job by hand. Two good options are:

  • oDesk: oDesk is like eBay for freelance contractors of all sorts. You either invite individuals to work for you, or post a job and receive applications. The site covers a full range of skill sets, from people who just do data entry to people with advanced degrees and computer skills. To digitize something on oDesk, you post the job, pick a contractor, hire them to do the job, and then pay them. Three recommendations about oDesk:
    • Always give your contractors a trial job -- quality is hard to discern on these sites, especially if your job requires some critical thought on the part of the contractor, so pick a couple of candidates and give them all a trial job before deciding who to work with.
    • Give bonuses for good work. People really appreciate them. And if you find someone good, you'll probably want to use them again. Plus, the contractors rate you too, so it's good to establish a healthy relationship.
    • For data entry, I recommend hiring two people for each task -- then you can check whether the data was correctly entered by comparing the results from the two contractors.
  • Amazon MTurk: MTurk operates differently from oDesk -- rather than picking a contractor and working with them directly, you set up a kind of interface (with some help from Amazon) where a generic user is presented with a small task. For example, the image of one word may appear, and the user has to type the word they see into a box. You never interact with your workers. Unlike oDesk, where you can just send some PDFs or an Excel sheet, you have to put more energy into creating this streamlined interface, but it's much, much cheaper.
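
The recommendation above to hire two people for each data entry task only pays off if you actually compare their results. Here is a minimal sketch of that comparison, assuming both contractors return CSV files with the same layout. The function names `read_rows` and `find_disagreements` and the sample data are invented for illustration.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(csv_text)))


def find_disagreements(entry_a, entry_b):
    """Return (row, column, value_a, value_b) for every cell that differs.

    Rows and columns are 1-indexed. A cell missing from one entry is
    reported as None. Surrounding whitespace is ignored.
    """
    disagreements = []
    for r in range(max(len(entry_a), len(entry_b))):
        row_a = entry_a[r] if r < len(entry_a) else []
        row_b = entry_b[r] if r < len(entry_b) else []
        for c in range(max(len(row_a), len(row_b))):
            a = row_a[c].strip() if c < len(row_a) else None
            b = row_b[c].strip() if c < len(row_b) else None
            if a != b:
                disagreements.append((r + 1, c + 1, a, b))
    return disagreements


# Hypothetical results from two contractors keying in the same table.
contractor_a = "year,population\n1900,1523\n1910,1788\n"
contractor_b = "year,population\n1900,1528\n1910,1788\n"

rows_a = read_rows(contractor_a)
rows_b = read_rows(contractor_b)
for row, col, a, b in find_disagreements(rows_a, rows_b):
    print(f"row {row}, column {col}: {a!r} vs {b!r}")
# prints: row 2, column 2: '1523' vs '1528'
```

Every cell the script reports needs to be checked against the original document. Cells where the two contractors agree are probably right, but not certainly: both may misread the same smudged digit the same way.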