OCR: Why it's our Friend
By Don Herion
The Process
Converting printed documents into computer-editable documents is fairly straightforward.
After starting TypeReader, I select my scanner, a Mustek 1200 LP (approx. cost
$100), and click the Auto Straighten and Auto Orientation
buttons. I leave everything else at the default settings.
I lay the book on my flatbed scanner. I then choose From Scanner under
the Get Page button and select Auto Start. My scanning
software starts, and there I finalize my scan settings. Based on what I've read
and some initial testing (I started at 200 dpi), I chose 300 dpi and selected black
& white. As long as you have good-quality documents (i.e., clean, readable text),
these settings will be sufficient.
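To get a feel for what the jump from 200 dpi to 300 dpi means, here is a rough back-of-the-envelope sketch. The page size (8.5 x 11 inches) is an assumption made for illustration, not a figure from this article, and the byte count ignores file headers, row padding, and compression.

```python
# Rough arithmetic for one scanned page in 1-bit black & white.
# ASSUMPTION: an 8.5 x 11 inch page; the article does not state a page size.
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0


def scan_size(dpi, bits_per_pixel=1):
    """Return (width_px, height_px, uncompressed_bytes) for one page."""
    width_px = round(PAGE_WIDTH_IN * dpi)
    height_px = round(PAGE_HEIGHT_IN * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, total_bits // 8


for dpi in (200, 300):
    w, h, size = scan_size(dpi)
    print(f"{dpi} dpi: {w} x {h} pixels, about {size / 1024:.0f} KiB uncompressed")
```

Under that assumption, 300 dpi gives 2.25 times as many pixels per page as 200 dpi, which means more detail for the OCR engine to work with on each character, at the cost of a larger image.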
[Figure: Converted text with errors highlighted]
I then scan the document, and it is immediately sent to TypeReader. There
the image is broken up into sections, and the software converts those sections
into computer-editable text. You then have the option of exporting the new document in a number
of formats, including .html, .rtf, .doc, and others. I tried the .html export
but did not like the result: it created lots of tables. I ended up saving my
files as .rtf. I then opened the files in Word, ran a spell check, saved
them as .html, and imported them into our Dreamweaver template. I scanned
the images separately (72 dpi, color) and imported them after cropping them
in Photoshop.
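The workflow above sidesteps the table-heavy .html export by going through .rtf and Word. Purely as an alternative sketch, and not something done in this article, table-laden HTML could also be flattened into plain paragraphs with Python's standard `html.parser` module. The sample markup below is hypothetical and is not actual TypeReader output.

```python
from html import escape
from html.parser import HTMLParser


class TableFlattener(HTMLParser):
    """Collect the text of each table cell or paragraph as one plain paragraph."""

    BLOCK_TAGS = {"td", "th", "p"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []

    def flush(self):
        # Join buffered text, collapse runs of whitespace, keep non-empty results.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def flatten(html_text):
    """Return the cell and paragraph texts of html_text, in document order."""
    parser = TableFlattener()
    parser.feed(html_text)
    parser.close()
    parser.flush()  # pick up any trailing text outside a cell or paragraph
    return parser.paragraphs


# HYPOTHETICAL sample markup, made up for illustration only.
sample = (
    "<table><tr><td>OCR is a great</td><td>technology.</td></tr>"
    "<tr><td>It saved me a ton of work.</td></tr></table>"
)
for paragraph in flatten(sample):
    print(f"<p>{escape(paragraph)}</p>")
```

A sketch like this throws away all layout, so it only makes sense when the tables exist for positioning rather than for tabular data. The .rtf route described above avoids the problem altogether.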
99 percent?
Is OCR 99% accurate? The answer is: pretty close. One of the
pluses of TypeReader 6.0 is the 'Properties' window that pops up, revealing the
number of 'suspect' and 'illegible' characters on the page. I did have to spend
a little time correcting mistakes. Most come from converting certain kinds of
type: italicized fonts, quotation marks, and some periods and commas were the
usual suspects that I had to fix. But I was quite impressed with the speed and
accuracy of the process.
[Figure: Property window reveals possible errors]
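To put "99 percent" in perspective, the following sketch turns counts of suspect and illegible characters into a rough accuracy figure, and turns an accuracy figure back into an expected number of corrections. The function names and the page numbers are hypothetical, chosen only for illustration; they are not values reported by TypeReader. Note also that a flagged character is not necessarily wrong, and an unflagged one is not necessarily right, so this is only an estimate.

```python
def estimated_accuracy(total_chars, suspect, illegible):
    """Return the percentage of characters that were not flagged."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    flagged = suspect + illegible
    return 100.0 * (total_chars - flagged) / total_chars


def expected_errors(total_chars, accuracy_percent):
    """Return roughly how many characters would be wrong at a given accuracy."""
    return round(total_chars * (100.0 - accuracy_percent) / 100.0)


# HYPOTHETICAL page: 2,000 characters, 14 suspect, 3 illegible.
print(f"Estimated accuracy: {estimated_accuracy(2000, 14, 3):.2f}%")
print(f"Errors on a 2,000-character page at 99%: {expected_errors(2000, 99.0)}")
```

Even at 99 percent, a page of a couple of thousand characters can leave around twenty characters to fix, which is why some correction time is still needed.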
Serious OCR
If you are seriously considering using OCR to convert many documents,
get yourself a fast flatbed scanner with an auto document feeder. Of course, you
cannot run a book through an auto feeder, and a feeder will add several hundred dollars
to the price of your scanner.
Final Thoughts
OCR is a great technology that has come a long way in the last
few years. It has sure saved me a ton of work. It could do the same for you.
Check out the final result: Sample Chapter 10 - HTML 4 for the World Wide Web