Saturday, July 16, 2011

OCR in Linux: First Impression

Tonight I was cleaning up and found a couple of C programs from 1987. They were on paper, and since I've been scanning and then recycling paper lately, I considered doing the same with these. The programs were short, but I didn't want to take the time to type them in. They were interesting enough that I might want to play with them in the future: one logs a Unix user out, killing all his processes, and the other returns the file name part of a path (dropping the directory part in front and the file extension at the end). Okay, I'm not sure why that's interesting.
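To make the second program concrete, here is a minimal sketch of what "return the file name part of a path" could look like in C. This is not the 1987 code; the function name file_stem, its signature, and the choices to keep a leading dot (so ".profile" stays whole) and to drop only the last extension are my own assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the original 1987 routine.
 * Copies the file-name part of `path` into `out`, without the leading
 * directories and without the final extension. Returns `out`. */
char *file_stem(const char *path, char *out, size_t out_size)
{
    assert(path != NULL && out != NULL && out_size > 0);

    /* Skip everything up to and including the last '/'. */
    const char *slash = strrchr(path, '/');
    const char *name = slash ? slash + 1 : path;

    /* Find the last '.', ignoring a leading dot so ".profile" stays whole. */
    const char *dot = strrchr(name, '.');
    size_t len = (dot != NULL && dot != name) ? (size_t)(dot - name)
                                              : strlen(name);

    /* Truncate rather than overflow the caller's buffer. */
    if (len >= out_size)
        len = out_size - 1;
    memcpy(out, name, len);
    out[len] = '\0';
    return out;
}
```

Called with "/usr/src/logout.c" and a large enough buffer, this yields "logout". A path that ends in a slash yields an empty string.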

Besides, how many sheets of paper do you have lying around with 1987 dates marked "**** HONEYWELL FEDERAL SYSTEMS INC. CONFIDENTIAL and PROPRIETARY ****"? This dates to when I was on the SCOMP team, though it has nothing SCOMP-related or even remotely confidential or proprietary in it. What was SCOMP? See http://en.wikipedia.org/wiki/Multilevel_security

So I scanned the code with xsane and noticed, as I had before, the option to save as text. As before, it didn't work, but this time I paid attention to the error message: gocr not found. So I googled and then installed gocr (see http://jocr.sourceforge.net/).

gocr does a respectable job, but not a great one. However, it's better than typing from scratch. Here are some lines from the scanned text:


-    l2   jnclude stdio.h>
l3   include <signal.h>
l4
15   #define ALL_USEn_PROCS  -l
l6
l7   void main()
18
l9      int     kill();
20      jnt     process_id;
21      int     signal;
22
23      process  id = ALL USR PROCS;
24      signa! =- SIGKTL ;    -
25       void   rintf(''\nkil1ing all  rocesses...\n'');
26       void   ill(prOcess_id,  signa );

So it's missing underscores, curly brackets, and various other things. It may look as if gocr added line numbers, but those come from the paper listing. Not great, but it beats typing that code from scratch, I guess. There may be some options that help it do a better job, so if this becomes important I'll have a look.
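For comparison, here is my best guess at what those lines were meant to say, cleaned up and with the listing's line numbers removed. It is a reconstruction from the garbled text, not the original source. The assumptions are mine: the macro is spelled ALL_USER_PROCS (the scan shows two different spellings), the local variable is renamed from signal to sig so it doesn't shadow the signal() function, and the logic is wrapped in a hypothetical function with a dry_run flag so it can be tried without actually killing anything.

```c
#define _POSIX_C_SOURCE 200809L  /* make kill() visible under strict C modes */

#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Passing -1 as the pid asks kill() to signal every process the caller
 * is allowed to signal. Macro spelling is a guess from the scan. */
#define ALL_USER_PROCS (-1)

/* Hypothetical wrapper, not in the 1987 listing.
 * With dry_run nonzero it only prints the message and returns 0;
 * otherwise it sends SIGKILL and returns kill()'s result. */
int kill_all_user_processes(int dry_run)
{
    pid_t process_id = ALL_USER_PROCS;
    int sig = SIGKILL;

    printf("\nkilling all processes...\n");
    if (dry_run)
        return 0;
    return kill(process_id, sig);
}
```

Calling kill_all_user_processes(0) will end your login session, along with everything else you are running, so only the dry-run form is safe to experiment with.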
