[FoRK] image processing for OCR

Damien Morton < fork at bitfurnace.com > on > Mon Jul 3 06:54:43 PDT 2006

The algorithm is called a flood fill. 
http://student.kuleuven.be/~m0216922/CG/floodfill.html
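
For reference, a minimal sketch of a 4-connected flood fill in plain Python. The function name `flood_fill`, the list-of-lists image layout and the 0/1 pixel values are my own assumptions for illustration; they are not taken from the page linked above.

```python
from collections import deque

def flood_fill(image, start, new_value):
    """Fill the 4-connected region containing `start`, in place.

    `image` is a list of rows of pixel values (hypothetical layout).
    Returns the number of pixels that were changed.
    """
    rows, cols = len(image), len(image[0])
    r0, c0 = start
    old_value = image[r0][c0]
    if old_value == new_value:
        return 0  # nothing to do, and avoids looping forever
    image[r0][c0] = new_value
    filled = 1
    queue = deque([(r0, c0)])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbours (up, down, left, right).
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and image[nr][nc] == old_value:
                image[nr][nc] = new_value
                filled += 1
                queue.append((nr, nc))
    return filled

# A closed outline (1 = ink) with a single interior pixel.
closed = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# The same outline with one border pixel missing at the top.
broken = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

closed_count = flood_fill(closed, (2, 2), 1)
broken_count = flood_fill(broken, (2, 2), 1)
```

With the closed outline only the interior pixel is filled. With the broken outline the fill escapes through the gap and floods the whole background.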

Though with what you are doing, it's a bit different because the areas 
you want to fill aren't known in advance, and I imagine a fair portion of 
the regions you would want to fill are broken regions, i.e. with at 
least some pixels missing somewhere along their borders.

You could do a blur followed by its inverse (sharpen), which would close 
small gaps, but I fear that with finely detailed text you would create 
more problems than you would solve that way (i.e. by closing gaps that 
_should_ be there).
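
To make the trade-off concrete, here is a rough sketch. It uses morphological closing (dilate, then erode) on a 1-bit image, a swapped-in binary analogue of the blur/sharpen idea and not literally a blur followed by a sharpen. The names `dilate`, `erode` and `close_gaps`, the 3x3 window and the list-of-lists layout are all assumptions for illustration.

```python
def _morph(image, keep):
    """Apply `keep` to every 3x3 window; out-of-bounds pixels count as 0."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[r + dr][c + dc]
                if 0 <= r + dr < rows and 0 <= c + dc < cols else 0
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            ]
            out[r][c] = 1 if keep(window) else 0
    return out

def dilate(image):
    # A pixel becomes ink if any pixel in its 3x3 window is ink.
    return _morph(image, any)

def erode(image):
    # A pixel stays ink only if every pixel in its 3x3 window is ink.
    # Because out-of-bounds counts as background, ink touching the
    # image edge gets shrunk.
    return _morph(image, all)

def close_gaps(image):
    # Closing: grow the ink to bridge small gaps, then shrink it back.
    return erode(dilate(image))

# A horizontal stroke with a one-pixel break in the middle.
stroke = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

# A tiny closed shape whose one-pixel hole is meant to be there.
ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

closed_stroke = close_gaps(stroke)
closed_ring = close_gaps(ring)
```

The break in the stroke is bridged, which is what you want, but the hole in the ring is filled in as well, which is exactly the kind of gap that should have survived.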

Can you put some of these scans up somewhere for download?

> On Mon, Jul 03, 2006 at 02:07:40PM +0100, Andy Armstrong wrote:
> 
>> Are you looking to automate this or is it a one off?
> 
> Yes, something like run a batch over an incoming directory
> on a server, before plugging it into a FineReader or OmniPage
> pipeline.
>  
>> I think the easiest way to do it programmatically is to scan each  
>> raster turning filling on and off as you cross filled pixels. That  
>> implements the effect of filling the paths using an odd/even winding  
>> rule - which is what you want for text.
> 
> Unfortunately, it has to be a bit more intelligent than that. Only
> parts of the page have the stupid artefact, the others are fine.
> (Error rate is still lousy, though, there's definitely a need
> for an IUPAC proofreader). There *must* be an off-the-shelf filter
> for it already. It's just it's too old-skool for the web, or
> I don't know the proper terminology.
>  
>> The main problem then is handling the special case of the horizontal  
>> path segments at the tops and bottoms of letters - you only want to  
>> toggle filling where a path crosses the raster rather than where it  
>> just runs along the raster for a bit. The fact that your letter  
>> outlines are probably more than one pixel thick slightly complicates  
>> detecting that case.
> 
> I don't think this will work for http://eugen.leitl.org/sample.tif
> especially since the bottom of the scan is good-quality.
>  
>> Is that what you need to do? If so I'll try to provide more detail :)

