[FoRK] image processing for OCR
Damien Morton <fork at bitfurnace.com>
Mon Jul 3 06:54:43 PDT 2006
The algorithm is called a flood fill.
Though with what you are doing, it's a bit different because the areas
you want to fill aren't known in advance, and I imagine a fair portion of
the regions you would want to fill are broken regions, i.e. with at
least some pixels missing somewhere along their borders.
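For reference, a minimal sketch of a 4-connected flood fill on a binary image stored as a list of lists (1 = ink, 0 = background). The function name `flood_fill` and the toy grids are illustrative assumptions, not taken from any particular library. The second grid shows the broken-border problem: one missing outline pixel lets the fill leak out into the whole page.

```python
from collections import deque

def flood_fill(grid, start, new_value):
    """Fill the 4-connected region containing `start` in place.

    Returns the number of pixels that were changed.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = start
    old_value = grid[r0][c0]
    if old_value == new_value:
        return 0
    queue = deque([(r0, c0)])
    grid[r0][c0] = new_value
    filled = 1
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbours (no diagonals).
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == old_value:
                grid[nr][nc] = new_value
                filled += 1
                queue.append((nr, nc))
    return filled

# A closed outline: only the single interior pixel is filled.
closed = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
closed_count = flood_fill(closed, (2, 2), 1)

# The same outline with one border pixel missing: the fill escapes.
broken = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
broken_count = flood_fill(broken, (2, 2), 1)
```

A plain fill like this also needs a seed point per region, which is the other half of the problem when the regions aren't known in advance.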
You could do a blur followed by its inverse (sharpen), which would close
small gaps, but I fear that with finely detailed text, you would create
more problems than you would solve that way (i.e. by closing gaps that
_should_ be there).
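To make the gap-closing idea concrete, here is a sketch using a different but related operation: morphological closing (dilate, then erode) on a binary image. This is swapped in for the blur/sharpen pair because it is easy to show exactly; it is not the same filter. The helper names `dilate`, `erode` and `close_gaps` and the fixed 3x3 neighbourhood are assumptions for illustration. The example closes a one-pixel gap in a stroke, and it would close that gap just as readily if it were a legitimate space between two glyphs.

```python
def dilate(img):
    """Set a pixel if any pixel in its 3x3 neighbourhood is set."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and img[nr][nc]:
                        out[r][c] = 1
    return out

def erode(img):
    """Keep a pixel only if every in-bounds 3x3 neighbour is set."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            keep = True
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    # Out-of-bounds neighbours are ignored.
                    if 0 <= nr < rows and 0 <= nc < cols and not img[nr][nc]:
                        keep = False
            out[r][c] = 1 if keep else 0
    return out

def close_gaps(img):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(img))

# A horizontal stroke with a one-pixel gap at column 4.
stroke = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
closed_stroke = close_gaps(stroke)
```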
Can you put some of these scans up somewhere for download?
> On Mon, Jul 03, 2006 at 02:07:40PM +0100, Andy Armstrong wrote:
>> Are you looking to automate this or is it a one off?
> Yes, something like run a batch over an incoming directory
> on a server, before plugging it into a FineReader or OmniPage
>> I think the easiest way to do it programmatically is to scan each
>> raster turning filling on and off as you cross filled pixels. That
>> implements the effect of filling the paths using an odd/even winding
>> rule - which is what you want for text.
> Unfortunately, it has to be a bit more intelligent than that. Only
> parts of the page have the stupid artefact, the others are fine.
> (Error rate is still lousy, though, there's definitely a need
> for an IUPAC proofreader). There *must* be an off-the-shelf filter
> for it already. It's just it's too old-skool for the web, or
> I don't know the proper terminology.
>> The main problem then is handling the special case of the horizontal
>> path segments at the tops and bottoms of letters - you only want to
>> toggle filling where a path crosses the raster rather than where it
>> just runs along the raster for a bit. The fact that your letter
>> outlines are probably more than one pixel thick slightly complicates
>> detecting that case.
> I don't think this will work for http://eugen.leitl.org/sample.tif
> especially since the bottom of the scan is good-quality.
>> Is that what you need to do? If so I'll try to provide more detail :)
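A rough sketch of the raster-toggling idea quoted above, for a binary image held as a list of lists (1 = ink). The function name `scanline_fill` is made up for illustration. Each run of consecutive ink pixels is treated as one crossing, so outlines thicker than one pixel are handled. The horizontal-segment case is not really solved here: a row with an odd number of runs is simply left untouched, which is an assumed shortcut, and a row containing two horizontal segments would still be filled wrongly.

```python
def scanline_fill(img):
    """Fill between pairs of ink runs on each row (odd/even rule, naive)."""
    out = [row[:] for row in img]
    for r, row in enumerate(img):
        # Collect runs of consecutive ink pixels as (start, end), end exclusive.
        runs = []
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                runs.append((start, c))
            else:
                c += 1
        # An odd run count means the row cannot be paired up cleanly,
        # e.g. it runs along the top or bottom of an outline. Skip it.
        if len(runs) % 2:
            continue
        # Fill the background between run 0 and 1, run 2 and 3, and so on.
        for i in range(0, len(runs), 2):
            for c in range(runs[i][1], runs[i + 1][0]):
                out[r][c] = 1
    return out

# A hollow box outline, one pixel thick.
box = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
filled_box = scanline_fill(box)
```

Unlike a flood fill, this needs no seed points, but it has no notion of which parts of the page actually need repair.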