New Forward Search OCR Module

We are happy to announce a new OCR option for Forward Search 2.6. The new module makes it possible to search scanned documents, which are often stored in JPG, GIF, BMP or PDF format and contain only graphics, not text. The OCR (optical character recognition) module converts these image documents into text that is injected into the Forward Search index, ready for searching.

 

OCR engines extract and analyze the shape of each bitmapped character. Because this analysis produces a range of possible results, it is supplemented by post-analysis validation that identifies the most likely result, followed by the possible alternatives. Each candidate character is assigned a likelihood percentage, and the analysis and checks together derive the most plausible result.
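To illustrate the idea of ranked candidates, here is a minimal Python sketch. It is not the engine's actual code: the data layout (a list of character/likelihood pairs per position) and the function names rank_candidates and most_plausible_text are assumptions made for this example only.

```python
from typing import List, Tuple

# Hypothetical representation: (character, likelihood in percent).
Candidate = Tuple[str, float]


def rank_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from most likely to least likely."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)


def most_plausible_text(positions: List[List[Candidate]]) -> str:
    """Pick the top-ranked candidate at each character position."""
    return "".join(rank_candidates(p)[0][0] for p in positions)


# Made-up likelihoods for a three-character word.
positions = [
    [("e", 30.0), ("c", 62.0), ("o", 8.0)],
    [("a", 91.0), ("o", 9.0)],
    [("l", 45.0), ("t", 55.0)],
]

print(most_plausible_text(positions))
print(rank_candidates(positions[0]))
```

The ranked list keeps the alternatives available, so a later validation step can fall back to the second or third candidate when the top choice does not form a plausible word.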

 

OCR engines in general give only a near-accurate result, which can vary with several factors such as the quality of the original document, the language and the fonts. The Forward Search OCR module therefore fuzzy-matches the recognized words against the wordlist from the index of other relevant content (from the same source or organization) and corrects each word to improve recognition quality.
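The correction step can be sketched roughly as follows. This is an illustration only, not the module's implementation: it uses Python's standard difflib for the fuzzy match, and the function names, the sample wordlist and the 0.8 similarity cutoff are all assumptions chosen for the example.

```python
import difflib
from typing import List


def correct_word(word: str, wordlist: List[str], cutoff: float = 0.8) -> str:
    """Return the closest known word, or the input unchanged if none is close.

    The cutoff (0.0 to 1.0) is an assumed threshold; a real deployment
    would tune it against its own documents. Comparison is case-sensitive.
    """
    if word in wordlist:
        return word
    matches = difflib.get_close_matches(word, wordlist, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_words(words: List[str], wordlist: List[str]) -> List[str]:
    """Apply the correction to every recognized word."""
    return [correct_word(w, wordlist) for w in words]


# Hypothetical wordlist taken from already indexed content.
index_wordlist = ["invoice", "contract", "customer"]

# Hypothetical OCR output with typical recognition errors.
ocr_words = ["invoce", "contraet", "xyz"]

print(correct_words(ocr_words, index_wordlist))
```

Words with no sufficiently close match are left untouched, so names and terms that are new to the index are not forced onto an unrelated known word.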

 

The Forward Search OCR module is based on the Tesseract open source OCR engine, originally developed at Hewlett-Packard. The module has been tested with Danish and English but supports 33 languages, all of which are fully trainable.

 

The OCR module can add value for organizations that hold large numbers of documents that are not currently searchable. Once processed, these documents can be made accessible from, for example, an intranet solution, increasing findability.


Contact

Forward IT ApS

Rådhustorvet 1, 1 - DK-3520 Farum - Denmark

Phone (+45) 70 27 43 11 - e-mail 

Partner Portal & Support