I just read an article in eDiscovery Times about Beyond Recognition:
BeyondRecognition Ranked as Top Disruptive eDiscovery Technology to Watch in 2012 : eDiscovery Times
The author makes the point that OCR technology has not improved in the last 5 to 10 years, but that the Beyond Recognition (BR) application allows a much more sophisticated technique called "glyph clustering." I'm interested in this because we have to OCR all of our documents that have been digitized from paper. When the paper copy is not in good condition, it can be very difficult to get a good image and therefore an accurate representation of the text.
Has anyone used BR and what has been your experience with it?
I'd also be interested if anyone has used BR. OCR does not work well on my surveys that contain handwritten comments, but this BR approach sounds promising.
BeyondRecognition provides a number of document processing technologies that do far more than create text from images. One of the key functions is classifying documents based on visual similarity, NOT on a textual comparison. BR's visual similarity approach normalizes documents regardless of the type of container file: Word docs, PDFs printed directly from those Word docs, and scanned TIF or image-only PDF copies made from paper printouts of those files all get classified together despite differences in resolution or orientation. Well logs, maps, and graphs can be classified based on their appearance. The classification occurs automatically and is scalable to large collections or business processes. BR's visual coding can be used to quickly and accurately extract data elements from the classes for use in downstream data analytics programs.
For more information on BR's text creation, visual classification, visual coding, and logical document boundary determination capabilities, see the BR blog and website at: http://beyondrecognition.net/resources/document-u-blog/
Since you are part of BR's management team, maybe you can answer a couple of questions for me.
One, can BR be set to automatically identify and delete tables that are contained in a document?
Two, what kind of classification and categorization can it perform? E.g., given a bunch of documents (say 500,000 to 1 million or so), can it automagically group those documents into clusters that differentiate the documents' contents?
TIA,
Art
BR has a "negation" process in which it can remove or delete certain content, and, depending on what the tables look like, negation could be used to remove them. BR can also be used to redact specific terms.
To use your terminology, BR can "automagically" cluster millions of documents based on their visual appearance. The visual classification may provide sufficient differentiation on its own, or you may want to use visual coding to differentiate on coded values. For example, visual classification would put contracts of a certain type into a single class. To identify contracts with a specific customer or from a specific zip code, visual coding could be used to create fielded data for customer name or customer zip code - such fields can then be used to differentiate within the classification.
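To make the "differentiate within the classification" step concrete, here is a minimal Python sketch. It is not BR's API or output format: the `Document` class, the `group_by_field` helper, the class labels, and the field names are all hypothetical, and the sample records are made up. It only assumes that each document already carries a visual class label plus whatever fielded data visual coding produced.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    visual_class: str  # hypothetical label assigned by visual classification
    fields: dict = field(default_factory=dict)  # hypothetical fielded data from visual coding


def group_by_field(docs, visual_class, field_name):
    """Group the documents of one visual class by the value of a coded field."""
    groups = defaultdict(list)
    for doc in docs:
        if doc.visual_class != visual_class:
            continue  # only look inside the requested classification
        value = doc.fields.get(field_name)
        if value is not None:
            groups[value].append(doc.doc_id)
    return dict(groups)


# Made-up sample data for illustration only.
docs = [
    Document("d1", "supply_contract", {"customer_name": "Acme", "customer_zip": "77002"}),
    Document("d2", "supply_contract", {"customer_name": "Globex", "customer_zip": "77002"}),
    Document("d3", "supply_contract", {"customer_name": "Acme", "customer_zip": "10001"}),
    Document("d4", "well_log", {}),
]

by_zip = group_by_field(docs, "supply_contract", "customer_zip")
print(by_zip)  # {'77002': ['d1', 'd2'], '10001': ['d3']}
```

Calling the same helper with "customer_name" instead of "customer_zip" would split the same class by customer, which is the kind of second-level differentiation described above.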
OCR is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded, computer-readable text. A reliable OCR reader gives users fast and accurate image recognition, converting scanned images into searchable formats such as PDF, PDF/A, and Word, and it can typically read almost all common image formats. An OCR scanner can also batch-process large volumes of images and documents in over 40 languages and character sets.