JuliaM
Calcite | Level 5

I just read an article in eDiscovery Times about BeyondRecognition:

BeyondRecognition Ranked as Top Disruptive eDiscovery Technology to Watch in 2012 : eDiscovery Times

The author makes the point that OCR technology has not improved in the last 5 to 10 years, but that the BeyondRecognition (BR) application allows a much more sophisticated technique called "glyph clustering." I'm interested in this because we have to OCR all of our documents that have been digitized from paper. When the paper copy is not in good condition, it can be very difficult to get a good image and therefore an accurate representation of the text.

Has anyone used BR and what has been your experience with it?

6 REPLIES
jaredp
Quartz | Level 8

I'd also be interested if anyone has used BR.  OCR does not work well on my surveys that contain handwritten comments, but this BR approach sounds promising.

JoeHowie
Calcite | Level 5

BeyondRecognition provides a number of document processing technologies for far more than just creating text from images. One of the key functionalities is classifying documents based on visual similarity, NOT based on a textual comparison. BR's visual similarity approach serves to normalize documents regardless of the type of container file: Word docs, PDFs printed directly from those Word docs, and scanned TIF or image-only PDF copies made from paper printouts of those files all get classified together despite differences in resolution or orientation. Well logs, maps, and graphs can be classified based on their appearance. The classification occurs automatically and is scalable to large collections or business processes. BR's visual coding can be used to quickly and accurately extract data elements from the classes for use in downstream data analytics programs.

For more information on BR's text creation, visual classification, visual coding, and logical document boundary determination capabilities, see the BR blog and website at: http://beyondrecognition.net/resources/document-u-blog/
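
To make the idea of grouping by appearance rather than by text concrete, here is a minimal Python sketch. It is not BR's method, which is not described in this thread; it swaps in a simple "average hash" over a coarse grid, and every name in it (`average_hash`, `cluster`, `make_page`, the toy page layouts) is hypothetical. It only illustrates tolerance to resolution differences, not to orientation.

```python
def average_hash(pixels, size=4):
    """Reduce a grayscale grid to size x size block means, then threshold at the overall mean."""
    h, w = len(pixels), len(pixels[0])
    blocks = []
    for by in range(size):
        for bx in range(size):
            ys = range(by * h // size, (by + 1) * h // size)
            xs = range(bx * w // size, (bx + 1) * w // size)
            vals = [pixels[y][x] for y in ys for x in xs]
            blocks.append(sum(vals) / len(vals))
    mean = sum(blocks) / len(blocks)
    return tuple(b > mean for b in blocks)


def hamming(a, b):
    """Count positions where two hashes disagree."""
    return sum(x != y for x, y in zip(a, b))


def cluster(pages, max_distance=2):
    """Greedy grouping: a page joins the first cluster whose representative hash is close enough."""
    clusters = []  # list of (representative_hash, [page names])
    for name, pixels in pages.items():
        h = average_hash(pixels)
        for rep, members in clusters:
            if hamming(rep, h) <= max_distance:
                members.append(name)
                break
        else:
            clusters.append((h, [name]))
    return [members for _, members in clusters]


def make_page(n, layout):
    """Toy n x n page (assumption): white background with one dark band."""
    grid = [[255] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            if layout == "header" and y < n // 4:
                grid[y][x] = 0
            if layout == "sidebar" and x < n // 4:
                grid[y][x] = 0
    return grid


pages = {
    "pdf_export": make_page(8, "header"),    # low-resolution rendering
    "paper_scan": make_page(16, "header"),   # same layout, higher resolution
    "well_log": make_page(8, "sidebar"),     # visually different layout
}
groups = cluster(pages)
print(groups)
```

In this toy setup the two "header" pages land in one group even though their resolutions differ, and the "sidebar" page gets its own group, without any text being read.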

art297
Opal | Level 21

JoeHowie: Since you are part of BR's management team, maybe you can answer a couple of questions for me.

One, can BR be set to automatically identify and delete tables that are contained in a document?

Two, what kind of classification and categorization can it perform? E.g., given a large collection (say, 500,000 to 1 million or so documents), can it automagically group those documents into clusters that differentiate the documents' contents?

TIA,

Art

JoeHowie
Calcite | Level 5

BR has a "negation" process in which it can remove or delete certain content, and, depending on what the tables looked like, negation could be used to remove the tables. BR can also be used to redact specific terms.

To use your terminology, BR can "automagically" cluster millions of documents based on their visual appearance. The visual classification may provide sufficient differentiation, or you may want to use visual coding to differentiate further on coded values. For example, visual classification would put contracts of a certain type in the same visual classification. To identify contracts with a specific customer or from a specific zip code, visual coding could be used to create fielded data for customer name or customer zip code. Such fields can then be used to differentiate within the classification.
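
The two-step idea above (classify by appearance first, then differentiate on coded fields) can be sketched in a few lines of Python. This is only an illustration of the workflow, not BR's interface: the records, the field names `visual_class`, `customer`, and `zip_code`, and the helper functions are all made up for the example.

```python
from collections import defaultdict

# Hypothetical records: a visual class plus fielded data produced by visual coding.
documents = [
    {"id": "d1", "visual_class": "contract_type_A", "customer": "Acme", "zip_code": "30301"},
    {"id": "d2", "visual_class": "contract_type_A", "customer": "Globex", "zip_code": "10001"},
    {"id": "d3", "visual_class": "contract_type_A", "customer": "Acme", "zip_code": "10001"},
    {"id": "d4", "visual_class": "invoice", "customer": "Acme", "zip_code": "30301"},
]


def group_by_class(docs):
    """Step 1: bucket document ids by their visual classification."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["visual_class"]].append(doc["id"])
    return dict(groups)


def filter_within(docs, visual_class, **fields):
    """Step 2: inside one classification, keep documents whose coded fields all match."""
    return [
        doc["id"]
        for doc in docs
        if doc["visual_class"] == visual_class
        and all(doc.get(name) == value for name, value in fields.items())
    ]


by_class = group_by_class(documents)
acme_contracts = filter_within(documents, "contract_type_A", customer="Acme")
acme_in_10001 = filter_within(documents, "contract_type_A", customer="Acme", zip_code="10001")
print(by_class)
print(acme_contracts, acme_in_10001)
```

The visual class narrows the collection to one document type; the coded fields then do the finer differentiation within it.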

BenthamLEE
Calcite | Level 5

OCR is the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded, computer-readable text. A reliable OCR reader can give users fast and accurate image recognition, converting scanned images into searchable text formats such as PDF, PDF/A, and Word; almost all image formats can be detected and recognized by an OCR control. An OCR scanner can also batch-recognize and process large volumes of images and documents in over 40 languages and character sets.

JohnJPS
Quartz | Level 8
I know this is an old question... but I believe OCR is improving now, and even open source approaches are coming along and being integrated into R: https://ropensci.org/blog/technotes/2017/08/17/tesseract-16

BR is interesting in that they appear to offer a solution which can generate a catalog of all your documents and then retrieve relevant ones based on a search of just about anything in addition to text (example: search on the Nike "swoosh" logo). I believe they can also make corrections to the entire catalog if, say, a very specific "glyph" was mis-recognized as a B rather than an 8: if you make the correction anywhere, it automatically applies to every document in the catalog. I never actually got my hands on BR, but it was this facet of the overall BR solution that made it sound interesting to me.
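
Here is a rough Python sketch of why a single correction could propagate everywhere, assuming each glyph occurrence in a document stores a reference to a shared cluster rather than its own character. I have no knowledge of how BR actually stores this; `GlyphCatalog` and the sample data are purely hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GlyphCatalog:
    """Toy catalog: every glyph occurrence points at a shared cluster label."""
    # cluster id -> character currently assigned to that glyph shape
    labels: dict = field(default_factory=dict)
    # document name -> sequence of cluster ids, in reading order
    documents: dict = field(default_factory=dict)

    def text(self, doc):
        """Render a document by looking up each glyph's cluster label."""
        return "".join(self.labels[c] for c in self.documents[doc])

    def correct(self, cluster_id, char):
        """Relabel one cluster; every document that uses it changes at once."""
        self.labels[cluster_id] = char


catalog = GlyphCatalog(
    labels={0: "1", 1: "B", 2: "0"},  # cluster 1 was mis-recognized as "B"
    documents={"invoice_a": [0, 1, 2], "invoice_b": [1, 1, 0]},
)
before = catalog.text("invoice_a")
catalog.correct(1, "8")  # fix the glyph once
after_a = catalog.text("invoice_a")
after_b = catalog.text("invoice_b")
print(before, after_a, after_b)
```

Because the documents only hold cluster ids, fixing the label for one cluster changes the rendered text of both invoices without touching either document.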
