I know this is an old question, but I believe OCR is improving, and even open-source approaches are maturing and being integrated into R: https://ropensci.org/blog/technotes/2017/08/17/tesseract-16

BR is interesting in that they appear to offer a solution that can generate a catalog of all your documents and then retrieve the relevant ones by searching on just about anything, not only text (for example, searching on the Nike "swoosh" logo). I believe they can also apply a correction to the entire catalog: if, say, a very specific "glyph" was misrecognized as a B rather than an 8, correcting it anywhere automatically applies to every document in the catalog. I never actually got my hands on BR, but it was this facet of the overall solution that made it sound interesting to me.
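To make the catalog-wide correction idea a bit more concrete, here is a toy Python sketch. It is not BR's implementation, which I have not seen. It only assumes that documents store references to shared glyph clusters rather than final characters, so relabeling one cluster changes every document that uses it. The `Catalog` class, its methods, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy catalog: documents are sequences of shared glyph-cluster ids."""

    # cluster id -> character label shared by every document (hypothetical)
    labels: dict[int, str] = field(default_factory=dict)
    # document name -> ordered glyph-cluster ids found on the page
    documents: dict[str, list[int]] = field(default_factory=dict)

    def text(self, name: str) -> str:
        # Text is resolved through the shared label table at read time.
        return "".join(self.labels[c] for c in self.documents[name])

    def correct(self, cluster_id: int, new_label: str) -> None:
        # One correction updates the shared table, not each document.
        self.labels[cluster_id] = new_label

    def search(self, query: str) -> list[str]:
        # Plain substring search over the resolved text of each document.
        return sorted(n for n in self.documents if query in self.text(n))


# Made-up data: cluster 0 was misrecognized as "B" but is really "8".
catalog = Catalog(
    labels={0: "B", 1: "1", 2: "0", 3: "A"},
    documents={"invoice": [0, 1, 2], "receipt": [3, 0, 0]},
)

before = {name: catalog.text(name) for name in catalog.documents}
catalog.correct(0, "8")  # fix the glyph once
after = {name: catalog.text(name) for name in catalog.documents}

print(before)
print(after)
```

Because each document only points at cluster ids, the single `correct` call is enough to change the text of both sample documents. A real system would also need some way to group visually similar glyphs into clusters in the first place, which this sketch skips entirely.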