Hi All,
Can anyone help me with how to import and parse PDF and JPG files in SAS Text Miner?
Thanks in advance.
Regards,
Kaushal Solanki
If you have the IML interface to R you can use Google Tesseract. This is the tool
Google uses to digitize books.
Below are two methods: one for jpeg, tiff, png, bmp, gif and pdf (if the PDF is an image),
and a second one for PDFs with embedded text.
/* T0099610 OCR using SAS/WPS and Tesseract-OCR (state of the art google offering)
Administrative information
Where I originally got the Tesseract package:
https://github.com/UB-Mannheim/tesseract/wiki
There is a GUI available, and a console version comes with both distributions.
I did not install the GUI.
Also located on my Google Drive (use this before they do a SAS-like enhancement):
https://drive.google.com/file/d/0ByX2ii2B0Rq9MmZmVVNjLXpNdkU/view?usp=sharing
You need this to handle PDFs:
Get Ghostscript here (note that Ghostscript can combine PDF files and can convert PDF to TIFF).
http://www.ghostscript.com/download/gsdnld.html
You really only need one executable.
===============================================
HAVE SIX IMAGE FILES FROM WHICH I NEED TO EXTRACT THE TEXT
I have these image files
 png.png
 bmp.bmp
 jpg.jpg
 tiff.tiff
 gif.gif
 pdftif.tiff  (you need to convert the PDF to an Image file)
IMAGE LOOKS LIKE (from proc gslide)
MYSTUDY C304456                               AJAX
DRAFT                                             VER 1.0
                      Ajax Study
                   Dose and Placebo
                         NOTE1
                         NOTE2
                         NOTE3
                         NOTE
PGM: C:\Tut\Tut_GrfTwoWthTtl.sas
OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - 16AUG16 06:21
================================================
WANT ( All of them converted to text)
MYSTUDY C304456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTE2
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwoWthTtl.sas
OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - 16AUG16 06:21
SOLUTION
Basically it is one command for an image file and two commands for a PDF.
  * convert to text example - bmp to text (tesseract appends .txt to the output base);
  x c:/progra~2/tesseract-ocr/tesseract d:/tesser/slide.bmp d:/tesser/slide;
 Note: Tesseract-OCR is not perfect.
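For anyone who wants to script the same call outside SAS, here is a minimal Python sketch. It assumes `tesseract` is on the PATH rather than at the install path shown above, and `build_tesseract_cmd` and `ocr_to_text` are made-up helper names, not part of any package. Tesseract treats its second argument as an output base and adds `.txt` itself, so the sketch passes the base without an extension.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, exe="tesseract"):
    """Return (argv, text_path) for OCR of a single image file.

    The output base is the image path without its extension;
    Tesseract appends ".txt" to that base on its own.
    """
    image = Path(image)
    out_base = image.with_suffix("").as_posix()   # d:/tesser/slide.bmp -> d:/tesser/slide
    argv = [exe, image.as_posix(), out_base]
    return argv, Path(out_base + ".txt")


def ocr_to_text(image, exe="tesseract"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which(exe) is None:                 # assumption: exe is on the PATH
        raise FileNotFoundError(f"{exe} not found on PATH")
    argv, text_path = build_tesseract_cmd(image, exe)
    subprocess.run(argv, check=True)              # raises if Tesseract fails
    return text_path.read_text(encoding="utf-8", errors="replace")


argv, text_path = build_tesseract_cmd("d:/tesser/slide.bmp")
print(argv)
print(text_path.as_posix())
```

Calling `ocr_to_text("d:/tesser/slide.bmp")` would run the command and hand back whatever Tesseract wrote, errors and all.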
Create the image files and convert to text using
%macro tesser(device=png300,type=png);
 filename outfile "d:\tesser\&type..&type";
 goptions
    reset=all
    rotate=portrait
    gsfmode = replace
    device  = &device
    gsfname = outfile
    vsize=10in
    hsize=8in
    htext=2 ;
  run;quit;
  proc gslide ;
   title1 j=l h=2 font='Simplex' "MYSTUDY C04456" j=r "AJAX";
   title2 j=l h=2 font='Courier' "DRAFT" j=r "VER 1.0";
   title3 j=l h=2 " ";
   title4 j=c h=3.0 font='Helvetica' "Ajax Study";
   title5 j=c h=3.0 font='Arial' "Dose and Placebo";
   note j=c h=5 "NOTE1";
   note j=c h=5 "NOTE2";
   note j=c h=5 "NOTE3";
   note j=c h=5 "NOTE";
   footnote1 j=l h=2 font='Times Roman' "PGM: C:\Tut\Tut_GrfTwoWthTtl.sas ";
   footnote2 j=l h=2 font='Helvetica'   "OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - &sysdate &systime";
  run;
  quit;
  goptions reset=all;
  filename outfile clear;
  run;quit;
 ***          *****  *****  *   *  *****
*   *           *    *      *   *    *
    *           *    *       * *     *
  **            *    ****     *      *
 *              *    *       * *     *
*               *    *      *   *    *
*****           *    *****  *   *    *
#! 2aTEXT ;
  * convert to text (tesseract appends .txt to the output base);
  x c:/progra~2/tesseract-ocr/tesseract d:/tesser/&type..&type d:/tesser/&type;
%mend tesser;
%tesser(device=png300,type=png);
%tesser(device=jpeg300,type=jpg);
%tesser(device=bmp,type=bmp);
%tesser(device=tiff,type=tiff);
%tesser(device=gif,type=gif);
* Here is how to handle a PDF;
Note: PDFs often contain embedded fonts (real text rather than an image), so it is usually better to extract the text directly with another tool.
There are many free tools: Acrobat (even Reader?), pdfwriter, pdf2text, Boxoft, and others in R and Python.
You can convert the PDF to an image using the free Ghostscript at http://www.ghostscript.com/download/gsdnld.html.
You really only need gswin64.exe.
Converting a PDF to a TIFF and extracting the text using Tesseract:
x C:\Progra~1\gs\gs9.19\bin\gswin64.exe -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=d:/tesser/pdftif.tiff d:/tesser/pdf.pdf;
x c:/progra~2/tesseract-ocr/tesseract d:/tesser/pdftif.tiff d:/tesser/pdftif;
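The two-step PDF route can be chained in one place. This is only a sketch in Python: it assumes both `gswin64` and `tesseract` are on the PATH, reuses the Ghostscript switches from the command above unchanged, and `pdf_ocr_commands` and `run_all` are hypothetical names.

```python
import subprocess
from pathlib import Path


def pdf_ocr_commands(pdf, gs_exe="gswin64", tesseract_exe="tesseract"):
    """Build the Ghostscript and Tesseract command lines for one image-only PDF."""
    base = Path(pdf).with_suffix("").as_posix()   # d:/tesser/pdf.pdf -> d:/tesser/pdf
    tiff = base + ".tiff"
    gs_cmd = [
        gs_exe, "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffg4",                        # same device as in the thread
        f"-sOutputFile={tiff}",
        Path(pdf).as_posix(),
    ]
    # Tesseract writes its text to <base>.txt
    ocr_cmd = [tesseract_exe, tiff, base]
    return gs_cmd, ocr_cmd


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


gs_cmd, ocr_cmd = pdf_ocr_commands("d:/tesser/pdf.pdf")
print(" ".join(gs_cmd))
print(" ".join(ocr_cmd))
```

`run_all(pdf_ocr_commands("d:/tesser/pdf.pdf"))` would then do the conversion and the OCR back to back.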
TEXT files I created (not perfect, and each may be a little different)
/*
****   *   *   ***
*   *  **  *  *   *
*   *  * * *  *
****   *  **  * ***
*      *   *  *   *
*      *   *  *   *
*      *   *   ***
#! PNG ;
d:/tesser/png.txt
MYSTU DY CO4456 AJAX
DRAFT VER 1 .0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTES
NOTE
PGM: C:\| ut\l 961w§ththas
OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21
    *  ****    ***
    *  *   *  *   *
    *  *   *  *
    *  ****   * ***
    *  *      *   *
*   *  *      *   *
 ***   *       ***
#! JPG ;
MYSTUDY CO4456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21
MYSTUDY (304456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTES
NOTE
PGM: C:\| ut\l 961w§ththas
OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21
****   ****   *****  *****  *****  *****
*   *   *  *  *        *      *    *
*   *   *  *  *        *      *    *
****    *  *  ****     *      *    ****
*       *  *  *        *      *    *
*       *  *  *        *      *    *
*      ****   *        *    *****  *
MYSTUDY CO4456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTE2
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21
****   *   *  ****
 *  *  ** **  *   *
 *  *  * * *  *   *
 ***   *   *  ****
 *  *  *   *  *
 *  *  *   *  *
****   *   *  *
#! BMP ;
MYSTUDY CO4456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21
*****  *****  *****  *****
  *      *    *      *
  *      *    *      *
  *      *    ****   ****
  *      *    *      *
  *      *    *      *
  *    *****  *      *
#! TIFF ;
MYSTUDY (304456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTES
PGM: C:\| ut\l 961w§ththas
OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21
 ***   *****  *****
*   *    *    *
*        *    *
* ***    *    ****
*   *    *    *
*   *    *    *
 ***   *****  *
#! GIF ;
MYSTU DY CO4456 AJAX
DRAFT VER 1 .0
Ajax Study
Dose and Placebo
NOTE1
NOTE2
NOTE3
NOTE
PGM: C:\Tut\Tut_Gn‘TwthhTt|.sas
OUT: C:\Tut\Tut_Gn‘TwthhTt|.pdf - 16AUG16 06:21
MYSTU DY CO4456 AJAX
DRAFT VER 1 .0
Ajax Study
Dose and Placebo
NOTE1
NOTE2
NOTE3
NOTE
PGM: C:\Tut\Tut_Gn‘TwthhTt|.sas
OUT: C:\Tut\Tut_Gn‘TwthhTt|.pdf - 16AUG16 06:21   
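One way to put a number on how far each result drifts from the WANT text is a line-by-line similarity score. Here is a rough Python sketch using the standard `difflib` module. `line_scores` is a hypothetical helper, the sample strings are copied from the listings above, and pairing lines by position is an assumption that breaks if the OCR drops a line (as the TIFF result does).

```python
from difflib import SequenceMatcher


def line_scores(expected, actual):
    """Pair lines by position and score each pair between 0.0 and 1.0."""
    scores = []
    for exp, act in zip(expected, actual):
        ratio = SequenceMatcher(None, exp, act).ratio()
        scores.append((exp, act, ratio))
    return scores


# A few lines from the WANT text and from the PNG result
want = ["MYSTUDY C304456 AJAX", "DRAFT VER 1.0", "NOTE2"]
png = ["MYSTU DY CO4456 AJAX", "DRAFT VER 1 .0", "NOTEZ"]

for exp, act, ratio in line_scores(want, png):
    flag = "" if exp == act else "  <-- differs"
    print(f"{ratio:.2f}  {act!r}{flag}")
```

A score of 1.0 means the OCR line matches exactly; anything lower points at a line worth checking by eye.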
/* T1001390 WPS/SAS/R: Converting PDF tables to SAS datasets (simple example)
There are more options in the R tm (text mining) package.
HAVE ( PDF file with the table below)
======================================
  NAME        SEX     AGE    HEIGHT   WEIGHT
  Alfred       M       14        69    112.5
  Alice        F       13      56.5       84
  Barbara      F       13      65.3       98
  Carol        F       14      62.8    102.5
  Henry        M       14      63.5    102.5
  James        M       12      57.3       83
  Jane         F       12      59.8     84.5
  Janet        F       15      62.5    112.5
  Jeffrey      M       13      62.5       84
  John         M       12        59     99.5
  Joyce        F       11      51.3     50.5
  Judy         F       14      64.3       90
  Louise       F       12      56.3       77
  Mary         F       15      66.5      112
  Philip       M       16        72      150
  Robert       M       12      64.8      128
  Ronald       M       15        67      133
  Thomas       M       11      57.5       85
  William      M       15      66.5      112
WANT  (SAS dataset)
===================
Up to 40 obs from sashelp.class total obs=19
Obs    NAME        SEX    AGE   HEIGHT  WEIGHT
  1    Alfred       M      14       69   112.5
  2    Alice        F      13     56.5      84
  3    Barbara      F      13     65.3      98
  4    Carol        F      14     62.8   102.5
  5    Henry        M      14     63.5   102.5
  6    James        M      12     57.3      83
  7    Jane         F      12     59.8    84.5
  8    Janet        F      15     62.5   112.5
  9    Jeffrey      M      13     62.5      84
 10    John         M      12       59    99.5
 11    Joyce        F      11     51.3    50.5
 12    Judy         F      14     64.3      90
 13    Louise       F      12     56.3      77
 14    Mary         F      15     66.5     112
 15    Philip       M      16       72     150
 16    Robert       M      12     64.8     128
 17    Ronald       M      15       67     133
 18    Thomas       M      11     57.5      85
 19    William      M      15     66.5     112
WORKING CODE
============
  file <- "d:/pdf/class.pdf";
  Rpdf <- readPDF(control = list(text = "-layout"));
  corpus <- VCorpus(URISource(file),
        readerControl = list(reader = Rpdf));
  classtext <- as.data.frame(content(content(corpus)[[1]]));
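The reader hands back one string per PDF line, so a final split is still needed to get from those lines to the WANT table. Below is a minimal Python sketch of that split. It assumes whitespace-separated fields with no blanks inside NAME (true for sashelp.class, not for tables in general), and `parse_class_lines` is a hypothetical helper.

```python
def parse_class_lines(lines):
    """Turn layout-preserved text lines into a list of typed records.

    The first non-blank line is taken as the header; data lines whose
    field count does not match the header are skipped.
    """
    rows = [ln.split() for ln in lines if ln.strip()]   # drop blank lines
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    records = []
    for fields in body:
        if len(fields) != len(header):
            continue                                    # not a table row
        rec = dict(zip(header, fields))
        rec["AGE"] = int(rec["AGE"])
        rec["HEIGHT"] = float(rec["HEIGHT"])
        rec["WEIGHT"] = float(rec["WEIGHT"])
        records.append(rec)
    return records


# Lines as they come out of the PDF, including the blank lines in between
sample = [
    "NAME SEX AGE HEIGHT WEIGHT",
    "",
    "Alfred M   14 69.0 112.5",
    "",
    "Alice  F   13 56.5  84.0",
]

for rec in parse_class_lines(sample):
    print(rec)
```

A name with an embedded blank would shift the columns, which is why mismatched lines are skipped here instead of guessed at.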
FULL SOLUTION
=============
* create a pdf;
title;footnote;
ods pdf file="d:/pdf/class.pdf";
proc print data=sashelp.class noobs;
run;quit;
ods pdf close;
* xpdf executables have to be in the path;
%utl_submit_wps64('
options set=R_HOME "C:/Program Files/R/R-3.3.2";
libname wrk "%sysfunc(pathname(work))";
proc r;
submit;
source("C:/Program Files/R/R-3.3.2/etc/Rprofile.site", echo=T);
library("tm");
library("slam");
file <- "d:/pdf/class.pdf";
Rpdf <- readPDF(control = list(text = "-layout"));
corpus <- VCorpus(URISource(file),
      readerControl = list(reader = Rpdf));
array <- as.data.frame(content(content(corpus)[[1]]));
colnames(array)<-"lines";
endsubmit;
import r=array data=wrk.array;
run;quit;
');
proc print data=array(where=(lines ne ' ')) width=min;
run;quit;
Obs    LINES
  1    NAME SEX AGE HEIGHT WEIGHT
  3    Alfred M   14 69.0 112.5
  5    Alice  F   13 56.5  84.0
  7    Barbara F  13 65.3  98.0
  9    Carol F    14 62.8 102.5
 11    Henry M    14 63.5 102.5
 13    James M    12 57.3  83.0
 15    Jane   F   12 59.8  84.5
 17    Janet F    15 62.5 112.5
 19    Jeffrey M  13 62.5  84.0
 21    John   M   12 59.0  99.5
 23    Joyce F    11 51.3  50.5
 25    Judy   F   14 64.3  90.0
 27    Louise F   12 56.3  77.0
 29    Mary   F   15 66.5 112.0
 31    Philip M   16 72.0 150.0
 33    Robert M   12 64.8 128.0
 35    Ronald M   15 67.0 133.0
 37    Thomas M   11 57.5  85.0
 39    William M  15 66.5 112.0
Hi rogerjdeangelis,
Thank you for the reply, but I am more interested in whether there is any way to do this within SAS Text Miner itself.
Regards,
Kaushal Solanki
In SAS Text Miner, the Text Import node can read in .pdf files.  You can view a list of supported file types that the Text Import node can read at the following URL:
   http://support.sas.com/documentation/onlinedoc/txtminer/14.2/tmref.pdf
Go to Chapter 11 Macro Variables, Macros, and Functions -> %TMFILTER Macro -> Supported Document Formats
