How to import pdf and jpg files in SAS Text Miner for parsing

Contributor
Posts: 57

Hi All,

Can anyone help me with how to import and parse PDF and JPG files in SAS Text Miner?

Thanks in advance.

Regards,

Kaushal Solanki

Valued Guide
Posts: 505

Re: How to import pdf and jpg files in SAS Text Miner for parsing

If you have the IML interface to R, you can use Google's Tesseract. This is the tool
Google uses to digitize books.

Below are two methods. The first is for JPEG, TIFF, PNG, BMP, GIF, and PDF (if the PDF is an image).

The second is for PDFs with embedded text.


/* T0099610 OCR using SAS/WPS and Tesseract-OCR (state of the art google offering)

Administrative information

Where I originally got the Tesseract package:
https://github.com/UB-Mannheim/tesseract/wiki

There is a GUI available, and you get a console with both distributions.
I did not install the GUI.

Also located on my Google Drive (use this before they do a SAS-like enhancement):
https://drive.google.com/file/d/0ByX2ii2B0Rq9MmZmVVNjLXpNdkU/view?usp=sharing

You need Ghostscript to handle PDFs (Ghostscript can combine PDF files and can convert PDF to TIFF).
Get it here:
http://www.ghostscript.com/download/gsdnld.html
You really only need one executable.

===============================================

HAVE SIX IMAGE FILES FROM WHICH I NEED TO EXTRACT THE TEXT

I have these image files

 png.png
 bmp.bmp
 jpg.jpg
 tiff.tiff
 gif.gif

 pdftif.tiff  (you need to convert the PDF to an Image file)

IMAGE LOOKS LIKE (from proc gslide)

MYSTUDY C304456                               AJAX
DRAFT                                             VER 1.0

                      Ajax Study
                   Dose and Placebo

                         NOTE1
                         NOTE2
                         NOTE3
                         NOTE


PGM: C:\Tut\Tut_GrfTwoWthTtl.sas
OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - 16AUG16 06:21

================================================

WANT ( All of them converted to text)

MYSTUDY C304456 AJAX
DRAFT VER 1.0

Ajax Study
Dose and Placebo

NOTE1
NOTE2
NOTE3
NOTE


PGM: C:\Tut\Tut_GrfTwoWthTtl.sas
OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - 16AUG16 06:21

SOLUTION

Basically it is one command for an image and two commands for a PDF.

  * convert to text example - bmp to text;
  * tesseract appends .txt to the output base name, so this writes d:/tesser/slide.txt;

  x c:/progra~2/tesseract-ocr/tesseract d:/tesser/slide.bmp d:/tesser/slide;

 Note: Tesseract-OCR is not perfect.
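
The same one-line call can also be scripted outside SAS. Below is a minimal Python sketch, not part of the original post, that builds the Tesseract command line for one image. The function names `build_tesseract_command` and `ocr_image` and the default executable name `tesseract` are assumptions made for illustration. It relies on Tesseract appending `.txt` to the output base name it is given.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, exe="tesseract"):
    """Return (command, text_path) for OCR-ing one image with the Tesseract CLI."""
    image = Path(image_path)
    # Tesseract takes an output *base* name and appends ".txt" itself.
    output_base = image.with_suffix("")
    text_path = output_base.parent / (output_base.name + ".txt")
    return [exe, str(image), str(output_base)], text_path


def ocr_image(image_path, exe="tesseract"):
    """Run Tesseract on one image and return the recognized text."""
    command, text_path = build_tesseract_command(image_path, exe)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return text_path.read_text(encoding="utf-8")
```

With Tesseract installed, `ocr_image("d:/tesser/slide.bmp", exe="c:/progra~2/tesseract-ocr/tesseract")` would mirror the X command above. Passing the command as a list avoids shell quoting problems with paths that contain spaces.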

Create the image files and convert them to text using this macro:

%macro tesser(device=png300,type=png);

 filename outfile "d:\tesser\&type..&type";
 goptions
    reset=all
    rotate=portrait
    gsfmode = replace
    device  = &device
    gsfname = outfile
    vsize=10in
    hsize=8in
    htext=2 ;
  run;quit;

  proc gslide ;

   title1 j=l h=2 font='Simplex' "MYSTUDY C04456" j=r "AJAX";
   title2 j=l h=2 font='Courier' "DRAFT" j=r "VER 1.0";
   title3 j=l h=2 " ";
   title4 j=c h=3.0 font='Helvetica' "Ajax Study";
   title5 j=c h=3.0 font='Arial' "Dose and Placebo";
   note j=c h=5 "NOTE1";
   note j=c h=5 "NOTE2";
   note j=c h=5 "NOTE3";
   note j=c h=5 "NOTE";
   footnote1 j=l h=2 font='Times Roman' "PGM: C:\Tut\Tut_GrfTwoWthTtl.sas ";
   footnote2 j=l h=2 font='Helvetica'   "OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - &sysdate &systime";

  run;
  quit;

  goptions reset=all;
  filename outfile clear;
  run;quit;

 ***          *****  *****  *   *  *****
*   *           *    *      *   *    *
    *           *    *       * *     *
  **            *    ****     *      *
 *              *    *       * *     *
*               *    *      *   *    *
*****           *    *****  *   *    *

#! 2aTEXT ;

  * convert to text (tesseract appends .txt to the output base name);
  x c:/progra~2/tesseract-ocr/tesseract d:/tesser/&type..&type d:/tesser/&type;

%mend tesser;


%tesser(device=png300,type=png);
%tesser(device=jpeg300,type=jpg);
%tesser(device=bmp,type=bmp);
%tesser(device=tiff,type=tiff);
%tesser(device=gif,type=gif);

* Here is how to handle a PDF;

Note: PDFs often contain embedded text and fonts, so it is usually better to use another tool to extract the text directly.
There are many free tools: Acrobat (even Reader?), pdfwriter, pdf2text, Boxoft, and others in R and Python.

You can convert the PDF to an image using the free Ghostscript at http://www.ghostscript.com/download/gsdnld.html.
You really only need gswin64.exe.

Converting a PDF to a TIFF and extracting the text using Tesseract:

x C:\Progra~1\gs\gs9.19\bin\gswin64.exe -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=d:/tesser/pdftif.tiff d:/tesser/pdf.pdf;
* tesseract appends .txt to the output base name, so this writes d:/tesser/pdftif.txt;
x c:/progra~2/tesseract-ocr/tesseract d:/tesser/pdftif.tiff d:/tesser/pdftif;
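
The two-step PDF route (Ghostscript to TIFF, then Tesseract to text) can be sketched in Python as well. This is only an illustration and not from the original post: the function names `pdf_ocr_commands` and `ocr_pdf` and the default executable names are assumptions, while the Ghostscript flags are the ones used in the X command above.

```python
import subprocess
from pathlib import Path


def pdf_ocr_commands(pdf_path, gs_exe="gswin64", tesseract_exe="tesseract"):
    """Return the Ghostscript and Tesseract command lines for one image-only PDF."""
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")      # intermediate image written by Ghostscript
    output_base = tiff.with_suffix("")   # Tesseract appends ".txt" to this base name
    gs_command = [
        gs_exe, "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4",
        f"-sOutputFile={tiff}", str(pdf),
    ]
    ocr_command = [tesseract_exe, str(tiff), str(output_base)]
    return gs_command, ocr_command


def ocr_pdf(pdf_path, **executables):
    """Run both steps in order and return the path of the resulting text file."""
    for command in pdf_ocr_commands(pdf_path, **executables):
        subprocess.run(command, check=True)  # stop at the first failing step
    return Path(pdf_path).with_suffix(".txt")
```

Keeping the command construction separate from the execution makes it easy to print and inspect the exact command lines before running them.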


TEXT files I created (not perfect, and each may be a little different)

/*

****   *   *   ***
*   *  **  *  *   *
*   *  * * *  *
****   *  **  * ***
*      *   *  *   *
*      *   *  *   *
*      *   *   ***

#! PNG ;


d:/tesser/png.txt

MYSTU DY CO4456 AJAX
DRAFT VER 1 .0

Ajax Study
Dose and Placebo

NOTE1
NOTEZ
NOTES
NOTE

PGM: C:\| ut\l 961w§ththas

OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21

    *  ****    ***
    *  *   *  *   *
    *  *   *  *
    *  ****   * ***
    *  *      *   *
*   *  *      *   *
 ***   *       ***

#! JPG ;


MYSTUDY CO4456 AJAX
DRAFT VER 1.0

Ajax Study
Dose and Placebo

NOTE1

NOTEZ

NOTE3

NOTE

PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21
MYSTUDY (304456 AJAX
DRAFT VER 1.0

Ajax Study
Dose and Placebo

NOTE1
NOTEZ
NOTES
NOTE

PGM: C:\| ut\l 961w§ththas

OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21


****   ****   *****  *****  *****  *****
*   *   *  *  *        *      *    *
*   *   *  *  *        *      *    *
****    *  *  ****     *      *    ****
*       *  *  *        *      *    *
*       *  *  *        *      *    *
*      ****   *        *    *****  *


MYSTUDY CO4456 AJAX
DRAFT VER 1.0

Ajax Study
Dose and Placebo

NOTE1

NOTE2

NOTE3

NOTE

PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21

****   *   *  ****
 *  *  ** **  *   *
 *  *  * * *  *   *
 ***   *   *  ****
 *  *  *   *  *
 *  *  *   *  *
****   *   *  *

#! BMP ;

MYSTUDY CO4456 AJAX
DRAFT VER 1.0

Ajax Study
Dose and Placebo

NOTE1

NOTEZ

NOTE3
NOTE

PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21

*****  *****  *****  *****
  *      *    *      *
  *      *    *      *
  *      *    ****   ****
  *      *    *      *
  *      *    *      *
  *    *****  *      *

#! TIFF ;

MYSTUDY (304456 AJAX
DRAFT VER 1.0

Ajax Study
Dose and Placebo

NOTE1
NOTEZ
NOTES

PGM: C:\| ut\l 961w§ththas

OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21

 ***   *****  *****
*   *    *    *
*        *    *
* ***    *    ****
*   *    *    *
*   *    *    *
 ***   *****  *

#! GIF ;


MYSTU DY CO4456 AJAX
DRAFT VER 1 .0

Ajax Study
Dose and Placebo

NOTE1

NOTE2

NOTE3
NOTE

PGM: C:\Tut\Tut_Gn‘TwthhTt|.sas
OUT: C:\Tut\Tut_Gn‘TwthhTt|.pdf - 16AUG16 06:21
MYSTU DY CO4456 AJAX
DRAFT VER 1 .0

Ajax Study
Dose and Placebo

NOTE1

NOTE2

NOTE3
NOTE

PGM: C:\Tut\Tut_Gn‘TwthhTt|.sas
OUT: C:\Tut\Tut_Gn‘TwthhTt|.pdf - 16AUG16 06:21   
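
Since the results differ by image format, it can help to put a number on how close each text file is to the expected text. The following is a rough Python sketch, not from the original post, using the standard library's `difflib`. The function names `normalize` and `ocr_similarity` are made up for illustration, and the sample strings are short excerpts typed in by hand.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse runs of whitespace so layout differences are not counted as errors."""
    return " ".join(text.split())


def ocr_similarity(expected, recognized):
    """Return a similarity ratio between 0 and 1 for expected versus OCR text."""
    matcher = SequenceMatcher(
        None, normalize(expected), normalize(recognized), autojunk=False
    )
    return matcher.ratio()


# Hand-typed excerpts: the text sent to the slide and one OCR result.
expected = "MYSTUDY C04456 AJAX\nDRAFT VER 1.0\nNOTE2"
recognized = "MYSTU DY CO4456 AJAX\nDRAFT VER 1 .0\nNOTEZ"
print(f"similarity: {ocr_similarity(expected, recognized):.2f}")
```

A ratio of 1.0 means the texts match after whitespace normalization. Lower values flag formats or settings that need a closer look, although the ratio alone does not say which characters were misread.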


/* T1001390 WPS/SAS/R: Converting PDF tables to SAS datasets (simple example)

There are more options in the R tm (text mining) package.

HAVE ( PDF file with the table below)
======================================

  NAME        SEX     AGE    HEIGHT   WEIGHT

  Alfred       M       14        69    112.5
  Alice        F       13      56.5       84
  Barbara      F       13      65.3       98
  Carol        F       14      62.8    102.5
  Henry        M       14      63.5    102.5
  James        M       12      57.3       83
  Jane         F       12      59.8     84.5
  Janet        F       15      62.5    112.5
  Jeffrey      M       13      62.5       84
  John         M       12        59     99.5
  Joyce        F       11      51.3     50.5
  Judy         F       14      64.3       90
  Louise       F       12      56.3       77
  Mary         F       15      66.5      112
  Philip       M       16        72      150
  Robert       M       12      64.8      128
  Ronald       M       15        67      133
  Thomas       M       11      57.5       85
  William      M       15      66.5      112

WANT  (SAS dataset)
===================

Up to 40 obs from sashelp.class total obs=19

Obs    NAME        SEX    AGE   HEIGHT  WEIGHT

  1    Alfred       M      14       69   112.5
  2    Alice        F      13     56.5      84
  3    Barbara      F      13     65.3      98
  4    Carol        F      14     62.8   102.5
  5    Henry        M      14     63.5   102.5
  6    James        M      12     57.3      83
  7    Jane         F      12     59.8    84.5
  8    Janet        F      15     62.5   112.5
  9    Jeffrey      M      13     62.5      84
 10    John         M      12       59    99.5
 11    Joyce        F      11     51.3    50.5
 12    Judy         F      14     64.3      90
 13    Louise       F      12     56.3      77
 14    Mary         F      15     66.5     112
 15    Philip       M      16       72     150
 16    Robert       M      12     64.8     128
 17    Ronald       M      15       67     133
 18    Thomas       M      11     57.5      85
 19    William      M      15     66.5     112


WORKING CODE
============

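  library("tm");   # provides readPDF, VCorpus, URISource and content
  file <- "d:/pdf/class.pdf";
  # "-layout" asks pdftotext to keep the column layout of the page
  Rpdf <- readPDF(control = list(text = "-layout"));
  corpus <- VCorpus(URISource(file),
        readerControl = list(reader = Rpdf));
  # one row per extracted text line of the first (only) document
  classtext <- as.data.frame(content(content(corpus)[[1]]));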


FULL SOLUTION
=============

* create a pdf;
title;footnote;
ods pdf file="d:/pdf/class.pdf";
proc print data=sashelp.class noobs;
run;quit;
ods pdf close;

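* xpdf executables (for example pdftotext) have to be in the path;
* %utl_submit_wps64 appears to be a user-written macro (not shown in this post) that submits the quoted code to WPS;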
%utl_submit_wps64('
options set=R_HOME "C:/Program Files/R/R-3.3.2";
libname wrk "%sysfunc(pathname(work))";
proc r;
submit;
source("C:/Program Files/R/R-3.3.2/etc/Rprofile.site", echo=T);
library("tm");
library("slam");
file <- "d:/pdf/class.pdf";
Rpdf <- readPDF(control = list(text = "-layout"));
corpus <- VCorpus(URISource(file),
      readerControl = list(reader = Rpdf));
array <- as.data.frame(content(content(corpus)[[1]]));
colnames(array)<-"lines";
endsubmit;
import r=array data=wrk.array;
run;quit;
');

proc print data=array(where=(lines ne ' ')) width=min;
run;quit;

Obs    LINES

  1    NAME SEX AGE HEIGHT WEIGHT
  3    Alfred M   14 69.0 112.5
  5    Alice  F   13 56.5  84.0
  7    Barbara F  13 65.3  98.0
  9    Carol F    14 62.8 102.5
 11    Henry M    14 63.5 102.5
 13    James M    12 57.3  83.0
 15    Jane   F   12 59.8  84.5
 17    Janet F    15 62.5 112.5
 19    Jeffrey M  13 62.5  84.0
 21    John   M   12 59.0  99.5
 23    Joyce F    11 51.3  50.5
 25    Judy   F   14 64.3  90.0
 27    Louise F   12 56.3  77.0
 29    Mary   F   15 66.5 112.0
 31    Philip M   16 72.0 150.0
 33    Robert M   12 64.8 128.0
 35    Ronald M   15 67.0 133.0
 37    Thomas M   11 57.5  85.0
 39    William M  15 66.5 112.0
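
Note that the result above is still one text line per row, not separate columns. Splitting the lines into fields is a small extra step. Here is a minimal Python sketch of that step, not from the original post: the function name `parse_class_lines` is hypothetical, and it assumes exactly the five columns shown above, with no spaces inside a name.

```python
def parse_class_lines(lines):
    """Turn whitespace-separated text lines into a list of row dictionaries."""
    rows = []
    header_seen = False
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip the blank lines between rows
        if not header_seen:
            header_seen = True  # first non-blank line holds the column names
            continue
        if len(fields) != 5:
            continue  # ignore anything that is not a complete data row
        name, sex, age, height, weight = fields
        rows.append({
            "name": name,
            "sex": sex,
            "age": int(age),
            "height": float(height),
            "weight": float(weight),
        })
    return rows


# A few lines in the same shape as the extracted PDF text above.
lines = [
    "NAME SEX AGE HEIGHT WEIGHT",
    "",
    "Alfred M   14 69.0 112.5",
    "",
    "Alice  F   13 56.5  84.0",
]
for row in parse_class_lines(lines):
    print(row)
```

Splitting on whitespace works here because no value contains a blank. A table with multi-word values would need fixed column positions instead.
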

SAS Employee
Posts: 12

Re: How to import pdf and jpg files in SAS Text Miner for parsing

In SAS Text Miner, the Text Import node can read .pdf files. The list of file types that the Text Import node supports is in the SAS Text Miner reference:

   http://support.sas.com/documentation/onlinedoc/txtminer/14.2/tmref.pdf

Go to Chapter 11 Macro Variables, Macros, and Functions -> %TMFILTER Macro -> Supported Document Formats