Hi All,
Can anyone help me with how to import and parse PDF and JPG files in SAS Text Miner?
Thanks in advance.
Regards,
Kaushal Solanki
If you have the IML interface to R, you can use Google's Tesseract. This is the tool
Google uses to digitize books.
Below are two methods: one for image files (JPEG, TIFF, PNG, BMP, GIF, and PDFs that contain only an image),
and a second one for PDFs with embedded text.
/* T0099610 OCR using SAS/WPS and Tesseract-OCR (state of the art google offering)
Administrative information
Where I originally got the Tesseract package:
https://github.com/UB-Mannheim/tesseract/wiki
A GUI is available, and you get a console program with both distributions.
I did not install the GUI.
It is also on my Google Drive (use this copy before they do a SAS-like enhancement):
https://drive.google.com/file/d/0ByX2ii2B0Rq9MmZmVVNjLXpNdkU/view?usp=sharing
You need Ghostscript to handle PDFs.
Get Ghostscript here (note that Ghostscript can combine PDF files and convert PDF to TIFF):
http://www.ghostscript.com/download/gsdnld.html
You really only need one executable.
===============================================
HAVE: SIX IMAGE FILES FROM WHICH I NEED TO EXTRACT THE TEXT
I have these image files:
png.png
bmp.bmp
jpg.jpg
tiff.tiff
gif.gif
pdftif.tiff (you need to convert the PDF to an Image file)
IMAGE LOOKS LIKE (from proc gslide)
MYSTUDY C304456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTE2
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwoWthTtl.sas
OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - 16AUG16 06:21
================================================
WANT (all of them converted to text)
MYSTUDY C304456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTE2
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwoWthTtl.sas
OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - 16AUG16 06:21
SOLUTION
Basically, it takes one command for an image and two commands for a PDF.
* convert to text example - bmp to text (Tesseract appends .txt to the output base name);
x c:/progra~2/tesseract-ocr/tesseract d:/tesser/slide.bmp d:/tesser/slide;
Note: Tesseract-OCR is not perfect.
Create the image files and convert them to text using this macro:
%macro tesser(device=png300,type=png);
  * route graphics output to an image file named after the type;
  filename outfile "d:\tesser\&type..&type";
  goptions
    reset=all
    rotate=portrait
    gsfmode = replace
    device = &device
    gsfname = outfile
    vsize=10in
    hsize=8in
    htext=2 ;
  run;quit;
  * draw a test slide with titles, notes and footnotes in several fonts;
  proc gslide ;
    title1 j=l h=2 font='Simplex' "MYSTUDY C04456" j=r "AJAX";
    title2 j=l h=2 font='Courier' "DRAFT" j=r "VER 1.0";
    title3 j=l h=2 " ";
    title4 j=c h=3.0 font='Helvetica' "Ajax Study";
    title5 j=c h=3.0 font='Arial' "Dose and Placebo";
    note j=c h=5 "NOTE1";
    note j=c h=5 "NOTE2";
    note j=c h=5 "NOTE3";
    note j=c h=5 "NOTE";
    footnote1 j=l h=2 font='Times Roman' "PGM: C:\Tut\Tut_GrfTwoWthTtl.sas ";
    footnote2 j=l h=2 font='Helvetica' "OUT: C:\Tut\Tut_GrfTwoWthTtl.pdf - &sysdate &systime";
  run;
  quit;
  * restore graphics options and release the fileref;
  goptions reset=all;
  filename outfile clear;
  run;quit;
  * convert to text (Tesseract appends .txt to the output base name);
  x c:/progra~2/tesseract-ocr/tesseract d:/tesser/&type..&type d:/tesser/&type;
%mend tesser;
%tesser(device=png300,type=png);
%tesser(device=jpeg300,type=jpg);
%tesser(device=bmp,type=bmp);
%tesser(device=tiff,type=tiff);
%tesser(device=gif,type=gif);
* Here is how to handle a PDF;
Note that PDFs often contain embedded fonts, so it is better to use another tool to extract the text directly.
There are many free tools: Acrobat (even Reader?), pdfwriter, pdf2text, Boxoft, and others in R and Python.
You can convert the PDF to an image using the free Ghostscript at http://www.ghostscript.com/download/gsdnld.html.
You really only need gswin64.exe.
Converting a PDF to a TIFF and extracting the text using Tesseract:
x C:\Progra~1\gs\gs9.19\bin\gswin64.exe -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -sOutputFile=d:/tesser/pdftif.tiff d:/tesser/pdf.pdf;
* Tesseract appends .txt to the output base name, so this writes d:/tesser/pdftif.txt;
x c:/progra~2/tesseract-ocr/tesseract d:/tesser/pdftif.tiff d:/tesser/pdftif;
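If you would rather drive the same two steps from Python (one of the alternatives mentioned above), here is a minimal sketch. The function names `ocr_commands` and `run_ocr` are made up for illustration, and it assumes `tesseract` and `gswin64` are on the PATH. The options are the ones used in the commands above, and the output is passed as a base name without an extension because Tesseract adds `.txt` itself.

```python
import shutil
import subprocess
from pathlib import PurePosixPath


def ocr_commands(source, workdir="d:/tesser",
                 tesseract="tesseract", ghostscript="gswin64"):
    """Return the command lines needed to OCR one file.

    An image needs a single Tesseract call. A PDF first needs a
    Ghostscript call that renders it to a TIFF image.
    """
    source = PurePosixPath(source)
    commands = []
    image = source
    if source.suffix.lower() == ".pdf":
        # Same naming as above: pdf.pdf becomes pdftif.tiff
        image = PurePosixPath(workdir) / (source.stem + "tif.tiff")
        commands.append([
            ghostscript, "-dNOPAUSE", "-dBATCH", "-sDEVICE=tiffg4",
            f"-sOutputFile={image}", str(source),
        ])
    # Tesseract appends ".txt" to the output base name itself.
    output_base = PurePosixPath(workdir) / image.stem
    commands.append([tesseract, str(image), str(output_base)])
    return commands


def run_ocr(source, **kwargs):
    """Run the commands, failing early if an executable is missing."""
    for command in ocr_commands(source, **kwargs):
        if shutil.which(command[0]) is None:
            raise FileNotFoundError(f"{command[0]} is not on PATH")
        subprocess.run(command, check=True)


# Show the commands without running anything.
for command in ocr_commands("d:/tesser/pdf.pdf"):
    print(" ".join(command))
```

Calling `run_ocr("d:/tesser/pdf.pdf")` would execute both steps. Building the command lists separately makes it easy to check them before anything is run.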
Text files I created (not perfect, and each may be a little different):
PNG (d:/tesser/png.txt)
=======================
MYSTU DY CO4456 AJAX
DRAFT VER 1 .0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTES
NOTE
PGM: C:\| ut\l 961w§ththas
OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21
JPG
===
MYSTUDY CO4456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21
MYSTUDY (304456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTES
NOTE
PGM: C:\| ut\l 961w§ththas
OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21
PDFTIF
======
MYSTUDY CO4456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTE2
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21
BMP
===
MYSTUDY CO4456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTE3
NOTE
PGM: C:\Tut\Tut_GrfTwthhTtl.sas
OUT: C:\Tut\Tut_GrfTwthhTtl.pdf - 16AUG16 06:21
TIFF
====
MYSTUDY (304456 AJAX
DRAFT VER 1.0
Ajax Study
Dose and Placebo
NOTE1
NOTEZ
NOTES
PGM: C:\| ut\l 961w§ththas
OUT: C:\Tut\Tut_GrF|'wo_VVthTtl.pdf - 16AUG16 06:21
GIF
===
MYSTU DY CO4456 AJAX
DRAFT VER 1 .0
Ajax Study
Dose and Placebo
NOTE1
NOTE2
NOTE3
NOTE
PGM: C:\Tut\Tut_Gn‘TwthhTt|.sas
OUT: C:\Tut\Tut_Gn‘TwthhTt|.pdf - 16AUG16 06:21
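To put a number on "not perfect", the recognized lines can be scored against the text that was drawn on the slide. This is only a sketch using Python's standard `difflib`. The helper name `line_similarity` is made up, and the sample lines are copied from the PNG and GIF results above.

```python
from difflib import SequenceMatcher


def line_similarity(expected, recognized):
    """Return one similarity ratio (0.0 to 1.0) per expected line."""
    scores = []
    for want, got in zip(expected, recognized):
        scores.append(SequenceMatcher(None, want, got).ratio())
    return scores


# Text written by proc gslide, and what Tesseract read back.
expected = ["DRAFT VER 1.0", "Ajax Study", "NOTE2", "NOTE3"]
png_text = ["DRAFT VER 1 .0", "Ajax Study", "NOTEZ", "NOTES"]
gif_text = ["DRAFT VER 1 .0", "Ajax Study", "NOTE2", "NOTE3"]

for name, text in (("png", png_text), ("gif", gif_text)):
    scores = line_similarity(expected, text)
    print(name, [round(score, 2) for score in scores])
```

A ratio of 1.0 means the line was read exactly. Lower values flag lines such as NOTEZ that need a manual look.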
/* T1001390 WPS/SAS/R: Converting PDF tables to SAS datasets (simple example)
There are more options in the R tm (text mining) package.
HAVE (PDF file with the table below)
======================================
NAME     SEX  AGE  HEIGHT  WEIGHT
Alfred   M     14      69   112.5
Alice    F     13    56.5      84
Barbara  F     13    65.3      98
Carol    F     14    62.8   102.5
Henry    M     14    63.5   102.5
James    M     12    57.3      83
Jane     F     12    59.8    84.5
Janet    F     15    62.5   112.5
Jeffrey  M     13    62.5      84
John     M     12      59    99.5
Joyce    F     11    51.3    50.5
Judy     F     14    64.3      90
Louise   F     12    56.3      77
Mary     F     15    66.5     112
Philip   M     16      72     150
Robert   M     12    64.8     128
Ronald   M     15      67     133
Thomas   M     11    57.5      85
William  M     15    66.5     112
WANT (SAS dataset)
===================
Up to 40 obs from sashelp.class total obs=19
Obs  NAME     SEX  AGE  HEIGHT  WEIGHT
  1  Alfred   M     14      69   112.5
  2  Alice    F     13    56.5      84
  3  Barbara  F     13    65.3      98
  4  Carol    F     14    62.8   102.5
  5  Henry    M     14    63.5   102.5
  6  James    M     12    57.3      83
  7  Jane     F     12    59.8    84.5
  8  Janet    F     15    62.5   112.5
  9  Jeffrey  M     13    62.5      84
 10  John     M     12      59    99.5
 11  Joyce    F     11    51.3    50.5
 12  Judy     F     14    64.3      90
 13  Louise   F     12    56.3      77
 14  Mary     F     15    66.5     112
 15  Philip   M     16      72     150
 16  Robert   M     12    64.8     128
 17  Ronald   M     15      67     133
 18  Thomas   M     11    57.5      85
 19  William  M     15    66.5     112
WORKING CODE
============
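library("tm");
file <- "d:/pdf/class.pdf";
# -layout is passed to the xpdf pdftotext program to keep the column layout
Rpdf <- readPDF(control = list(text = "-layout"));
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf));
# one row per text line of the first (only) document
classtext <- as.data.frame(content(content(corpus)[[1]]));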
FULL SOLUTION
=============
* create a pdf;
title;footnote;
ods pdf file="d:/pdf/class.pdf";
proc print data=sashelp.class noobs;
run;quit;
ods pdf close;
* xpdf executables have to be in the path;
%utl_submit_wps64('
options set=R_HOME "C:/Program Files/R/R-3.3.2";
libname wrk "%sysfunc(pathname(work))";
proc r;
submit;
source("C:/Program Files/R/R-3.3.2/etc/Rprofile.site", echo=T);
library("tm");
library("slam");
file <- "d:/pdf/class.pdf";
Rpdf <- readPDF(control = list(text = "-layout"));
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf));
array <- as.data.frame(content(content(corpus)[[1]]));
colnames(array)<-"lines";
endsubmit;
import r=array data=wrk.array;
run;quit;
');
proc print data=array(where=(lines ne ' ')) width=min;
run;quit;
Obs  LINES
  1  NAME     SEX  AGE  HEIGHT  WEIGHT
  3  Alfred   M     14    69.0   112.5
  5  Alice    F     13    56.5    84.0
  7  Barbara  F     13    65.3    98.0
  9  Carol    F     14    62.8   102.5
 11  Henry    M     14    63.5   102.5
 13  James    M     12    57.3    83.0
 15  Jane     F     12    59.8    84.5
 17  Janet    F     15    62.5   112.5
 19  Jeffrey  M     13    62.5    84.0
 21  John     M     12    59.0    99.5
 23  Joyce    F     11    51.3    50.5
 25  Judy     F     14    64.3    90.0
 27  Louise   F     12    56.3    77.0
 29  Mary     F     15    66.5   112.0
 31  Philip   M     16    72.0   150.0
 33  Robert   M     12    64.8   128.0
 35  Ronald   M     15    67.0   133.0
 37  Thomas   M     11    57.5    85.0
 39  William  M     15    66.5   112.0
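The result above is still one text column, so the lines have to be split into variables to get the table in WANT. Here is a rough sketch of that last step in Python. The function name `parse_class_lines` is made up, and it assumes whitespace-separated fields with no blanks inside a name, which holds for this table but not for PDF tables in general.

```python
def parse_class_lines(lines):
    """Split layout-preserved text lines into typed records.

    Assumes a header line followed by rows of NAME SEX AGE HEIGHT WEIGHT
    separated by whitespace. Blank lines are skipped.
    """
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    records = []
    for fields in body:
        record = dict(zip(header, fields))
        # Convert the numeric columns; NAME and SEX stay character.
        record["AGE"] = int(record["AGE"])
        record["HEIGHT"] = float(record["HEIGHT"])
        record["WEIGHT"] = float(record["WEIGHT"])
        records.append(record)
    return records


# First lines of the extracted text, including the blank lines
# that were filtered out of the proc print above.
lines = [
    "NAME SEX AGE HEIGHT WEIGHT",
    "",
    "Alfred M 14 69.0 112.5",
    "",
    "Alice F 13 56.5 84.0",
]
for record in parse_class_lines(lines):
    print(record)
```

The same split could be done in a SAS data step after the import. The point is only that the header line supplies the variable names and every later non-blank line is one observation.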
Hi rogerjdeangelis,
Thank you for the reply, but I am more interested in whether there is any way I can do this within SAS Text Miner itself.
Regards,
Kaushal Solanki
In SAS Text Miner, the Text Import node can read in .pdf files. You can view a list of supported file types that the Text Import node can read at the following URL:
http://support.sas.com/documentation/onlinedoc/txtminer/14.2/tmref.pdf
Go to Chapter 11 Macro Variables, Macros, and Functions -> %TMFILTER Macro -> Supported Document Formats