I followed this blog post: https://blogs.sas.com/content/sgf/2023/11/08/extract-text-from-a-pdf-file-using-sas-viya/
but it is not working for me. @pstyliadis
I am running this program:
%let path = /Projects/Extract text from PDFs and create tables/mypdfs;
%put &=path;
proc cas;
   file log;
   /* Drop the caslib if it already exists */
   table.dropCaslib /
      caslib='ac_pdf' quiet=true;
run;
proc cas;
   session mySession;
   /* Add a path-based caslib that points to the PDF folder */
   table.addCaslib /
      caslib="ac_pdf"
      description="pdf files"
      dataSource={srctype="path"}
      path="&path" subdirs=true;
run;
proc casutil;
   list files incaslib='ac_pdf';
quit;
proc casutil;
   load casdata=''   /* To read in all files use an empty string. For a single file specify the file name */
      incaslib='ac_pdf'   /* The location of the files to load */
      importoptions=(fileType="document" fileExtList='PDF' tikaConv=True)   /* Specify document import options */
      casout='pdf_data' outcaslib='casuser' replace;   /* Specify the output CAS table info */
quit;
I get the following errors in the log:
When I upload my PDFs to a caslib via SFTP, the following code works.
The only exception is that it throws the following note, but I think it should work once the admin resolves it.
proc cas;
   session mySession;
   table.dropCaslib / caslib='_TMPCAS_' quiet=true;
   table.dropCaslib / caslib='_LOADTMP' quiet=true;
run;
/*** Macro variable setup ***/
/* Specify the file path to your PDF files (the variable names are carried over from an image-loading example) */
%let imagePath = /caslibs/akaike/my_pdf/;
/* Caslib and table names carried over from the image example; the PDF load below does not reference them */
%let imageCaslibName = casuser;
%let imageTableName = images;
/* Also carried over from the image example and not referenced below */
%let imageTrainingCaslibName = &imageCaslibName;
%let imageTrainingTableName = &imageTableName.Augmented;
proc cas;
   file log;
   table.dropCaslib /
      caslib='loadPDFTempCaslib' quiet=true;
run;
/*** Load the PDF files ***/
/* Create temporary caslib and librefs for loading the PDF files */
caslib loadPDFTempCaslib datasource=(srctype="path") path="&imagePath"
   subdirs notactive sessref=mySession;
libname _loadtmp cas caslib="loadPDFTempCaslib";
libname _tmpcas_ cas caslib="CASUSER";
proc casutil;
   list files incaslib='loadPDFTempCaslib';
quit;
proc casutil;
   load casdata=''   /* To read in all files use an empty string. For a single file specify the file name */
      incaslib='loadPDFTempCaslib'   /* The location of the files to load */
      importoptions=(fileType="document" fileExtList='PDF' tikaConv=True)   /* Specify document import options */
      casout='pdf_data' outcaslib='casuser' replace;   /* Specify the output CAS table info */
quit;