All,
I get multiple zipped folders that contain XML files. I want to "point" SAS into the zipped folder and parse out the XML files into records in a SAS dataset.
Here's a diagram of what I'm dealing with:
data.zip
|__sub_folder |__ file1.xml |__ file2.xml
...
|__file500.xml
I know I can use a FILENAME with the ZIP option then read in the contents of a zipped folder:
filename inzip ZIP "C:\input_data\data.zip";
data folder_contents;
length memname $200 isFolder 8;
fid=dopen("inzip");
if fid=0 then
stop;
memcount=dnum(fid);
do i=1 to memcount;
/*Scans and gives only the name of the xml file*/
memname=scan(dread(fid,i), 2, '\');
/*check for trailing / in folder name */
output;
end;
rc=dclose(fid);
run;
title "Files in the ZIP file";
proc print data=folder_contents noobs N;
run;
And I know I can use the LIBNAME engine with the XML option (in this case XMLV2) with an XML map (made from the XML Mapper) to read in XML files into a dataset:
filename SXLEMAP "C:\xml_map\my_map2.map";
libname my_xml_file XMLV2 "C:\XML_files\test.xml" xmlmap = SXLEMAP;
libname out "C:\output";
data out.xml_contents;
set my_xml_file.test;
run;
How do I get a LIBNAME pointing "inside" a zipped folder, if I have to use the FILENAME engine to look inside said folder?
Right now, I believe we can only use the ZIP options in the FILENAME engine, not the LIBNAME.
I want to avoid unzipping since it takes a while to unzip all of the XML files (and IT at my agency has strong restrictions on using X commands. Otherwise, I'd use X commands to move the XML files from the zipped folder to a staging folder, then use a macro to import the XML files).
Got it!
Here is what I got:
filename inzip ZIP "C:\input_data\data.zip";
data folder_contents;
length memname $200 isFolder 8;
fid=dopen("inzip");
if fid=0 then
stop;
memcount=dnum(fid);
do i=1 to memcount;
/*Scans and gives only the name of the xml file*/
memname=scan(dread(fid,i), 2, '\');
output;
end;
rc=dclose(fid);
run;
/* create a report of the ZIP contents */
title "Files in the ZIP file";
proc print data=folder_contents noobs N;
run;
/* identify a temp folder in the WORK directory */
filename xl "%sysfunc(getoption(work))/file1.xml" ;
/* hat tip: "data _null_" on SAS-L */
data _null_;
/* using member syntax here */
infile inzip(sub_folder\file1.xml)
lrecl=256
recfm=F
length=length
eof=eof unbuf;
file xl lrecl=256 recfm=N;
input;
put _infile_ $varying256. length;
return;
eof:
stop;
run;
filename SXLEMAP "C:\xml_map\my_map2.map';
libname my_xml XMLV2 "%sysfunc(getoption(work))/file1.xml" xmlmap=sxlemap access=readonly;
libname out "C:\Output";
data out.xml_contents;
set my_xml.test;
run;
It's the intermediate data _null_ step I'm a little shaky on. So data step basically tells SAS to pluck the xml file from inside the zipped folder and stick it in a temporary work directory? This then allows me to set a LIBNAME on that file in the temporary work directory and leverage the XML engine to parse out the XML?
You're almost there. You just need to add a middle step in which you use DATA step to read the XML file out of the zip archive and then write it to a temp space in your session. Then you can use LIBNAME XML2 to read the XML as data.
I have a similar example with an Excel file in this blog post.
Got it!
Here is what I got:
filename inzip ZIP "C:\input_data\data.zip";
data folder_contents;
length memname $200 isFolder 8;
fid=dopen("inzip");
if fid=0 then
stop;
memcount=dnum(fid);
do i=1 to memcount;
/*Scans and gives only the name of the xml file*/
memname=scan(dread(fid,i), 2, '\');
output;
end;
rc=dclose(fid);
run;
/* create a report of the ZIP contents */
title "Files in the ZIP file";
proc print data=folder_contents noobs N;
run;
/* identify a temp folder in the WORK directory */
filename xl "%sysfunc(getoption(work))/file1.xml" ;
/* hat tip: "data _null_" on SAS-L */
data _null_;
/* using member syntax here */
infile inzip(sub_folder\file1.xml)
lrecl=256
recfm=F
length=length
eof=eof unbuf;
file xl lrecl=256 recfm=N;
input;
put _infile_ $varying256. length;
return;
eof:
stop;
run;
filename SXLEMAP "C:\xml_map\my_map2.map';
libname my_xml XMLV2 "%sysfunc(getoption(work))/file1.xml" xmlmap=sxlemap access=readonly;
libname out "C:\Output";
data out.xml_contents;
set my_xml.test;
run;
It's the intermediate data _null_ step I'm a little shaky on. So data step basically tells SAS to pluck the xml file from inside the zipped folder and stick it in a temporary work directory? This then allows me to set a LIBNAME on that file in the temporary work directory and leverage the XML engine to parse out the XML?
@athomson - Exactly! Good job!
Hi Chris (and anyone else reading this post)!
We've made some pretty good progress so far in trying to "macro-tize" the above so that it will read into multiple zipped folders and then parse out multiple XML files into records in a dataset.
See the post below to offer any thoughts how we can overcome a couple of problems, and even make the macro run more efficiently!
Thanks!
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.