Improving a macro that parses out multiple XML files from multiple zipped folders

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

Improving a macro that parses out multiple XML files from multiple zipped folders

We've written a macro that reads in multiple zipped folders and for each folder, reads in multiple XML files and parses the XML files into records in a dataset. 

 

Here is how things look:

 

/folders/myfolders/In

                              |- data_zip1.zip

                                             |- sub_folder

                                                           |- xmlfile1.xml

                                                           |- xmlfile2.xml

                                                           ...

                                                           |- xmlfile500.xml

                              |- data_zip2.zip

                                             |- sub_folder

                                                           |- xmlfile1.xml

                                                           |- xmlfile2.xml

                                                           ...

                                                           |- xmlfile500.xml

...and so on for n number of zipped folders. 

 

Problem 1: It seems like every time we run the macro, it appends the output dataset instead of writing over it.

 

Problem 2: We also need to delete the temporary work directories and datasets after each proc append (the xml_test_&j. datasets towards the bottom of the code) so they don't clog up SAS EG and throw a hissy fit for having more than 50 datasets. 

 

See link below for our first post when we were trying to figure out how to parse out an XML file from a zipped folder. 

https://communities.sas.com/t5/General-SAS-Programming/Import-zipped-XML-files/m-p/322732#M42476 

 

options spool nonotes nosource nosource2 errors=0;
 
filename zips "/folders/myfolders/In/";
 
*get names of zip files contained within the directory;
data zips_contents;
     fid=dopen("zips");
     if fid=0 then stop;
 
     memcount = dnum(fid);
 
     do i = 1 to memcount;
           memname = dread(fid,i);
           output;
     end;
run;
 
*assign macro variables for the zip files;
** zipNOBS: number of zip files in the folder;
**&&zipname&i: name of the zip file;
proc sql noprint;
select count(*)
into :zipNObs
from zips_contents;
select memname
into :zipname1-:zipname%left(&zipNObs)
from zips_contents;
quit;
 
*check macro variables - outputs value to log;
%put count of zip files is &zipNObs;
%put name of zip files is &zipname1;
 
*macro cycles through each zip folder in the directory location;
*for each zip folder, cycle through each of the xml files, parses the xml, and append to a SAS dataset;
%macro doit;
%do i=1 %to &zipNObs;
 
filename inzip ZIP "/folders/myfolders/In/&&zipname&i..";
 
*read the name of each xml file in each zipped folder; data contents_&i.; length memname $200 ; fid=dopen("inzip"); if fid=0 then stop; memcount=dnum(fid); do i=1 to memcount; /*Scans and gives only the name of the xml file*/ memname=scan(dread(fid,i), 2, '\'); output; end; rc=dclose(fid); run; *populates dataset CONTENTS that contains a list of each of the xml files in the zip file; PROC APPEND BASE=CONTENTS DATA=CONTENTS_&i. FORCE; RUN; *assign macro variables for the xml files; ** xmlNOBS: number of xml files in the folder; **&&xmlname&i: name of the xml file; proc sql noprint; select count(*) into :xmlNObs from contents_&i.; select memname into :xmlname1-:xmlname%left(&xmlNObs) from contents_&i.; quit; *check macro variables for the xml files - outputs value to log; %put count of xml files is &xmlNObs; %put name of xml files is &xmlname1; %do j=1 %to &xmlNObs; /* identify a temp folder in the WORK directory */
/* have SAS take each xml file and put it in a temporary work directory*/ filename xl "%sysfunc(getoption(work))/&&xmlname&j" ; /* hat tip: "data _null_" on SAS-L */ data _null_; /* using member syntax here */ infile inzip(sub_folder\&&xmlname&j) lrecl=256 recfm=F length=length eof=eof unbuf; file xl lrecl=256 recfm=N; input; put _infile_ $varying256. length; return; eof: stop; run;
*use the my_map2.map as a guide for sas to parse the xml file;
*for each xml file in its temporary work directory, parse and append; filename SXLEMAP '/folders/myfolders/XML_Map/my_map2.map';
libname my_xml XMLV2 "%sysfunc(getoption(work))/&&xmlname&j.." xmlmap=sxlemap access=readonly; data xml_test_&j.; set my_xml.xml_contents; run; *xml_test contains the parsed xml file. One record per xml file; PROC APPEND BASE=xml_test DATA=xml_test_&j. FORCE; RUN; %end; %end; %mend; %doit *check and output transactions as permanent dataset; proc print data= xml_test; run; libname output "/folders/myfolders/Out"; DATA output.parsed_XML_Data; set xml_test; run;

 

 


Accepted Solutions
Solution
‎01-11-2017 10:53 PM
Occasional Contributor
Posts: 6

Re: Improving a macro that parses out multiple XML files from multiple zipped folders

I think your suggestions worked! 

 

I realized that there were two big inefficiencies that was causing problems when you pointed out we can just use a static name for the xml files / datasets:

 

1. Towards the beginning of the first do loop, we kept the datasets CONTENTS_ static since it was redundant to have both the filename "inzip" have &i as well as contents_&i since contents_ was pointing to that filename in the first place.

 

2. Similar situation in the second do loop when we cycle through the xml files. It was redundant to have both the libname "my_xml" have &j as well as xml_test_&j since xml_test_ is was pointing to the libname. 

 

Finally, throwing in the proc datasets delete routine fixed the issue of the macro appending the final dataset to itself if we run the program multiple times. 

 

Thanks everyone for all of the feedback! I'm calling this a win. If we (or anyone) finds any more oppturnities to make this macro more efficient we'll be sure to share!

 

Here is what we got: 

 

/*assign the directory where the zipped folders are to*/
/*the macro variable dir*/
%macro doit(&dir); options spool nonotes nosource nosource2 errors=0; filename zips "&dir";
/*get the names of the zipped folders in the directory*/ data zips_contents; fid=dopen("zips"); if fid=0 then stop; memcount = dnum(fid); do i = 1 to memcount; memname = dread(fid,i); output; end; run;
/*Assign zipped folders as macro variables*/
/*zopNOBS is the number of zipped folders*/
/*zipname is the name of each zipped folder*/ proc sql noprint; select count(*) into :zipNObs from zips_contents; select memname into :zipname1-:zipname%left(&zipNObs) from zips_contents; quit; proc datasets lib = work; delete xml_test; run; /*macro cycles through each zipped folder in the directory location*/
/*For each zipped folder, read in the xml file*/ %do i=1 %to &zipNObs; filename inzip ZIP "&dir/&&zipname&i"; data contents_; length memname $200 ; fid=dopen("inzip"); if fid=0 then stop; memcount=dnum(fid); do i=1 to memcount; /*Scans and gives only the name of the xml file*/ memname=scan(dread(fid,i), 2, '\'); output; end; rc=dclose(fid); run; /*populate dataset CONTENTS_ that contains a list of each of the xml files in the zipped folder*/ PROC APPEND BASE=CONTENTS DATA=CONTENTS_ FORCE; RUN; /*assign xml files as macro variables*/ /*xmlNOBS is the number of xml files in the zipped folder*/ /*xmlname is name of the xml file*/ proc sql noprint; select count(*) into :xmlNObs from contents_; select memname into :xmlname1-:xmlname%left(&xmlNObs) from contents_; quit;
/*Loop through each xml file in a zipped directory*/ %do j=1 %to &xmlNObs; /* identify a temp folder in the WORK directory */ filename xl "%sysfunc(getoption(work))/&&xmlname&j"; /*Get the xml file and set it to a temporary work directory*/ data _null_; /*using member syntax here*/
/*the name of the sub_folder then the XML file*/ infile inzip(sub_folder\&&xmlname&j) lrecl=256 recfm=F length=length eof=eof unbuf; file xl lrecl=256 recfm=N; input; put _infile_ $varying256. length; return; eof: stop; run;
/*Parse the XML file into a record in a SAS dataset*/
/*using the xml map file to help SAS parse out the xml file*/
/*that was generated using the "SAS XML mapper" application*/ filename SXLEMAP '/C:/XML_Map/my_map.map'; libname my_xml XMLV2 "%sysfunc(getoption(work))/&&xmlname&j" xmlmap=sxlemap access=readonly; data xml_test_; set my_xml.xml_contents; run; /*xml test contains the parsed xml file. One record per xml file*/ PROC APPEND BASE=xml_test DATA=xml_test_ FORCE; RUN; %end; %end; *check and output transactions as permanent dataset; proc print data= xml_test;
run;

/*Save the dataset*/ libname output "C:/Out"; DATA output.parsed_xml_data; set xml_test; run; %mend;
/*Run the macro with the directory of where the*/
/*zipped folders are saved*/ %doit(C:/In/zipped_folders)

 

View solution in original post


All Replies
Respected Advisor
Posts: 4,135

Re: Improving a macro that parses out multiple XML files from multiple zipped folders

[ Edited ]

@athomson

It would help a lot if you could attach two sample zip files each containing two sample XML's so we can actually run your code and see what happens and test whatever we suggest as improvements.

Super Contributor
Posts: 474

Re: Improving a macro that parses out multiple XML files from multiple zipped folders

[ Edited ]

Hi.
 
For problem 2:
 
make the xml_test_&j dataset a static name, like this:
 
...

 

data _xml_test;
     set my_xml.xml_contents;
run;

*xml_test contains the parsed xml file. One record per xml file;
PROC APPEND BASE=xml_test DATA=_xml_test. FORCE;
RUN;
%end;

 

...
 
This will overwrite the dataset for each cycle, so you will not need to delete it.
 
For problem 1:
 
from your code the output dataset is overwritten not appended I can assure you that, what is not overwritten is the auxiliary xml_test dataset in the WORK library. It's always appended with the contents of the xml_test_&j datasets... So by not deleting it in the beginning of the macro and running the macro several times in the same SAS session (same WORK library) you are appending everytime the result of your current run to the previous one... Since at the end you are generating the output dataset from xml_test, here's why you have that append effect.
 
To solve this, you have to delete the xml_test dataset just before cycling through the files, like this for example
 
...

 

*check macro variables for the xml files - outputs value to log;
%put count of xml files is &xmlNObs;
%put name of xml files is &xmlname1;

 

* delete xml_test;
proc datasets lib=work nolist;
     delete xml_test;
quit;

 

...

 
Hope it helps.
 
Daniel Santos @ www.cgd.pt

Solution
‎01-11-2017 10:53 PM
Occasional Contributor
Posts: 6

Re: Improving a macro that parses out multiple XML files from multiple zipped folders

I think your suggestions worked! 

 

I realized that there were two big inefficiencies that was causing problems when you pointed out we can just use a static name for the xml files / datasets:

 

1. Towards the beginning of the first do loop, we kept the datasets CONTENTS_ static since it was redundant to have both the filename "inzip" have &i as well as contents_&i since contents_ was pointing to that filename in the first place.

 

2. Similar situation in the second do loop when we cycle through the xml files. It was redundant to have both the libname "my_xml" have &j as well as xml_test_&j since xml_test_ is was pointing to the libname. 

 

Finally, throwing in the proc datasets delete routine fixed the issue of the macro appending the final dataset to itself if we run the program multiple times. 

 

Thanks everyone for all of the feedback! I'm calling this a win. If we (or anyone) finds any more oppturnities to make this macro more efficient we'll be sure to share!

 

Here is what we got: 

 

/*assign the directory where the zipped folders are to*/
/*the macro variable dir*/
%macro doit(&dir); options spool nonotes nosource nosource2 errors=0; filename zips "&dir";
/*get the names of the zipped folders in the directory*/ data zips_contents; fid=dopen("zips"); if fid=0 then stop; memcount = dnum(fid); do i = 1 to memcount; memname = dread(fid,i); output; end; run;
/*Assign zipped folders as macro variables*/
/*zopNOBS is the number of zipped folders*/
/*zipname is the name of each zipped folder*/ proc sql noprint; select count(*) into :zipNObs from zips_contents; select memname into :zipname1-:zipname%left(&zipNObs) from zips_contents; quit; proc datasets lib = work; delete xml_test; run; /*macro cycles through each zipped folder in the directory location*/
/*For each zipped folder, read in the xml file*/ %do i=1 %to &zipNObs; filename inzip ZIP "&dir/&&zipname&i"; data contents_; length memname $200 ; fid=dopen("inzip"); if fid=0 then stop; memcount=dnum(fid); do i=1 to memcount; /*Scans and gives only the name of the xml file*/ memname=scan(dread(fid,i), 2, '\'); output; end; rc=dclose(fid); run; /*populate dataset CONTENTS_ that contains a list of each of the xml files in the zipped folder*/ PROC APPEND BASE=CONTENTS DATA=CONTENTS_ FORCE; RUN; /*assign xml files as macro variables*/ /*xmlNOBS is the number of xml files in the zipped folder*/ /*xmlname is name of the xml file*/ proc sql noprint; select count(*) into :xmlNObs from contents_; select memname into :xmlname1-:xmlname%left(&xmlNObs) from contents_; quit;
/*Loop through each xml file in a zipped directory*/ %do j=1 %to &xmlNObs; /* identify a temp folder in the WORK directory */ filename xl "%sysfunc(getoption(work))/&&xmlname&j"; /*Get the xml file and set it to a temporary work directory*/ data _null_; /*using member syntax here*/
/*the name of the sub_folder then the XML file*/ infile inzip(sub_folder\&&xmlname&j) lrecl=256 recfm=F length=length eof=eof unbuf; file xl lrecl=256 recfm=N; input; put _infile_ $varying256. length; return; eof: stop; run;
/*Parse the XML file into a record in a SAS dataset*/
/*using the xml map file to help SAS parse out the xml file*/
/*that was generated using the "SAS XML mapper" application*/ filename SXLEMAP '/C:/XML_Map/my_map.map'; libname my_xml XMLV2 "%sysfunc(getoption(work))/&&xmlname&j" xmlmap=sxlemap access=readonly; data xml_test_; set my_xml.xml_contents; run; /*xml test contains the parsed xml file. One record per xml file*/ PROC APPEND BASE=xml_test DATA=xml_test_ FORCE; RUN; %end; %end; *check and output transactions as permanent dataset; proc print data= xml_test;
run;

/*Save the dataset*/ libname output "C:/Out"; DATA output.parsed_xml_data; set xml_test; run; %mend;
/*Run the macro with the directory of where the*/
/*zipped folders are saved*/ %doit(C:/In/zipped_folders)

 

Occasional Contributor
Posts: 6

Re: Improving a macro that parses out multiple XML files from multiple zipped folders

I was going to make up some zipped XML files and upload, but it looked like we got a solution before I had the chance.

I'd share the actual zipped xml files we're using, but its protected and confidential data.
Super Contributor
Posts: 474

Re: Improving a macro that parses out multiple XML files from multiple zipped folders

Well, there is always room for improvement I think.

 

You could reduce the part of gathering the xml names into a single data step, avoiding all those temporary datasets.

 

Something like this for example:

 

* get zips and xmls into macro vars

  &zipNobs = total zip files in dir
  &&xmlNobs_&I = total xml files in zip(&I)
  &&zipName_&I = zip(&I)) name
  &&xmlName_&I._&J = xml(&J) name in zip(&I);

 

data _null_;

     * open dir;
     _RC=filename('zips',"&dir");
     _FIDDIR=dopen("zips");
     if not _FIDDIR then stop;

     * cycle through dir elements;
     _DIRCOUNT = dnum(_FIDDIR);
     do _I = 1 to _DIRCOUNT;


        * get zip name into macro var;
        _ZIPNAME=dread(_FIDDIR,_I);
        call symput(cats('zipName_',put(_I,best.)),_ZIPNAME);

 

        * open zip;
        _RC=filename('inzip',_ZIPNAME);
        _FIDZIP=dopen("inzip");
        if not _FIDZIP then stop;

 

        * cycle through zip elements;
        _ZIPCOUNT=dnum(_FIDZIP);
        do _J=1 to _ZIPCOUNT;


           * get xml name into macro var;
           _XMLNAME=scan(dread(_FIDZIP,_J),2,'\');
           call symput(catx('_','xmlName_',put(_I,best.),put(_J,best.)),_XMLNAME);
        end;

 

        _RC=close(_FIDZIP);
        call symputn(cats('xmlNObs_',put(_I,best.)),_J); * get total xml files in zip;

     end;


     _RC=close(_FIDDIR);
     call symputn('zipNObs',_I); * get total zip files in dir;
run;

 

%put _all_;

 

But if you could provide us with the zipped xml files, that would help to give you a more accurate and optimized solution. 

 

I understand it's confidentional data and there's no need for that, but maybe some kind of dummy xml files would greatly help.

 

Daniel Santos @ www.cgd.pt

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 342 views
  • 2 likes
  • 3 in conversation