We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Extracting Excel Files via the XML components

by Super User on ‎07-09-2014 06:24 PM - edited on ‎10-05-2015 03:45 PM by Community Manager (675 Views)

There have been a few questions lately on extracting data from ill formatted excel files and its a problem I've run into before. Typically I've used a variety of methods from manually going through these files to saving as CSV and parsing.

 

There's a new method that's been suggested that uses the fact that the new XLSX files are really zipped XML files. So if you change the extension of the file to a .zip and then expand the file you'll see the components. Then you can proceed to extract the data from the XML files.

 

Here's a sample program that demonstrates this, with the following assumptions:

1. Windows 7

2. 7-Zip

3. Ability to run X-Commands

 

Steps:

 

1. Rename excel file to .zip

2. Expand using 7zip commands

3. Read in all sheets, store with some naming convention

4. Clean up /*TO DO*/

 

Parameters:

1. excelfile: Path to XLSX file WITHOUT extension (.xlsx)

 

 

 

 

 

 

*Path to Excel file without extension;

%let excelfile=C:\_LOCALdata\Temp\Example1;

 

%put &excelfile;

 

*Copy file with zip extension;

x copy "&excelfile..xlsx" "&excelfile..zip";

 

 

*Extract zip file;

options mprint symbolgen xwait;;

 

%sysexec "C:\Program Files\7-Zip\7z.exe" x "&excelfile..zip" -o"&excelfile.";

 

 

*Import the Shared Strings file first as this is all the text in the document;

filename myxml "&excelfile.\xl\sharedStrings.xml";

filename mymap "&excelfile.\xl\xmlexample_generate.map";

 

libname myxml xmlv2 automap=replace xmlmap=mymap;

 

proc contents data=myxml._all_;

run;

 

data si;

    set myxml.t;

run;

 

libname myxml;

filename myxml clear;

filename mymap clear;

 

data si_fmt;

    set si;

    start=put(t_ordinal, 8. -l);

    label=T;

    fmtname="stringFMT";

    type="N";

run;

 

proc format cntlin=si_fmt;

run;

 

 

*code to reach each worksheet in, used in next step;

%macro import_worksheet(sheet=, fout=);

 

*Map to each worksheet and import;

filename myxml "&excelfile.\xl\worksheets\&sheet.";

filename mymap "&excelfile.\xl\worksheets\sample.map";

 

libname myxml xmlv2 automap=replace xmlmap=mymap;

 

proc contents data=myxml._all_;

run;

 

data values;

    set myxml.c;

run;

 

data valuest;

    set myxml.c;

run;

 

 

libname myxml;

filename myxml clear;

filename mymap clear;

 

 

data &fout;

    set values;

    if t='s' then value=put(v, stringfmt.);

    else value=v;

    keep row_ordinal c_ordinal r value;

run;

 

 

 

%mend import_worksheet;

 

*count the number of sheets to import;

FILENAME FILES PIPE "DIR &excelfile.\xl\worksheets\*.xml /B";

 

DATA List_Sheets;

    INFILE FILES TRUNCOVER END=LAST;

    INPUT SNAME $200.;

   

RUN;

filename files clear;

 

*Import each sheet;

Data _null_;

    set list_sheets;

    sheet_name=catx("_", scan("&excelfile", -1, "\"), scan(sname, 1, "."));

    call execute('%import_worksheet(sheet='||sname||",fout="||sheet_name||")");

run;

Contributors
Your turn
Sign In!

Want to write an article? Sign in with your profile.