BookmarkSubscribeRSS Feed

Extracting Excel Files via the XML components

Started ‎07-09-2014 by
Modified ‎10-05-2015 by
Views 2,114

There have been a few questions lately on extracting data from ill formatted excel files and its a problem I've run into before. Typically I've used a variety of methods from manually going through these files to saving as CSV and parsing.

 

There's a new method that's been suggested that uses the fact that the new XLSX files are really zipped XML files. So if you change the extension of the file to a .zip and then expand the file you'll see the components. Then you can proceed to extract the data from the XML files.

 

Here's a sample program that demonstrates this, with the following assumptions:

1. Windows 7

2. 7-Zip

3. Ability to run X-Commands

 

Steps:

 

1. Rename excel file to .zip

2. Expand using 7zip commands

3. Read in all sheets, store with some naming convention

4. Clean up /*TO DO*/

 

Parameters:

1. excelfile: Path to XLSX file WITHOUT extension (.xlsx)

 

 

 

 

 

 

*Path to Excel file without extension;

%let excelfile=C:\_LOCALdata\Temp\Example1;

 

%put &excelfile;

 

*Copy file with zip extension;

x copy "&excelfile..xlsx" "&excelfile..zip";

 

 

*Extract zip file;

options mprint symbolgen xwait;;

 

%sysexec "C:\Program Files\7-Zip\7z.exe" x "&excelfile..zip" -o"&excelfile.";

 

 

*Import the Shared Strings file first as this is all the text in the document;

filename myxml "&excelfile.\xl\sharedStrings.xml";

filename mymap "&excelfile.\xl\xmlexample_generate.map";

 

libname myxml xmlv2 automap=replace xmlmap=mymap;

 

proc contents data=myxml._all_;

run;

 

data si;

    set myxml.t;

run;

 

libname myxml;

filename myxml clear;

filename mymap clear;

 

data si_fmt;

    set si;

    start=put(t_ordinal, 8. -l);

    label=T;

    fmtname="stringFMT";

    type="N";

run;

 

proc format cntlin=si_fmt;

run;

 

 

*code to reach each worksheet in, used in next step;

%macro import_worksheet(sheet=, fout=);

 

*Map to each worksheet and import;

filename myxml "&excelfile.\xl\worksheets\&sheet.";

filename mymap "&excelfile.\xl\worksheets\sample.map";

 

libname myxml xmlv2 automap=replace xmlmap=mymap;

 

proc contents data=myxml._all_;

run;

 

data values;

    set myxml.c;

run;

 

data valuest;

    set myxml.c;

run;

 

 

libname myxml;

filename myxml clear;

filename mymap clear;

 

 

data &fout;

    set values;

    if t='s' then value=put(v, stringfmt.);

    else value=v;

    keep row_ordinal c_ordinal r value;

run;

 

 

 

%mend import_worksheet;

 

*count the number of sheets to import;

FILENAME FILES PIPE "DIR &excelfile.\xl\worksheets\*.xml /B";

 

DATA List_Sheets;

    INFILE FILES TRUNCOVER END=LAST;

    INPUT SNAME $200.;

   

RUN;

filename files clear;

 

*Import each sheet;

Data _null_;

    set list_sheets;

    sheet_name=catx("_", scan("&excelfile", -1, "\"), scan(sname, 1, "."));

    call execute('%import_worksheet(sheet='||sname||",fout="||sheet_name||")");

run;

Comments

Hello @Reeza,

 

This is very helpful. I was trying to automate reading the excel file into SAS. My challenge is to remove strikethrough values from the sheet.

 

My ideas was to modify the xml to remove those strikethroughs and convert back to xlsx file for import into SAS. I was on Linux environment and can't use the VBA approach for this task. 

 

My question for you is how are the .map files created? I didn't find any when extracting the excel file. 

 

Thanks,

Kiran

Version history
Last update:
‎10-05-2015 03:45 PM
Updated by:
Contributors

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Labels
Article Tags