BookmarkSubscribeRSS Feed

Extracting Excel Files via the XML components

Started ‎07-09-2014 by
Modified ‎10-05-2015 by
Views 3,245

There have been a few questions lately on extracting data from ill formatted excel files and its a problem I've run into before. Typically I've used a variety of methods from manually going through these files to saving as CSV and parsing.

 

There's a new method that's been suggested that uses the fact that the new XLSX files are really zipped XML files. So if you change the extension of the file to a .zip and then expand the file you'll see the components. Then you can proceed to extract the data from the XML files.

 

Here's a sample program that demonstrates this, with the following assumptions:

1. Windows 7

2. 7-Zip

3. Ability to run X-Commands

 

Steps:

 

1. Rename excel file to .zip

2. Expand using 7zip commands

3. Read in all sheets, store with some naming convention

4. Clean up /*TO DO*/

 

Parameters:

1. excelfile: Path to XLSX file WITHOUT extension (.xlsx)

 

 

 

 

 

 

*Path to Excel file without extension;

%let excelfile=C:\_LOCALdata\Temp\Example1;

 

%put &excelfile;

 

*Copy file with zip extension;

x copy "&excelfile..xlsx" "&excelfile..zip";

 

 

*Extract zip file;

options mprint symbolgen xwait;;

 

%sysexec "C:\Program Files\7-Zip\7z.exe" x "&excelfile..zip" -o"&excelfile.";

 

 

*Import the Shared Strings file first as this is all the text in the document;

filename myxml "&excelfile.\xl\sharedStrings.xml";

filename mymap "&excelfile.\xl\xmlexample_generate.map";

 

libname myxml xmlv2 automap=replace xmlmap=mymap;

 

proc contents data=myxml._all_;

run;

 

data si;

    set myxml.t;

run;

 

libname myxml;

filename myxml clear;

filename mymap clear;

 

data si_fmt;

    set si;

    start=put(t_ordinal, 8. -l);

    label=T;

    fmtname="stringFMT";

    type="N";

run;

 

proc format cntlin=si_fmt;

run;

 

 

*code to reach each worksheet in, used in next step;

%macro import_worksheet(sheet=, fout=);

 

*Map to each worksheet and import;

filename myxml "&excelfile.\xl\worksheets\&sheet.";

filename mymap "&excelfile.\xl\worksheets\sample.map";

 

libname myxml xmlv2 automap=replace xmlmap=mymap;

 

proc contents data=myxml._all_;

run;

 

data values;

    set myxml.c;

run;

 

data valuest;

    set myxml.c;

run;

 

 

libname myxml;

filename myxml clear;

filename mymap clear;

 

 

data &fout;

    set values;

    if t='s' then value=put(v, stringfmt.);

    else value=v;

    keep row_ordinal c_ordinal r value;

run;

 

 

 

%mend import_worksheet;

 

*count the number of sheets to import;

FILENAME FILES PIPE "DIR &excelfile.\xl\worksheets\*.xml /B";

 

DATA List_Sheets;

    INFILE FILES TRUNCOVER END=LAST;

    INPUT SNAME $200.;

   

RUN;

filename files clear;

 

*Import each sheet;

Data _null_;

    set list_sheets;

    sheet_name=catx("_", scan("&excelfile", -1, "\"), scan(sname, 1, "."));

    call execute('%import_worksheet(sheet='||sname||",fout="||sheet_name||")");

run;

Comments

Hello @Reeza,

 

This is very helpful. I was trying to automate reading the excel file into SAS. My challenge is to remove strikethrough values from the sheet.

 

My ideas was to modify the xml to remove those strikethroughs and convert back to xlsx file for import into SAS. I was on Linux environment and can't use the VBA approach for this task. 

 

My question for you is how are the .map files created? I didn't find any when extracting the excel file. 

 

Thanks,

Kiran

Contributors
Version history
Last update:
‎10-05-2015 03:45 PM
Updated by:

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

SAS AI and Machine Learning Courses

The rapid growth of AI technologies is driving an AI skills gap and demand for AI talent. Ready to grow your AI literacy? SAS offers free ways to get started for beginners, business leaders, and analytics professionals of all skill levels. Your future self will thank you.

Get started

Article Labels
Article Tags