To import XML file (Hierarchical structured) into SAS

morglum · Posted 08-20-2014 10:17 PM

Hi everyone,

I would like to download the entire table by mother tongue, age, sex and dissemination area found on this page. I am not sure how to import the resulting data into a SAS database though.

I tried using

libname myxml xml "filepath/filename.xml", but SAS crashed after a little more than an hour.

Any idea?

thanks,

jakarman · Posted 08-21-2014 02:05 AM

The links do not work, probably because they are having your session/user in it.
Anyway there is not information you wanted to show us that is needing for any ideas....

---->-- ja karman --<-----

morglum · Posted 08-21-2014 08:30 AM

Interesting - I tried the links on a different computer and they worked in IE, Firefox and Chrome. They also worked on a third computer, an iPad. Looks like your environment may be too much secure

Seriously though, this is very interesting and free data. I would really like to get it into SAS, but can't seem to get it to work under SAS 9.2. Any help (or idea) would be greatly appreciated.

jakarman · Posted 08-21-2014 10:19 AM

It was a script blocker active or some old version checking, indeed old version chrome is giving the data.
Have downloaded that and seeing after unzipping two files. One named generic_98 (15Gb) and one named Structure_98 (47Mb).

There are many XML-file conventions and for that to read a xml-file you sometimes need to use a mapfile (structure) the XMLmapper is found at SAS (Atlas)

at SAS Support Downloads Browse. When the structure file is the same as the mapper it must be.

The XML users guide is mentioning this. SAS(R) 9.4 XML LIBNAME Engine: User's Guide

As far I can see the structure-xml file is not a mapper file but real data.

---->-- ja karman --<-----

morglum · Posted 08-21-2014 11:06 AM

Thanks for the link, I'll see what I can come up with. cheers!

Reeza · Posted 08-21-2014 11:08 AM

If you download the IVT you can then use Beyond20/20, which is free, to export to CSV.

morglum · Posted 08-21-2014 12:13 PM

Reeza, that's clever, I didnt think about that. good thinking, I'll do just that.

Thanks Jaap for looking into this. I've fired an email to statscan, if I get an answer I'll post it here.

cheers all.

S.

morglum · Posted 08-21-2014 01:58 PM

I tried exporting the IVT to CSV but ran into a limit of Beyond 2020 : only the first 18 733 lines have been exported (about a third)..

ballardw · Posted 08-21-2014 10:56 AM

You really expect us to download a 166MB file at that first link??

The second link has download options, I typically would select the CSV format and start with Proc Import on that.

morglum · Posted 08-21-2014 11:05 AM

Hi Ballard,

The 16 KB CSV file doesnt contain the geography I want (dissemination block), only the aggregated data at the national level.

I'm surprised that 150 MB looks so huge, it's only a couple minutes download on my admittedly slow DSL internet. thanks anyway!

jakarman · Posted 08-21-2014 11:32 AM

the structure described at http://sdmx.org/ is describing a rather a specific XML structure used by census and governments. I tried the XMLmapper but he doesn't like the sizing of the XMLfiles. Would go for another approach (IVT) unless you want to be the first one solving this interface.

---->-- ja karman --<-----

tebillings · Posted 08-21-2014 10:47 PM

The SAS XML Mapper cannot import very large files. Somewhere between 30 and 50 meg it stops working (9.4 version).

IF your XML file is setup in a way that a subset of the file will include all or most of the tags: <whatever> and closing tags </whatever>. I was able to make a 30 meg text extract (from a 500 meg xml file) - the 1st 30 meg of lines, paying attention to get full tag sets - and closed it with a single extra hand-typed </tag>. Presto, the XML Mapper could parse it and read it and produce the xmlmap. I then used that map to read the entire 500 meg of xml. A few character fields are too short but it is easy to modify the xmlmap and rerun to fix that.

I'm not thrilled with the xmlmap - it introduced ordinals that may be unnecessary, but I'm relieved that I could read 500 meg of what will eventually be 9-10 Gig of input XML files. Now I can take the files and reconstruct the desired data.

However, if the XML file structure in your huge files cannot be reduced to a small extract, you can't read it in the XML Mapper.

An alternative: try the XML handling features of the Groovy language, available via PROC GROOVY (yes, it's a real language and a real SAS PROC).

morglum · Posted 08-22-2014 02:23 AM

Hi Tebillings,

Thanks for writing this up! I managed to export everything from beyond 20/20 this time around, but next time I would be very curious to learn about a language with such a funny name..

Vince28_Statcan · Posted 08-22-2014 01:20 PM

Hi Morglum,

I am glad to see non governmental users tap into our dissemination data. I am not posting on behalf of Statscan but myself only. Fereeza kindly pointed out the issue via e-mail. Our dissemination formats for complete files are indeed not the simplest to load directly into SAS. The beyond20/20 option is a good alternative for users that may not have the corporate IT infrastructure to use external XML tools and expertise to handle SDMX.

However, from personal experience on a project a few months ago where I was waiting for a database access and ran out of patience and extracted a SDMX which contained the specific information I needed at that point in time, I can say that it is feasible to import the SDMX files directly into SAS. Our dissemination website provides links to SDMX documentation but the basic gist is that the STRUCTURE file is metadata about the main file namely in terms of codesets definitions.

The SAS xml engine uses sequential access and thus shouldn't have any issue handling large files. As a matter of fact, I modified slightly my .map file from a few months back and extracted the ~35M records from the data file into SAS via only the XML engine before coming here to comment. SDMX however does not comply to the GENERIC xml in the xml libname engine documentation hence why you specifically need to build a custom XMLMAP for the job and the libname statement in your original post did not work.

A good place to get started in a scenario like this is to read the XML file as a text file via INFILE statement, put that into a large enough variable and limit yourself with OBS= infile option (I use between 100-1000). Then in another datastep output that into a different file to check the xml structure without crashing windows from out of memory exception because most default xml viewers would load the entire 15gb file into memory. Use a simple text viewer to see the structure and build your custom XMLMAP with it. There are different ways to map the file and then transform it into what you wish to obtain.

If you have tried the XML map option at some point but that it failed and that failure wasn't caused by a faulty .map file, it might be because of the XML file encoding or because you did not expand logical record length (LRECL= option) for the file and the XML structure broke down from truncation.

Proc groovy is a layer to call java from within SAS. There probably exists a free java class online that reads SDMX - I wouldn't know where to begin to look. The data step component object Java Object could've been used in a similar way but again that requires some external java expertise.

I am glad to see you managed to import the data. I just came by to share that it is possible to import into SAS via custom xmlmap with some effort.

Vincent

ptimusk · Posted 07-20-2016 12:38 AM

I went to the web site for SDMX standards and downloaded all their scheme and created an XML map for each scheme using a method of automapping suggested by user @BrunoMuellerin this question's thread