08-13-2012 12:35 PM
Was wondering if someone could refer me to good documentation / papers on rules/process for how the Add-In conversts data in an Excel sheet to xml, and then how the SAS server reads the data into a SAS dataset (using the XML engine). Whenever I read data from Excel, I worry about all the things that can go wrong (and there are SO many!). So wanted to understand more of how this magic is working, Detailed example (and questions) below.
I have a simple stored process on a server, and am using the add-in to stream data to it, the stored process then reads the data into a SAS dataset (using libname with XML engine), and runs PROC PRINT and sends the results back to Excel.
The stored process code is basically:
libname MyExcel xml; data mydata; set MyExcel.&_WebIN_SASNAME; run; proc print data=MyData; run;
I am feeding in data with with a few intentional oddities thrown in (numbers stored as text, dates stored as text, etc. See below (also attached).
The output I get is below (with problems highlighted):
(The attached file has this data, along with the xml data that is streamed).
When the Add-IN converts the Excel data to XML, it converts the "raw" values stored in Excel, not the formated values (this is good). The xml file has no indication of variable types (numeric vs char), it just has values. When the XML engine reads the xml file, it has to guess at variable types (i.e. which informat to use). (BTW, I don't blame SAS for this guessing, anything reading Excel has to guess, since Excel doesn't have variable types, and more importantly doesn't even have variables).
The CHARACTER column comes in fine. As does the NUMERIC column.
When importing MIXED1, the XML engine sees the value E, and reads the column with a character informat. Good, this is like the MIXED option. (I didn't check if there is a limit on number of rows scanned to see if there is any character data).
In MIXED2, bot the 3 and the 5 are string values in Excel ( '3 and '5). But in the xml there is no indication that they are strings. So the XML engine reads MIXED2 as numeric. Fair enough.
DATE1 is an Excel date (1=1/1/1900). When the Add-In generates the XML, the values stored in Excel are written out: <Date1>40909</Date1> Since the XML engine sees numbers, it reads them as numeric. Fair. Easy enough to convert these to SAS dates in SAS.
DATE2 has string values values in Excel that looks like dates ( '1/1/2012 etc) These values are left in the XML: <Date2>1/1/2012</Date2>. The XML engine sees this, decides they are dates, and reads them using MMDDYY Informat. Nifty.
DATE3 is mostly string values representing dates. But I added a Q. The Add-In dutifully creates the appropriate XML. But the XML engine decides to read the column with $8 informat. Since the last value is actually 9 characters long, it is truncated, and the value for Excel value '1/15/2012 becomes SAS value '1/15/201'. Disappointing.
DATE4 has an invalid date entered in row 4. As with DATE3, the XML engine decides the column is character (fair) but reads this in using $8,and data are truncated. Disappointing.
DATE5 is all strings in Excel. When the XML engine gets them, it decides MMDDYY. is the appropriate informat (because most of them look like dates?). The '5 in row 4 becomes a missing in SAS, because it can be read by the informat. Oddly, no message is written to the log by this conversion error.
So my questions:
1. Does the above description sound correct?
2. Are you also disappointed to see the highlighted rows in the output? (I know the input data are odd, but IMHO, this is the problem with reading data from Excel, users will throw whatever they want into a spreadsheet).
3. Are there ways to control this behavior? From my bit of testing, the Add-In is always doing a faithful job of converting the values to xml. But I was hoping to have more control of how the XML engine reads the data (similar to options like MIXED, GUESSINGROWS, SCANDATE, DBSASTYPE, etc.) I haven't worked much with XML, perhaps I should look into XML maps etc (Was hoping to avoid that when dealing with SAS-generated XML)?
4. Not shown above, but would appreciate any thoughts on issues of numeric precision during this process. I see that the xml engine has an XMLDOUBLE option, but not sure what is controlling the amount of precision when the add-in generates the XML from excel data.
5. Why is the variable order reversed with SAS reads the XML?