10-28-2015 10:29 PM - edited 10-28-2015 10:43 PM
I've noticed that there are many postings here about inconsistent or unexpected results when using PROC IMPORT with Excel. I would like to share a program I've written that address these problems. The trick is to force all columns to be imported as character by using the GETNAMES=NO option in PROC IMPORT, and then in a data step to convert each variable to its proper format. The catch is that you need to create a control worksheet that lists each variable name and its desired format. The program was presented at a meeting of the Toronto SAS user group. See TORSAS Web Site "Importing Excel into SAS: A Robust Approach for Difficult-To-Read Worksheets" . E-mail me if you would like a copy of the program.
10-29-2015 05:45 AM
Whilst I understand why you are going this route, I do feel that Excel has a number of fatal flaws in terms of structure which may not be picked up, and it is still a bit of a manual process. From my thinking the problem is in the use of Excel itself and the process. As I post quite a lot on this subject the proper process for the import of data should be a such (and pretty much goes along with Software Development Life Cycle procedure):
Agree with vendor on data decisions
Document structure, content, milestones etc.
Develop export (vendor), import (client) programs
Move to production
Keep importing at timepoints updating agreement when necessary
Following that procedure, at the start of the process, when agreeing with the vendor on data structure, I would personally have a nice long document drawn up on why not to use something such as Excel - which is an unstructured, proprietary format tool, in many cases not aligned with proper data structure thing. If the vendor insisted that there was no alternative, and I wasn't allowed to drop them from any further data processing, then I would indicate that they would need to provide the data in CSV at minimum (SaveAs from Excel). Why would I do this, well that is simple, Excel is notorious for hiding certain things from the user, and for not having proper database structures - i.e. each cell can be completely different from everything else whereas on a database a column has to all be the same format. Once the vendor saves the data to CSV a lot of these data issues are highlighted, and the vendor is then taking the responsibility for fixing it. If that step doesn't happen and the data comes to you, then you are in a real problem. In my industry data has to be validated, auditable, transparent, so we would need to store the Excel file, create the CSV file, then procedure to run validation programs over the data to ensure nothing has been mistranslated between the formats. Now if you wat to be responsible for that process then fine, go ahead with getting spreadsheets in, personally if I am paying vendors to supply data either externally, or even internally between departments, then it is their responsibility to provide accurate data in an appropriate format.
Another reason to follow the above procedure is so that a) everything is documented and transparent, but more importantly b) that things are agreed upon. Once the agreement is there that they will send var1 (num) var2 (char), if that structure changes in any way, it would immediately be highlighted in issues with the import program (thats a datastep import of CSV data), and you can go straight back to the vendor stating their data doesn't conform to the agreement, again pushing the responsibility back to the vendor. Of course updates can be made, but need to be addressed in the agreement first. It also makes programming an almost non-existant part.
So, for the TL;DR version: Excel is not a data transfer/database/data capture format, was never intended as such. Use the right tools and processes to achieve the aim, saves resources, keeps a nice clean and auditable/transparent trail.
10-29-2015 09:04 AM
Thanks for your thoughtful reply. In my situation I have little to no influence on data originators. I can’t ask too much of them because they generally have limited resources, e.g., software and training for tools such as MS Access. I get the data for free, and I too have limited resources. I want to avoid the requirement for a special version of their data files just for me. There’s risk in asking data originators to perform tasks such as copy/paste, and the possibility of file versioning errors when saving a copy in CSV. So far I haven’t encountered a “fatal flaw” in this approach, as far as I know. I encourage data originator to follow best practices, such as using consistent and meaningful column names, providing metadata such as units, formatting numeric fields, and using the ISO format for dates. I agree that this approach results in “a bit of a manual process”, as it’s a bother to create a control file for each source of data, but I think it results in higher data integrity where the data originator uses Excel. This program can be easily modified to work with CSV files, and this is on my list of future enhancements.
10-29-2015 09:25 AM
No probs, I think this is the key point from your text there:
"There’s risk in asking data originators to perform tasks such as copy/paste, and the possibility of file versioning errors when saving a copy in CSV."
Yes, but then you are taking that responsibility on yourself. Its unfortunate, but it does seem to be quite a widespread thing, this capture of data outside a proper container (doesn't have to be a database, but that is simplest), and I know its principally because Excel does provide a straightforward easy to use front end to enter information. Unfortunately its rarely the same people who use that front end as those who have to use the output, therefore they see the upsides and none of the downsides. Trust me, I still throw a tantrum when someone says they will send me a spreadsheet )
06-18-2016 06:05 PM
06-18-2016 08:10 PM