07-08-2014 08:12 PM
I asked this question at StackOverflow and someone suggested I post here (first time poster).
I have a few hundred Excel files (.xlsx) that contain demographic, biochemical, and time data collected every hour for up to 5 days, with each day in a new worksheet. The data was not entered in a consistent manner AT ALL so sometimes 'name' might be in A1 and other times it might be in A2 or B2. Same thing for 'dateTIME' and other variables.
This is the most vexing issue, though: specific variable names are listed in column A (though not in consistent cells throughout all the workbooks), with their corresponding values listed horizontally by the time they were collected (e.g., 00:00 in column B vs. 14:00 in column P). The data does not always start at 00:00 (i.e., midnight; sometimes it starts at 3:00 or 22:00, etc.), but column B is always 00:00, column C is always 01:00, and so on.
How can I go through each of these workbooks and the multiple worksheets within to extract the name/dates/a specific variable with its corresponding values into some sort of organized table? These are the tools I can use (can use nothing else because of security reasons): SAS 9.2, R, Access, Oracle.
I think an Excel Macro using LINQ code might be helpful, but I've never worked with LINQ before.
I'd prefer to do this in SAS, but this data is a mess and I'm not sure where to start. I'm great with the PROC step but not so great with the DATA step yet.
I've attached a sample workbook (with multiple worksheets). Keep in mind that each workbook in this set of a few hundred files is generally in this format but may have the variable names and values I'm looking for in a different column or row (usually only one or two cells away). I think I have to query this somehow.
I was thinking maybe I should write a macro to go through all the workbooks and worksheets and sort all the variables alphabetically, and then move on from there... is this a good idea? Any tips would be appreciated. I generally don't have command privileges on my computer, but our IT department can map me to a group with command privileges temporarily in order to do this...
07-09-2014 10:52 AM
I actually went through a similar exercise involving Lotus 123 and using SAS 5.18.
This is when I learned to hate spreadsheets.
The approach we ended up using was to identify groups of files with similar structures and write individual programs to deal with each group. Since we ended up with differing behavior for some similar items (Zip vs. Zip+4), we had to write the programs as macros with parameters that allowed us to adjust for some of those differences.
I doubt that PROC IMPORT is going to work well with this data, as SAS will expect all of the values in a column to be either numeric or character. You would likely need to use the MIXED option and then sort out the differences between datasets. With as much garbage as you imply, I would likely write a script to save all of the pages as separate CSV files, since then I could read each one with much more control.
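For reference, a minimal sketch of what that MIXED route looks like in SAS 9.2 (the path, sheet name, and output dataset are placeholders; DBMS=EXCEL with MIXED=YES asks the engine to read columns containing both numbers and text as character):

```sas
/* Sketch only: placeholders for path, sheet, and output name.
   MIXED=YES reads mixed numeric/character columns as character
   instead of turning the minority type into missing values. */
proc import datafile="C:\data\patient001.xlsx"
            out=work.day1 dbms=excel replace;
    sheet="Day1";
    mixed=yes;
    getnames=no;   /* the sheets have no reliable header row */
run;
```

Even then, you would still have to untangle the shifted rows afterward, which is why the CSV route tends to win.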
I know you didn't collect the data, but whoever was in charge of this project dropped a big ball. Data quality checking should start as soon as data collection starts, to minimize this kind of problem.
07-09-2014 11:21 AM
Just to add, I don't think LINQ will help you. Also, Excel VBA, whilst straightforward, wouldn't be any different from doing it in SAS. I completely agree with ballardw; perhaps a petition to have Excel banned? What I would do is write a small VBA script which opens all the files in Excel and saves each one as CSV. Then read them in manually and post-process each file.
Depending on the structure, if the variable name is on the left and mostly the same across files, then you could set up a set of macros, e.g.:
Then, with the imported data, loop over the rows and call the macros; this would generalize at least part of the data mapping, e.g.:
if cola in ("name","ssn","studyid"...) then call execute('%'||strip(cola)||' (...);');
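Spelled out a bit more (the macro names, dataset name, and column A variable `cola` here are hypothetical), the pattern is one small macro per expected variable name, then a CALL EXECUTE loop over the imported rows:

```sas
/* One hypothetical macro per expected variable name; each one
   knows how to move its row's hourly values into an output table. */
%macro name(row);
    /* ... extract the time-series columns from observation &row ... */
%mend;
%macro studyid(row);
    /* ... */
%mend;

/* Scan column A and queue the matching macro call for each hit.
   work.imported is a hypothetical dataset, one obs per sheet row. */
data _null_;
    set work.imported;
    if lowcase(cola) in ("name","studyid") then
        call execute('%' || strip(lowcase(cola)) ||
                     '(' || strip(put(_n_, best.)) || ');');
run;
```

CALL EXECUTE stacks the generated macro calls and runs them after the data step finishes, so each known variable name gets routed to its own handler regardless of which row it landed on.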
07-09-2014 12:19 PM
You said the files were .xlsx but have posted an .xls file. Are the files XLS or XLSX? This makes a big difference, as XLSX files are compressed XML that can be extracted, whereas XLS files are not.
07-09-2014 01:57 PM
The files are XLSX -- I posted an XLS file because the SAS forum wouldn't allow me to attach an .xlsx.
So I can still follow the advice/steps outlined above with .xlsx files, correct?
07-10-2014 03:52 AM
Personally, I would still go with saving the data to CSV and then writing a data step import -- far more control. Whilst you could go down the route of unzipping an .xlsx and then reading in the XML files, expanding the shared-string values, etc., it's probably a lot of work which you could avoid by saving to CSV.
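As a sketch of that data step import (the filename, variable names, and the 24-hour column layout are assumptions based on the description above):

```sas
/* Read one exported CSV: column A holds the variable name,
   columns B onward hold the hourly values 00:00-23:00.
   Everything is read as character first; cleaning comes later. */
data work.extract;
    infile "C:\data\patient001_Day1.csv" dsd truncover;
    length varname $32 h0-h23 $40;
    input varname $ h0-h23 $;
    if lowcase(varname) in ("name","datetime","glucose");  /* keep rows of interest */
run;
```

TRUNCOVER keeps short or ragged lines from tripping the read, and reading everything as character lets you sort out the inconsistent cell positions in later steps rather than losing data at import time.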
07-10-2014 01:20 PM
@capony I added you to a shared doc, where I'm trying to automate a solution that will read the excel files.
It only works in parts right now; I'm having trouble with my macro quoting, which is why I haven't posted it here, but if you would like to help test it out, let me know. I don't have enough time to go into it more today.
If anyone else is interested in helping let me know and I'll add you.
07-10-2014 02:18 PM
Wow, thanks so much Reeza. I really appreciate your help with this. I am hamstrung in what methods I can use because of all the security restrictions on the data and our IT systems. This is my first time in the SAS forums -- even so, I'm a bit embarrassed to admit that I can't find the shared doc. Can you kindly tell me how to access it?
07-10-2014 03:12 PM