03-31-2015 03:52 PM
Hi, I'd like to hear suggestions on easy ways to import excel files with many variables so they can be merged with other imported excel files without getting errors that certain variables are defined as character and numeric. I've read many suggestions in these forums and elsewhere but all seem overly complex to me, so I'm open to new strategies that are simple and efficient.
I routinely need to import and merge files from excel. I understand that excel is not a database, and that SAS has to guess the format, and that it uses 8 rows to guess, etc. But this must be a common task, I am surprised that it is so difficult. I feel that the suggestions I've seen on the internet are completely bonkers. Renaming a variable, inputting a new variable with a specified format, dropping the old variable, and then renaming the new variable to the old variable name is an OK solution if you have one variable that is "defined as character and numeric", but when you need to merge 10 files with 80 variables and 15 of them are defined differently (and the differences vary across files) it becomes a hassle.
In addition to the variable rename technique above I've tried/heard about the following workarounds:
1. Converting the files to CSV and specifying the format: This is an ok workaround, but it's pretty inconvenient to open each file, save it as a different format, and store two formats of a data set (not to mention if the two data sets have different variable names for the same variable, I would have to go in and change them by hand in excel or write a ton of code to rename, keep, drop, etc. Not appealing for one-time projects).
2. A colleague of mine inserts 8 rows of letters for char variables and numbers for num variables and then deletes them after import. Definitely not an appealing option.
3. In Excel, manually changing the format of each column and praying that the dbms = xlsx will read them. Again, a hassle with multiple files with long lists of variables.
4. Changing the number of guessing rows: This is fine, but if a variable is blank in one data set, or if there are some variables that have values in rows where others have blanks (and thus can't be sorted such that the values are on top across both variables) then this strategy won't help.
Here is the best solution I've come up with so far:
5. Put each dataset in different tabs in the worksheet. Set the excel file as a library using libname lib excel "<file path>". Use proc contents to export the variable names to a third excel worksheet. Use that list to create a list of individual formats for each variable. Then set the excel datasets to a SAS dataset using dbSAStype for each variable. I think this procedure is very thorough but overly complex.
Today I tried to experiment with the following strategy:
libname lib excel "<file-path>";
informat var1 var2 $200.;
set lib."dat1$"n lib."dat2"n;
in dat1, var1 has some character strings, and in dat2, var1 is totally blank. SAS told me me var1 and var2 are defined as character and numeric and wouldn't merge the files. (An aside: Why can't I just change the format of one of the variables? Why can SAS guess that I spelled commands wrong but it can't guess that a variable with digits as characters might be numeric?)
There's got to be an easier way! I understand that Excel isn't ideal but many people use Excel so there must be a simpler way to perform what must be a very common task. I'd really love to hear ideas that might make this procedure simpler.
03-31-2015 09:38 PM
No, you can set it, and generally it can be set to scan the entire file first which helps to resolve a lot of the errors.
Out of curiosity what type of algorithm would you suggest to determine if a column is text/character? And how would you ensure that this is consistent between files where even variable names are not consistent as in your example?
04-01-2015 07:09 AM
Just a note that these issues are the same no matter what software you use to read in the excel files, R, Python, SAS, SPSS etc...
04-01-2015 12:16 PM
Hi Reeza, setting the guessingrows to read the entire file will help in some cases, but not in the case where a variable is char in one file and blank in the other.
Maybe no one can solve the algorithm problem, and that's ok. What I don't understand is why basic data cleaning tasks (that must be performed quite often) are so convoluted. For example: In SAS, if I want to rename 80 variables in my dataset, I have to use a rename statement for every single one and have 80 lines of code (or complex macros or code that is beyond many users' understanding). in R, I can set a vector of names to be the names of all of the variables in a dataset with
names(dat) = var_names or
names(dat) = read.xls("var_names.xls").
In SAS, if a variable gets imported as character and I want to convert it to a number, I have to input it into a new variable, drop the old variable, and rename the new variable. In R, it's
dat$var1 = as.numeric(dat$var1)
and character variables are converted to NA values.
I'm not saying SAS has to be like R. R has its own annoyances to be sure. But why are these basic tasks so difficult?
04-01-2015 12:30 PM
For renaming, if you have a variable mapping list its an easy exercise to create a macro variable that will perform the rename for you.
04-01-2015 04:26 AM
Shailey, your note is very clear on the data processing problem.
The humans working with Excel can make a tremendous of their data. As that is how spreadsheet works (cell oriented) you cannot solve that SAS cannot solve that nobody can solve that. What you have found is a lot of idea-s of people thinking they are solving that.
There is a strange attitude as we have been educated to solve problems expecting every problem has a solution. Going real there are likely more problems having no solution than there problems having one (may be still unknown). same as R numbers are outnumbering Q en those of I are far more on those.
You must have noticed that: the number 1 get right aligned in a cell but the character 1 (type a '1 ) get left aligned and marked with a green corner. There are a lot of Excel fucntions showing the value of 1 as result of some other value. A human will see only a 1 but in reality these are totally different. How would you correct those endless list of possible differences.
04-01-2015 06:12 AM
You have given an apt description why Excel has NO place in Business Intelligence. Except maybe as a simple tool to create (optically) nice output, at the very end of the processing chain.
Using data _from_ Excel is one good example for the acronym GIGO.
What you are experiencing is the same pain that everyone else working with SAS has when "data" comes from a source as unreliable and faulty as Excel.
Part of my income can be described as "payment for somehow making sense out of the crap idiots deliver when they try to be intelligent with seemingly easy-to-use tools".
In other words, don't despair. This is why you get paid the big bucks.
04-01-2015 12:22 PM
Hi Kurt, I am totally on board with this--merging multiple excel files from multiple sources with multiple processes (color coding as a variable indicator?? or notes?? really??) is frustrating and error-prone and illustrates why Excel is not well-suited for data storage. I have to admit, sometimes I do enjoy getting paid to solve these kinds of puzzles. But sometimes I wish I could just convert a variable to numeric without several lines of code.
04-01-2015 08:40 AM
Data processing and guessing (rows) are not compatible. You want to show you know what you're doing.
Use a scripting language like Power Shell or VBSCRIPT (I've used this one) to do #1 "save the EXCEL sheets to CSV files or other delimited file format. You can find lots of examples of this on the web and here:
Now all fields are character and you can read each with the proper SAS attributes.
Define all the fields as SAS variables in a data set with zero observations. This way you will have complete control over how each field is defined, length and type, to SAS and the INFORMAT used to read it.
call missing(of _all_);
Download my macro from this thread
also discussed here
This macro will use the data set as the the guide for which fields are read and how. The order of fields in the CSV is not important and fields that are not defined by the empty data set are ignored.
04-01-2015 12:23 PM
Thank you, data_null; I will stretch my mind and try to understand these procedures as I can. In the meantime I'm hoping I can find some more straightforward and simple solutions that I know how to use.
04-01-2015 12:39 PM
data_null_; I'm not sure what you're saying; do you mean that I should just give up on using SAS if I have to receive files in Excel format? Or that I can't use SAS unless I learn VB?
So this is a forum about integrating SAS with Microsoft, and the answer is "don't"? Seems a little user-unfriendly.