@Season wrote:
Does importing the file with the DATA step necessitates specification of the name and informat of each of every variable in the CSV? That would be a very formidable job as I have possibly thousands of columns in all.
Proc Import will write a basic data step program to read a text file. The code will be in the log and can be copied from the log to the editor, cleaned up and rerun (or issue a RECALL command immediately after the Proc Import to bring the code into the editor.)
The types of informats that often need to be addressed are those where the value contains all digits but you want to maintain leading zeros, such as account numbers. Change the informat to character long enough to hold the value (a $20. or similar). In most cases when modifying a Proc Import generated data step you can drop the FORMAT statements for variables except date, time and datetime variables unless you want to assign custom formats. Also to look out for are columns with mixed use of negative signs and () for negative values, or currency and percent signs that aren't on every value. These may require additional coding as well as in read as character and parse. If you have multiple currency symbols such as dollar, Yen, Pound, Franc and such this might be a very import consideration if you want to manipulate the currency values in any consistent manner.
Check on the assigned informat for your problem variables. If they were read as character but should be dates that is an indication that you may need to create new variables by parsing the values. Check on your national language settings (NLS) to see what order dates are read. OR if you see lots of invalid data messages involving those variables it is one indicator that the order may be different than your NLS and override to read as character and parse.
If you have variables that would best be considered Boolean, i.e. Yes/No, True/False, and such it may be worth creating and using a custom informat so that the results are numeric 1/0 as that will be much easier to work with in most cases going forward instead of a hodgepodge of Y/N T/F character values.
Another consideration not mentioned yet, is if these files are supposed to be of the same layout you should be able to use the same data step to read all of them by changing name of input file and output data set. But it is very likely that lengths of character variables will differ between files. So modify any of the $w. informats to allow for this. I generally start at 15% or so longer than the generated data step. And then check after reading that the values look right. If not make the informat wider and re-read.
A last issue relates to variable names generated from column headings that are either very long (will get truncated at 32 characters) or identical in the source file. If column headings are identical for the first 32 characters of a longer heading the first will get part of the text as the variable name. The others will get VARxxxx where xxxx may be the column number in the file. Identical shorter heading may get numeric suffixes added. Example a file with multiple headings of "Total", the first will have a variable name of Total, the next Total2 (or Total1 been awhile) with incremented numbers for each following.
Recommend setting option VALIDVARNAME=V7 before Proc Import. Dealing with variable names with spaces and non-standard characters gets old real quick having to use the name literal such as 'Stupid variable name'n every where. The V7 option will replace all the special characters with _ and be easier to type (or rename as desired).
One tool to help deal with some of this if you don't have good documentation is to copy the header row of the CSV, assuming it has column headers, and Paste that TRANSPOSED into a spreadsheet. That will give you one "row" per variable to do such things as examine how long the variable names might be, whether different files have different headers (paste into a different column in the spreadsheet and run a comparison of values). If you have a source that has narrative column headings you can with a little work in the spreadsheet get it to create LABEL assignment statements for variables by pasting the variable names from the proc import generated data set into another column (either using the INPUT statement from the code or Proc Contents output) and use spreadsheet functions to create text like varname ="original column heading text goes here".
Any data source that may have "thousands of columns" and doesn't provide documentation as to content of the file, such as expected lengths of character variables and layouts of date, time or datetime values needs to be considered with great suspicion. Without documentation how do you know what anything represents?
... View more