I have 5,000 files with 132 columns each. Aside from the first 10 columns, which are identifiers, half of the remaining columns should be numeric and the other half character. An unknown number of files have all fields defined as character. I need to combine all 5,000 files into one dataset, but I can't, because I get the "Variable has been defined as both character and numeric" error. How do I check that all columns are the expected type, and fix the errant columns if they are not? A manual solution will not work.
Do you only have the SAS datasets? Or did you create the SAS datasets from other files? If so, what type of files?
If they are delimited text files then it should be simple to write code that will read the file in using the proper type (and other attributes) for the variables.
Do you have (or can you create) a dataset that has all of the possible variables defined properly?
If so, then just read the first line to find out what variables are in the current file, and then use the "template" dataset to drive the reading of the data.
So first read the names from the first line of the text file.
data names;
  infile "file1.csv" dsd obs=1;
  varnum + 1;
  input name :$32. @@;
run;
Now make a macro variable with the list of names in the order they appear in this particular text file.
proc sql noprint;
  select nliteral(name) into :names separated by ' '
    from names
    order by varnum
  ;
quit;
Now use this macro variable to generate the code to read the actual data.
data want;
  if 0 then set template(keep=&names);
  infile "file1.csv" dsd firstobs=2 truncover;
  input (&names) (+0);
run;
You can easily use a list of the files to be read to generate these steps for every file.
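A minimal sketch of how those three steps might be wrapped for many files. It assumes the CSV paths and output dataset names are stored in a dataset called FILELIST with variables FNAME and DSNAME (all of those names are placeholders), and that TEMPLATE holds the properly defined variables:

```sas
%macro readone(csvfile=, out=);
  /* Step 1: read the variable names from the header row */
  data names;
    infile "&csvfile" dsd obs=1;
    varnum + 1;
    input name :$32. @@;
  run;

  /* Step 2: build the ordered list of names for this file */
  proc sql noprint;
    select nliteral(name) into :names separated by ' '
      from names order by varnum;
  quit;

  /* Step 3: read the data using TEMPLATE's variable attributes */
  data &out;
    if 0 then set template(keep=&names);
    infile "&csvfile" dsd firstobs=2 truncover;
    input (&names) (+0);
  run;
%mend readone;

/* Drive the macro once per file listed in FILELIST */
data _null_;
  set filelist;
  call execute(cats('%nrstr(%readone)(csvfile=', fname, ', out=', dsname, ')'));
run;
```

The %NRSTR wrapper delays macro execution until CALL EXECUTE has finished generating all the calls, which avoids timing problems when the macro itself contains DATA and PROC steps.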
@Tom wrote: Do you have (or can you create) a dataset that has all of the possible variables defined properly?
Possibly! However, not all variables exist in any single file. I could run PROC CONTENTS after each PROC IMPORT, combine the PROC CONTENTS output, and from there get a master list of variables using PROC SORT and a list of variables with multiple data types using PROC SQL. I'm not sure how I'd use that information to implement a fix, however.
Talk to whoever is creating these files to see if there is documentation.
There may be patterns in the file naming that can be used to point each file to a particular script that reads it into SAS.
What type of files? Spreadsheet? Comma-separated text? Other delimited text? Fixed-width columns? Named (each value has something like Accountnumber=1234567)? DBF? JSON?
WHAT will be the rules for combining 5,000 files into one data set?
The description of what you have, as I see it, looks like "job security" because no one knows what you have, and you should not expect any "quick" resolution.
@ballardw wrote: What type of files?
I should specify these are csv files.
@ballardw wrote: The description of what you have, as I see it, looks like "job security" because no one knows what you have, and you should not expect any "quick" resolution.
Haha, something I encounter in my line of work very often. In this case, however, it's because the data came from several sources across multiple years. The named columns are (usually) consistent within a year, but not across years, which is what we're working on now.
You need to do some investigation across the 5,000 files. Surely there can't be 5,000 different file layouts. How many do you think there are? For example, if it turns out you have 10 different file layouts, then you can use PROC IMPORT to generate 10 different DATA step programs for importing them.
I would start by trying to build a data dictionary. I'm imagining building a table (SAS dataset) that has one record per file/variable pair and gives the attributes of each variable in each SAS dataset, something like:
File  Variable  Length
A     Var1      8
A     Var2      $20
A     Var3      8
A     Var4      $58
B     Var1      8
B     Var3      $10
B     Var4      $58
If you've already used PROC IMPORT to read in all 5,000 files, then you should be able to build that data dictionary from dictionary.columns, or PROC CONTENTS, or whatever.
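A sketch of building that dictionary from DICTIONARY.COLUMNS, assuming the 5,000 imported datasets all live in a single library (assigned here as IMP; adjust the libref to yours):

```sas
proc sql;
  create table datadict as
    select memname as file,
           name    as variable,
           type,            /* 'char' or 'num' */
           length
      from dictionary.columns
      where libname = 'IMP'   /* library names are stored in upper case */
      order by variable, file;
quit;
```

Sorting by variable puts all of a variable's definitions next to each other, which makes the type conflicts easy to spot.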
Once you have that data dictionary, you can use it to determine which variables need to be recoded, and how. Maybe the rule is "if a variable is character in any dataset, then convert it to character in every dataset, and set its length to the maximum length." Then add a column to your dictionary, to make it more of a crosswalk, with the current variable attributes and the new variable attributes, e.g.:
File  Variable  Length  NewLength
A     Var1      8       8
A     Var2      $20     $20
A     Var3      8       $10
A     Var4      $58     $75
B     Var1      8       8
B     Var3      $10     $10
B     Var4      $75     $75
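That "character wins, take the maximum length" rule could be computed with something like the following, assuming a data dictionary dataset DATADICT with columns FILE, VARIABLE, TYPE ('char'/'num'), and LENGTH (all names are placeholders to adjust):

```sas
proc sql;
  /* one row per variable: the reconciled type and length */
  create table rules as
    select variable,
           case when sum(type = 'char') > 0 then 'char' else 'num' end as newtype,
           max(length) as newlength
      from datadict
      group by variable;

  /* join back to one row per file/variable: the crosswalk */
  create table crosswalk as
    select d.*, r.newtype, r.newlength
      from datadict d left join rules r
        on d.variable = r.variable;
quit;
```

One caveat: a numeric variable has LENGTH 8, so when it is forced to character, MAX(LENGTH) may be too short to hold its formatted value; you may want to widen NEWLENGTH for those rows.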
With that crosswalk file, you could generate the code to convert each variable into the new length, likely using a macro.
Then after all files have been converted to have the standard variable attributes, you can concatenate them.
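A hedged sketch of that generation step, assuming a crosswalk dataset CROSSWALK sorted by FILE, with columns FILE, VARIABLE, TYPE, NEWTYPE, and NEWLENGTH; only the numeric-to-character case is shown, and all names are placeholders:

```sas
data _null_;
  set crosswalk;
  by file;
  length tmpvar $ 40;
  /* open a standardized copy of each file */
  if first.file then
    call execute(cats('data std_', file, '; set ', file, ';'));
  /* convert numerics that must become character */
  if type = 'num' and newtype = 'char' then do;
    tmpvar = cats('_c_', variable);
    call execute(cats('length ', tmpvar, ' $', newlength, ';'));
    call execute(cats(tmpvar, '=left(put(', variable, ',best32.));'));
    call execute(cats('drop ', variable, ';rename ', tmpvar, '=', variable, ';'));
  end;
  if last.file then call execute('run;');
run;
```

Once the STD_ copies exist with consistent attributes, a single step such as "data combined; set std_: ; run;" concatenates them all via the name-prefix list.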
It's nice when a file arrives with documentation. But when it doesn't, it's often handy to build your own.
PROC COMPARE has an easy-to-use metadata comparison, maybe a bit easier than going through the VCOLUMN views, etc. It's not the complete solution you're looking for, but maybe an easy starting point?