toomanystepsint
Fluorite | Level 6

I have 5,000 files with 132 columns each. Aside from the first 10 columns, which are identifiers, half of the remaining columns should be numeric and the other half character. An unknown number of files have all fields defined as character. I need to combine all 5,000 files into one dataset, but I can't, because I get the "Variable has been defined as both character and numeric" error. How do I check that all columns are the expected type, and fix the errant columns if they are not? A manual solution will not work.

13 REPLIES
Reeza
Super User
If all the files have the same structure, you can use one script (a DATA step rather than PROC IMPORT) to read them all.

Run proc import on the files, take the code from the log, fix it to ensure all variables are correct and then use that to read all the text files.
You can also read all text files at once into a single data set. Search on here for various methods of doing this.
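One such method, for the case where every CSV really does share one layout, is a single DATA step with a wildcard INFILE. A minimal sketch; the wildcard path, variable names, and informats below are placeholders, not the poster's actual columns:

```sas
/* Read every matching CSV in one pass, skipping each file's header
   line and recording which file each row came from. */
data all;
  length fname $260 source $260;
  infile "/data/file*.csv" dsd truncover filename=fname eov=neweof;
  input @;                       /* peek at the record first          */
  if _n_ = 1 or neweof then do;  /* header line of each new file      */
    neweof = 0;                  /* EOV= is not reset automatically   */
    delete;
  end;
  source = fname;                /* FILENAME= variable is not kept,
                                    so copy it to a real variable     */
  input id :$10. amount category :$20.;
run;
```

The EOV= flag is set when SAS reads the first record of each file after the first, which is exactly the header row you want to discard.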
toomanystepsint
Fluorite | Level 6
They do not have the same structure.
Reeza
Super User
OK. So how do you know which columns need to be which type?
toomanystepsint
Fluorite | Level 6
The columns have names like ASD for numeric and ASD_cat for character. However, not all files contain ASD, and a file that contains ASD might not contain all of the numeric fields found in the other files.
Tom
Super User

Do you only have the SAS datasets? Or did you create the SAS datasets from other files? If so, what type of files?

If they are delimited text files then it should be simple to write code that will read the file in using the proper type (and other attributes) for the variables.

 

Do you have (or can you create) a dataset that has all of the possible variables defined properly?

If so then just read in the first lines to find out what variables are in the current file and then use the "template" dataset to drive the reading of the data.

 

So first read the names from the first line of the text file.

data names;
  infile "file1.csv" dsd obs=1 ;
  varnum+1;
  input name :$32. @@ ;
run;

Now make a macro variable with the list of names in the order they appear in this particular text file.

proc sql noprint;
select nliteral(name) into :names separated by ' ' 
  from names
  order by varnum
;
quit;

Now use this macro variable to generate the code to read the actual data.

data want;
  if 0 then set template(keep=&names);
  infile "file1.csv" dsd firstobs=2 truncover;
  input (&names) (+0);
run;

You can easily use a list of the files to be read to generate these steps for every file.
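For example, the three steps above could be wrapped in a macro and driven from a file list. A sketch: it assumes a dataset FILES with the path of each CSV in a variable FNAME, and a TEMPLATE dataset with every possible variable correctly typed (both names are made up here):

```sas
%macro read_one(file=, out=);
  /* Step 1: grab the header row to learn this file's variable names */
  data names;
    infile "&file" dsd obs=1;
    varnum + 1;
    input name :$32. @@;
  run;
  /* Step 2: build the ordered name list */
  proc sql noprint;
    select nliteral(name) into :names separated by ' '
      from names
      order by varnum;
  quit;
  /* Step 3: read the data with types taken from TEMPLATE */
  data &out;
    if 0 then set template(keep=&names);
    infile "&file" dsd firstobs=2 truncover;
    input (&names) (+0);
  run;
%mend read_one;

/* Generate one macro call per file from the driver dataset */
data _null_;
  set files;
  call execute(cats('%nrstr(%read_one)(file=', fname,
                    ', out=part', put(_n_, z4.), ')'));
run;
```

The %NRSTR wrapper delays macro execution until CALL EXECUTE has finished pushing the code, which avoids timing problems with the macro's own DATA and SQL steps.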

toomanystepsint
Fluorite | Level 6

@Tom wrote:

Do you have (or can you create) a dataset that has all of the possible variables defined properly?

Possibly! However not all variables exist in one single file. I could do a proc contents after the proc import, combine the output of the proc contents, and from there get a master list of variables using proc sort and a list of variables with multiple data types using proc sql. I'm not sure how I'd use that information to implement a fix however. 

Reeza
Super User
Check out the sashelp.vcolumn data set. It's a metadata table containing the table and variable names, which you can filter dynamically. I would get that table, filter it to your tables, and sort by variable name. Look for cases where the variable name is the same but the types differ (you could do this via an SQL query: max(type) ne min(type)); then you've identified the troublesome variables and source files. You can then fix them by modifying the import code for those files, manually or programmatically depending on the type of discrepancies you're facing.
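That query could be sketched against DICTIONARY.COLUMNS (the SQL view behind sashelp.vcolumn) like this; it assumes the imported datasets all sit in a library called RAW, so adjust the LIBNAME filter to match:

```sas
proc sql;
  /* Variables whose type differs across datasets */
  create table conflicts as
    select upcase(name) as varname,
           min(type)    as min_type,
           max(type)    as max_type
      from dictionary.columns
      where libname = 'RAW'
      group by calculated varname
      having min(type) ne max(type);

  /* Which dataset defines each conflicting variable, and how */
  create table conflict_detail as
    select memname, name, type, length
      from dictionary.columns
      where libname = 'RAW'
        and upcase(name) in (select varname from conflicts)
      order by name, memname;
quit;
```

TYPE holds 'char' or 'num', so min(type) ne max(type) is true exactly when both types occur for the same variable name.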
ballardw
Super User

Talk to who ever is creating these files to see if there is documentation.

There may be patterns in the file naming that can be used to point them to a particular script to read them into SAS.

What type of files? Spreadsheet? Comma-separated text? Other delimited text? Fixed-width columns? Named (each value has something like Accountnumber=1234567)? DBF? JSON?

 

WHAT will the rules be for combining 5,000 files into one data set?

 

The description of what you have, as I see it, looks like "job security" because no one knows what you have, and no one should expect any "quick" resolution.

toomanystepsint
Fluorite | Level 6

@ballardw wrote:

What type of files?

I should specify these are csv files. 

 

The description of what you have, as I see it, looks like "job security" because no one knows what you have, and no one should expect any "quick" resolution.


Haha, something I encounter in my line of work very often. In this case, however, it's because the data came from several sources across multiple years. The named columns are (usually) consistent within a year, but not across years, which is what we're working on now.

SASKiwi
PROC Star

You need to do some investigation across the 5,000 files. Surely there can't be 5,000 different file layouts. How many do you think there are? If it turns out you have, say, 10 different layouts, then you can use PROC IMPORT to generate 10 different DATA step programs for importing them.
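For each distinct layout, one PROC IMPORT run writes a complete DATA step to the log, which can then be copied, corrected, and reused. A sketch; the file name is a placeholder:

```sas
proc import datafile="/data/example_layout1.csv"
            out=work.layout1
            dbms=csv
            replace;
  guessingrows=max;  /* scan all rows so types aren't guessed
                        from just the first few              */
run;
/* The generated DATA step appears in the log: fix any wrong
   informats or lengths there, then use that corrected step
   for every file sharing this layout. */
```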

Quentin
Super User

I would start by trying to build a data dictionary. I'm imagining a table (SAS dataset) that has one record per file-variable pair and gives the attributes of each variable in each SAS dataset, something like:

 

File   Variable   Length
A      Var1       8
A      Var2       $20
A      Var3       8
A      Var4       $58
B      Var1       8
B      Var3       $10
B      Var4       $75

If you've already used PROC IMPORT to read in all 5,000 files, then you should be able to build that data dictionary from dictionary.columns, or PROC CONTENTS, or whatever.

 

Once you have that data dictionary, you can use it to determine which variables need to be recoded, and how. Maybe the rule is "if a variable is character in any dataset, then convert it to character in every dataset, and set its length to the maximum length." Then add a column to your dictionary, to make it more of a crosswalk, with the current variable attributes and the new variable attributes, e.g.:

 

File   Variable   Length    NewLength
A      Var1       8         8
A      Var2       $20       $20
A      Var3       8         $10
A      Var4       $58       $75
B      Var1       8         8
B      Var3       $10       $10
B      Var4       $75       $75

With that crosswalk file, you could generate the code to convert each variable into the new length, likely using a macro.
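For one file-variable pair, the generated conversion might look like this. A sketch only: Var3, the $10 length, and the BEST12. format stand in for values the macro would pull from the crosswalk:

```sas
/* Convert numeric Var3 in file A to a $10 character variable. */
data a_fixed;
  set a(rename=(var3=_var3));        /* free the name for the new var */
  length var3 $10;                   /* NewLength from the crosswalk  */
  var3 = left(put(_var3, best12.));  /* numeric -> character          */
  drop _var3;
run;
```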

 

Then after all files have been converted to have the standard variable attributes, you can concatenate them.

 

It's nice when a file arrives with documentation.  But when it doesn't, it's often handy to build your own.

donricardo
SAS Employee

PROC COMPARE has an easy-to-use metadata comparison, maybe a bit easier than digging into the vcolumn views, etc. It's not the complete solution you're looking for, but maybe an easy starting point?
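For two files at a time, that comparison could look like this (the dataset names are placeholders). NOVALUES suppresses the value-by-value comparison, so the report is mostly the attribute differences, and LISTVAR lists variables that appear in only one of the two datasets:

```sas
proc compare base=work.file1 compare=work.file2 novalues listvar;
run;
```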

Reeza
Super User
If the output was in a more usable form I'd agree with you! One thing that's really due for an update in SAS.

