Hello,
I'm tasked with creating a process that will import many csv files with state addresses. For example the Iowa directory has 4 csv files.
All the tables have the same columns, but different data types and length.
I created a macro to import the csv files, but now when I try and stack them I get the types doesn't match error.
I created a empty table and using char(32767) so it will fit all rows, but now the table is very large.
What would be the best approach for this situation?
Thank you
The INFILE option FILENAME will capture the input file information. Since the automatic variable will not be saved you need to add an instruction to do that explicitly and since the variable will be character you should specify a value long enough to store the expected results. A skelton modifying the previous post would look like:
data want;
length FileKeep FileTemp $ 200;
infile "path\*.CSV" lrecl=32767
firstobs=2 dsd eov=skip missover
Filename=FileTemp;
/* informat and format statements go
here when I write code
*/
input @;
FileKeep = FileTemp;
if SKIP then SKIP=0;
Else do;
/* input and any manipulation code goes here*/
end;
run;
The length needs to be set Before the variable is created else you only capture 8 characters by default. I use 200 as an example. If you KNOW that the length of your file path plus name will be less or more then adjust.
Note the assignment FileKeep= FileTemp is AFTER and input statement. Until input executes the temporary variable isn't set. And is the case will all these infile variables the Filetemp will not be in the output.
Fix it upstream. Fix your import so the types and lengths match.
Basically read the file in with a data step rather than proc import.
In data terms this statement:
All the tables have the same columns, but different data types and length.
is null content. If they have the "same columns" then they should have the same data type.
Or if your "different" data types mean that one has stuff that looks like numbers but isn't actually such as Zipcodes, account numbers or product id's then read them all as text. Set the length to the largest you expect (and I usually add 5 to 10 more characters) just in case.
Then read ALL of the files with the same program. If the object is to read many files into a single data set then you can use some options in the data step code.
The option EOV can get you through this quite easily. Here is a skeleton of code to demostrate.
data want;
infile "path\*.CSV" lrecl=32767
firstobs=2 dsd eov=skip missover;
/* informat and format statements go
here when I write code
*/
input @;
if SKIP then SKIP=0;
Else do;
/* input and any manipulation code goes here*/
end;
run;
I am assuming that your data has a header row, the firstobs=2 skips that. The first input holds the line to see if it is the first of a new file. The option EOV sets the variable named SKIP to 1 if the first line of the next file. If so reset the value of the varaible and go to next line else read using your actual input statement.
Note that the variable SKIP will not be written into the output set.
The variable SKIP is created by
ADDRESS "numbers" are not numbers. If you are not going to do math with them, such as SUM or MEAN, set them as character.
And yes read them with a data step as that is basically the only way you are going to avoid the mismatched type and length issue.
When I read address data if I am lucky enough to have a separate "number" field I set it to be $25 because I know I'm going to get stuff like: Rural Route 23 or "Trailer Space #35".
And the example code I provided with suitable INFormat and INPUT statements will read ALL of the CSV files in the infile at one time.
Thank you so much for all your help ballardw.
Is there a way to add the file name to each row or is there a way to know which row came from which csv file?
I'm going to update my code with your suggestions. I will update back on my progress.
The INFILE option FILENAME will capture the input file information. Since the automatic variable will not be saved you need to add an instruction to do that explicitly and since the variable will be character you should specify a value long enough to store the expected results. A skelton modifying the previous post would look like:
data want;
length FileKeep FileTemp $ 200;
infile "path\*.CSV" lrecl=32767
firstobs=2 dsd eov=skip missover
Filename=FileTemp;
/* informat and format statements go
here when I write code
*/
input @;
FileKeep = FileTemp;
if SKIP then SKIP=0;
Else do;
/* input and any manipulation code goes here*/
end;
run;
The length needs to be set Before the variable is created else you only capture 8 characters by default. I use 200 as an example. If you KNOW that the length of your file path plus name will be less or more then adjust.
Note the assignment FileKeep= FileTemp is AFTER and input statement. Until input executes the temporary variable isn't set. And is the case will all these infile variables the Filetemp will not be in the output.
Thank you so much for your help.
That code has stacked the csv files and they are all the same type and length and it's much faster then proc import.
I appreciate all your help.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.