- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello, I recently received an insurance claims dataset and a SAS read-in file provided by the vendor. The data are in 596 .gz files (which expand to CSV) in a single folder with path "M:/Data/data_2019". The .gz files have names like this:
part-00000-bb75630-d120.csv
part-00001-bb75630-d120.csv
....
part-00596-bb75630-d120.csv
I am running SAS 9.4 mount 2, which I believe doesn't have the native support for running gzip. Unfortunately, the vendor's read-in file recommends using the approach below, which relies on being able to run the gzip command. The code in the read-in file is pasted below (note I've truncated the number of variables read into the data step for brevity and replaced the path to M:/Data/data_2019. Note also that this file seemingly would combine all 596 GZ files into a single output file, which is what I want).
FILENAME f1 PIPE 'gzip -cd M:/Data/data_2019/*.gz' LRECL=800;
DATA out.data_2019;
INFILE f1 DLM = '|' DSD END=EOF MISSOVER TRUNCOVER ;
INFORMAT id $16. claimnumber $16.;
input id claimnumber;
run;
An alternative would be to expand the 596 GZ files to 596 CSV files, but in that situation, I'm not sure how to modify the above code to read in all the CSVs in the folder. I'd appreciate any suggestions. Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Tom. Yes, when I tried to run the vendor's suggested read-in code, I got an error:
'gzip' is not recognized as an internal or external command, operable program or batch file.
Also, the data do all start on the first line - there are no headers.
I expanded the CSV and used your code for reading in the CSVs. It works, but it seems to read the files in a random order, e.g, file part 00540 before 00001. Is there a way to make it read it in a particular order? Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Did the suggested method work? If not why? Di did you get an error message?
Do you have a gzip command? If not do you have some other command that can uncompress gzip files?
This code seems to assume the files do not have header rows. Is that true?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Let's answer the last question first as it is easiest. To read all of the csv files in a directory you can just use the support of wildcard in the filename by the INFILE command.
So you could just fix the INFILE statement. But you might also want to fix some of the other silly things in the example code. There is hardly ever any reason to use the MISSOVER option and it is doing nothing in your existing INFILE statement since you have overridden it with the more useful TRUNCOVER option.
Your code seemed to be trying to use an INFORMAT statement as if it was intended to define the variables. Use a LENGTH statement (or the LENGTH= attribute of the ATTRIB statement) to define the lengths (and types) of your variables. Use the INFORMAT statement just the attach any needed informats, like DATE. Reading character strings or normal numbers does not require attaching any special informats to the variables.
data out.data_2019;
infile 'M:\Data\data_2019\*.csv' dsd dlm='|' truncover;
length id $16 claimnumber $16;
input id claimnumber;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks Tom. Yes, when I tried to run the vendor's suggested read-in code, I got an error:
'gzip' is not recognized as an internal or external command, operable program or batch file.
Also, the data do all start on the first line - there are no headers.
I expanded the CSV and used your code for reading in the CSVs. It works, but it seems to read the files in a random order, e.g, file part 00540 before 00001. Is there a way to make it read it in a particular order? Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Why would the order the files are read matter?
If there is some information that is only coded in the name then use the FILENAME= option of the INFILE statement to capture the names of the files.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
They're already pre-sorted, which is why I would need to read them in a specific order.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@chuakp wrote:
They're already pre-sorted, which is why I would need to read them in a specific order.
You can control the order by using a list of the filenames to drive the reading.
For example if you have a dataset named HAVE with the variable named FILENAME containing the filenames to read.
data want;
set have ;
infile csv filevar=filename dsd dlm='|' truncover end=eof;
do while (not eof);
input .... ;
output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I would un-compress all .gz(ip) files into .csv files (in a new folder) and then apply this code from the documentation.
SAS® 9.4 and SAS® Viya® 3.5 Programming Documentation
Macro Language Reference
Example 1: Import All CSV Files That Exist within a Directory
Good luck,
Koen
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
After submitting the macro (mentioned) in my above post, you will have one dataset per .csv file. The datasets are named DSNx
, where x
is from the counter named &CNT.
To append all these datasets vertically, do this:
data myoutlib.ALL_DSNx;
set work.DSN: ;
run;
Cheers,
Koen
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@sbxkoenk wrote:
After submitting the macro (mentioned) in my above post, you will have one dataset per .csv file. The datasets are named DSN
x
, wherex
is from the counter named &CNT.
To append all these datasets vertically, do this:
data myoutlib.ALL_DSNx; set work.DSN: ; run;
Cheers,
Koen
Note it would be better to replace the PROC IMPORT used in that example with a data step that reads the CSV file as it was designed to be read (proc import will have to guess at how to define the variables). You could also include a PROC APPEND step at that point to aggregate all of the individual files into one large dataset.