BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
chuakp
Obsidian | Level 7

Hello, I recently received an insurance claims dataset and a SAS read-in file provided by the vendor. The data are in 596 .gz files (which expand to CSV) in a single folder with path "M:/Data/data_2019". The .gz files have names like this:

 

part-00000-bb75630-d120.csv

part-00001-bb75630-d120.csv

....

part-00596-bb75630-d120.csv

 

I am running SAS 9.4 mount 2, which I believe doesn't have the native support for running gzip. Unfortunately, the vendor's read-in file recommends using the approach below, which relies on being able to run the gzip command. The code in the read-in file is pasted below (note I've truncated the number of variables read into the data step for brevity and replaced the path to M:/Data/data_2019. Note also that this file seemingly would combine all 596 GZ files into a single output file, which is what I want).


FILENAME f1 PIPE 'gzip -cd M:/Data/data_2019/*.gz' LRECL=800; 

 

DATA out.data_2019;
INFILE f1 DLM = '|' DSD END=EOF MISSOVER TRUNCOVER ;  

INFORMAT id $16. claimnumber $16.;

input  id claimnumber;

run;

 

An alternative would be to expand the 596 GZ files to 596 CSV files, but in that situation, I'm not sure how to modify the above code to read in all the CSVs in the folder. I'd appreciate any suggestions. Thanks.

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
chuakp
Obsidian | Level 7

Thanks Tom. Yes, when I tried to run the vendor's suggested read-in code, I got an error:

'gzip' is not recognized as an internal or external command, operable program or batch file.

 

Also, the data do all start on the first line - there are no headers.

 

I expanded the CSV and used your code for reading in the CSVs. It works, but it seems to read the files in a random order, e.g, file part 00540 before 00001. Is there a way to make it read it in a particular order?  Thanks.

View solution in original post

9 REPLIES 9
Tom
Super User Tom
Super User

Did the suggested method work?  If not why? Di did you get an error message?

Do you have a gzip command?  If not do you have some other command that can uncompress gzip files?

This code seems to assume the files do not have header rows.  Is that true?

Tom
Super User Tom
Super User

Let's answer the last question first as it is easiest.  To read all of the csv files in a directory you can just use the support of wildcard in the filename by the INFILE command.

So you could just fix the INFILE statement.  But you might also want to fix some of the other silly things in the example code.  There is hardly ever any reason to use the MISSOVER option and it is doing nothing in your existing INFILE statement since you have overridden it with the more useful TRUNCOVER option.

Your code seemed to be trying to use an INFORMAT statement as if it was intended to define the variables. Use a LENGTH statement (or the LENGTH= attribute of the ATTRIB statement) to define the lengths (and types) of your variables.  Use the INFORMAT statement just the attach any needed informats, like DATE.  Reading character strings or normal numbers does not require attaching any special informats to the variables.

data out.data_2019;
  infile 'M:\Data\data_2019\*.csv'  dsd dlm='|' truncover;  
  length id $16 claimnumber $16;
  input id claimnumber;
run;

 

chuakp
Obsidian | Level 7

Thanks Tom. Yes, when I tried to run the vendor's suggested read-in code, I got an error:

'gzip' is not recognized as an internal or external command, operable program or batch file.

 

Also, the data do all start on the first line - there are no headers.

 

I expanded the CSV and used your code for reading in the CSVs. It works, but it seems to read the files in a random order, e.g, file part 00540 before 00001. Is there a way to make it read it in a particular order?  Thanks.

Tom
Super User Tom
Super User

Why would the order the files are read matter? 

If there is some information that is only coded in the name then use the FILENAME= option of the INFILE statement to capture the names of the files.

 

chuakp
Obsidian | Level 7

They're already pre-sorted, which is why I would need to read them in a specific order.  

Tom
Super User Tom
Super User

@chuakp wrote:

They're already pre-sorted, which is why I would need to read them in a specific order.  


You can control the order by using a list of the filenames to drive the reading. 

For example if you have a dataset named HAVE with the variable named FILENAME containing the filenames to read.

data want;
  set have ;
  infile csv filevar=filename dsd dlm='|' truncover end=eof;
  do while (not eof);
    input .... ;
    output;
  end;
run;
sbxkoenk
SAS Super FREQ

Hello,

 

I would un-compress all .gz(ip) files into .csv files (in a new folder) and then apply this code from the documentation.

 

SAS® 9.4 and SAS® Viya® 3.5 Programming Documentation

Macro Language Reference

Example 1: Import All CSV Files That Exist within a Directory

https://go.documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=mcrolref&docsetTarget=...

 

Good luck,

Koen

 

sbxkoenk
SAS Super FREQ

After submitting the macro (mentioned) in my above post, you will have one dataset per .csv file. The datasets are named DSNx, where x is from the counter named &CNT.

 

To append all these datasets vertically, do this:

data myoutlib.ALL_DSNx;
 set work.DSN: ;
run;

Cheers,

Koen

 

Tom
Super User Tom
Super User

@sbxkoenk wrote:

After submitting the macro (mentioned) in my above post, you will have one dataset per .csv file. The datasets are named DSNx, where x is from the counter named &CNT.

 

To append all these datasets vertically, do this:

data myoutlib.ALL_DSNx;
 set work.DSN: ;
run;

Cheers,

Koen

 


Note it would be better to replace the PROC IMPORT used in that example with a data step that reads the CSV file as it was designed to be read (proc import will have to guess at how to define the variables).  You could also include a PROC APPEND step at that point to aggregate all of the individual files into one large dataset.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 3034 views
  • 0 likes
  • 3 in conversation