Season
Barite | Level 11

Can SAS import .csv.gz files instead of CSV files? My CSV file is so large that I have to compress it to .csv.gz and import it directly into statistical software. In addition, the file is so large that I have to select columns and rows with specific features during the import, or else my computer would run out of memory. Can SAS do that?

ACCEPTED SOLUTION
Tom
Super User

SAS can READ a gzipped file. Use the ZIP engine with the GZIP option on the INFILE statement, or on the FILENAME statement if you are using one.

Something like this:

data want;
  infile 'myfile.csv.gz' zip gzip dsd truncover firstobs=2;
  input var1 :$10. var2 var3 :date. ..... ;
  format var3 date9.;
run;

If you need help with GUESSING how to read the file then use this macro: https://github.com/sasutils/macros/blob/master/csv2ds.sas

(Remember to also get the %parmv() macro from the same site).

%csv2ds
("myfile.csv.gz" zip gzip 
,out=want
,replace=yes
);
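
As a minimal sketch of the setup this call needs, assuming both macro source files have been downloaded from that repository to a local folder: the paths below are hypothetical placeholders, and the macro call itself is unchanged from the one above.

```sas
/* Hypothetical local paths: point these at wherever you saved the files */
%include '/path/to/macros/parmv.sas';   /* helper macro that csv2ds relies on */
%include '/path/to/macros/csv2ds.sas';  /* the guessing macro itself */

/* Same call as above, now that both macros are compiled in this session */
%csv2ds
("myfile.csv.gz" zip gzip
,out=want
,replace=yes
);
```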

 

24 REPLIES
Kurt_Bremser
Super User

Define a file reference with the compression first, then use it in a regular DATA step.

filename in zip "your path to the file" gzip;

data want;
infile in
  dsd
  lrecl=....
  dlm=","
  truncover
;
length
  /* define your variables here */
;
format
  /* set formats for variables that need it (dates, times, etc.) */
;
input
  /* list all variables */
;
keep
  /* list all variables you want to keep */
;
run;
Season
Barite | Level 11

Thank you so much for your solution! My questions are the same for both of you, so I will reply to one of the duo:

(1) Does it mean that in importing .csv.gz files, we have to specify the features of all variables manually?

(2) Is it possible to filter the observations in the course of importation instead of doing so post hoc, which is the usual practice? Can columns and rows both undergo filtering in the importation process?

Tom
Super User

CSV files do not have metadata that details their contents.  So you cannot directly IMPORT them like you would a table from a database.  Instead you need to write code to READ them, the same as you would for reading any other text file.

 

SAS does have a procedure called PROC IMPORT that you can use to help you GUESS how to read a CSV file.   Unfortunately, the code it uses to examine the file and make its guesses does not handle compressed files, even though the code it generates to actually read the file would.
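
One possible workaround, offered as a sketch rather than a tested recipe: copy the first lines of the compressed file to an uncompressed temporary file and let PROC IMPORT guess from that sample. The fileref SAMPLE, the output name SAMPLE_GUESS, and the 1000-line cutoff are all assumptions made for illustration, and the guesses only reflect the sampled lines, so lengths and types may need adjusting for the full file.

```sas
/* Temporary uncompressed file to hold a small sample (hypothetical fileref) */
filename sample temp;

/* Copy the header line plus the first data lines out of the gzipped file */
data _null_;
  infile 'myfile.csv.gz' zip gzip obs=1000 lrecl=32767;
  file sample lrecl=32767;
  input;
  put _infile_;
run;

/* Let PROC IMPORT guess from the sample; DBMS=CSV is needed because */
/* the temporary file has no .csv extension                          */
proc import datafile=sample dbms=csv out=sample_guess replace;
  guessingrows=max;
run;
```

PROC IMPORT writes the DATA step it generated to the SAS log, so that code can be copied and pointed at the compressed file with the ZIP and GZIP options.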

 

When you are READING in a file using a data step you can use the power of the data step to decide which observations are written out.  Look up the subsetting IF statement.
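
A minimal sketch of that idea, assuming a hypothetical four-column layout (ID, VISIT_DATE, AMOUNT, COMMENT are made-up names, not taken from the actual file). The subsetting IF discards rows as each line is read, and KEEP limits the columns written to the output dataset.

```sas
data want;
  infile 'myfile.csv.gz' zip gzip dsd truncover firstobs=2;
  /* Hypothetical layout: replace with the real variables and lengths */
  length id $10 visit_date 8 amount 8 comment $30;
  input id visit_date :date9. amount comment;
  format visit_date date9.;
  /* Subsetting IF: only rows passing this test are written out */
  if visit_date > '01JAN2025'd;
  /* Column selection: only these variables are stored in WANT */
  keep id visit_date amount;
run;
```

Every line of the file is still decompressed and parsed, but only the selected rows and columns end up in the output dataset.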

andreas_lds
Jade | Level 19

Please explain how you tried to read the file. 

 

I am not too familiar with gzip, but I doubt that reading a file without unzipping it is possible. At some point in the process the file needs to be uncompressed to allow SAS to read it line by line.

 

So, how large is the file? How many variables are there?

Season
Barite | Level 11
It is possible to read it in R. The compressed file is about 30 GB, with some 40 variables.
Tom
Super User

What guesses did R make for how to define the 40 variables?  You could use those to help you write the code.

Season
Barite | Level 11

A function is applied before importing the whole compressed file to read its first row, which stores the variable names. No such function seems to be available in SAS, is there?

Tom
Super User

Huh? 

If you just want to LOOK at the first line then write a data step to do that.

data _null_;
  infile 'myfile.csv.gz' zip gzip;
  input;
  put _infile_;
  stop;
run;

It is trivial to read the names from the first line of a text file into a dataset if you want.

data names ;
  length name $32 label $256 ;
  infile 'myfile.csv.gz' zip gzip dsd obs=1 ;
  input label @@ ;
  name = label;
run;

You might also want to use the LIST statement to get an idea what is in your file.  This step will dump the first 10 lines of the file to the SAS log.  The LIST statement will show the lines (and their length) and if any of the characters are non-printable it will also show the hexadecimal code for all of the characters in the line.

data _null_;
  infile 'myfile.csv.gz' zip gzip obs=10;
  input;
  list;
run;
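
As a rough sketch of how the NAMES dataset from the earlier step might be put to use, the names can be collected into a macro variable and pasted into a LENGTH, INPUT, or KEEP statement. The macro variable name VARLIST is an assumption, and this presumes the header text consists of valid SAS names (no spaces or special characters, at most 32 characters each).

```sas
/* Build a space-separated list of the column names, in file order */
data _null_;
  set names end=last;
  length list $32767;
  retain list;
  list = catx(' ', list, name);
  /* After the last name, store the list in a macro variable */
  if last then call symputx('varlist', list);
run;

/* Write the list to the log so it can be checked or copied */
%put &=varlist;
```

The list only supplies the names. The type and length of each variable still have to be decided separately.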
Season
Barite | Level 11
Thank you so much! That saves the time of manually typing the variable names in the DATA step.
Tom
Super User

Note that the NAME is the only metadata that a CSV file does have.  What it does not have is any information about what types of variables those names represent.  Are they character strings? If so, how long? Are they numbers? Perhaps dates?   That is where the guessing needs to happen.

Season
Barite | Level 11

Thank you for your reminder!

Tom
Super User

Once you have the code to read your file you could define it as a data step view and then use that for your filtering process of selecting particular observations or variables.

 

For example you might create a data step like this one to make the view.

data full_file / view=full_file;
  infile 'myfile.csv.gz' zip gzip dsd truncover firstobs=2;
  length var1 $10 var2 8 var3 8 .... varlast $30 ;
  input var1--varlast;
  format var3 date9.;
run;

Then if you wanted a subset with just some of the variables and only the observations where the date was after some cutoff you could do something like:

data want;
  set full_file;
  if var3 > '01JAN2025'd ;
  keep var1 var3 varlast;
run;
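
A view stores only the program, not the data, so the file is read when the second step runs and observations pass through the filter one at a time. The same selection can also be done in a single step. The sketch below assumes a hypothetical layout in which the date used for filtering comes early in each line (ID, VISIT_DATE, AMOUNT, COMMENT are made-up names). It reads the leading fields, holds the line with a trailing @, and skips rejected lines before the remaining fields are parsed.

```sas
data want;
  infile 'myfile.csv.gz' zip gzip dsd truncover firstobs=2;
  /* Hypothetical layout: replace with the real variables and lengths */
  length id $10 visit_date 8 amount 8 comment $30;
  format visit_date date9.;
  /* Read only the leading fields and hold the line with a trailing @ */
  input id visit_date :date9. @;
  /* Reject the line before the rest of it is parsed */
  if visit_date <= '01JAN2025'd then delete;
  /* Read the remaining fields for the lines that passed the test */
  input amount comment;
  keep id visit_date amount;
run;
```

Each line is still decompressed, so whether this saves noticeable time on a given file is something to measure rather than assume.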
Season
Barite | Level 11
So the selection process still takes place after the entire importation is done, right?
