BookmarkSubscribeRSS Feed
dustychair
Pyrite | Level 9

Hi all,

I have 40 zipped folders and each of them has one csv file. 

How can I read all of them in SAS? (there are some examples but in those examples I have to type folders' names. But since there are 40 folders it's going to be hard to do that.)

 

Thank you,

 

13 REPLIES 13
Tom
Super User Tom
Super User

Not sure what a "zipped folder" is.  But if you have 40 zip files you want to read then you will need to either type their names in somehow or extract their names from the operating system somehow.   If the names follow a pattern you might be able to build the names with less typing, but you should probably test the resulting name for existence before trying to use it.  There are plenty of posts on this forum and in the SAS documentation about how to get a list of files from the operating system.

 

Do all of the CSV files have the same structure?  If so then just read them all at once.

data names;
  input filename :$100.;
  filename=cats('c:\downloads\',filename);
cards;
test.zip
test_zip2.zip
;

data want;
  set names;
  infile csv zip filevar=filename member='*' dsd firstobs=2 truncover end=eof;
  do while(not eof);
    input line $10.;
    output;
  end;
run;

 

Example:

NOTE: The infile CSV is:
      Filename=c:\downloads\test.zip,
      Member Name=*

NOTE: The infile CSV is:
      Filename=c:\downloads\test_zip2.zip,
      Member Name=*

NOTE: 3576 records were read from the infile CSV.
      The minimum record length was 0.
      The maximum record length was 32767.
NOTE: 679 records were read from the infile CSV.
      The minimum record length was 18.
      The maximum record length was 28714.
NOTE: There were 2 observations read from the data set WORK.NAMES.
NOTE: The data set WORK.WANT has 4255 observations and 1 variables.
sbxkoenk
SAS Super FREQ

Hello,

 

Go to https://blogs.sas.com/content

Enter as search terms these two: filename zip

And look at the entries written by @ChrisHemedinger (there are multiple !) in the search results.

 

Thanks,

Koen

dustychair
Pyrite | Level 9

Hi all,

I am still struggling with unzipping and reading zipped files. As I mentioned before I have 40 zipped folders in a regular (unzipped) folder. I am using @ChrisHemedinger 's example and changing the first statement since I have so many folders. I am not getting any results. Could you please give me an idea how to read zipped 40 folders.

 

Thanks,

 

filename inzip ZIP "c:\projects\*.zip";
 
/* Read the "members" (files) from the ZIP file */
data contents(keep=memname isFolder);
 length memname $200 isFolder 8;
 fid=dopen("inzip");
 if fid=0 then
  stop;
 memcount=dnum(fid);
 do i=1 to memcount;
  memname=dread(fid,i);
  /* check for trailing / in folder name */
  isFolder = (first(reverse(trim(memname)))='/');
  output;
 end;
 rc=dclose(fid);
run;
 

 

Tom
Super User Tom
Super User

What is the goal of this process?  Do you just want find the names of the files that the ZIP files contain?  Do you want extract them into actual files?  Do you want to read the contents?  If you want to read the contents of the files that have been compressed inside of ZIP files then you need to supply more information about what you expect to find in those ZIP files.  Does each ZIP file have only one actual file in it?  Does it have multiple files?  What types of files are they?

 

Your example code cannot work.  You need to treat each ZIP file as if it was a directory to read the names of the files it contains.

So you cannot use *.zip in the FILENAME statement.

You need a two step process.  First find the list of files in the real directory.  Then for each file that is a ZIP file you can then open that FILE as if it was a DIRECTORY to find the names of the files that it contains.

 

 

dustychair
Pyrite | Level 9
My purpose is not to find out the content. I want to read the files in SAS and do some analysis on them. So I want to save them as SAS data files. I know I need two steps, that was my first step but I failed. No idea how to do that
Tom
Super User Tom
Super User

@dustychair wrote:
My purpose is not to find out the content. I want to read the files in SAS and do some analysis on them. So I want to save them as SAS data files. I know I need two steps, that was my first step but I failed. No idea how to do that

So do all of the CSV files have the same structure? Do you know the structure of the CSV files?  Or are you going to have to use PROC IMPORT or some other tool to GUESS how to read the CSV files?

 

The code you just posted is close to what you need to file the list of ZIP files.  But you want to start from the actual directory and just generate the list of ZIP files that you find in that directory.  So here is code to get the list of ALL of the files in C:\project\ directory.  You might want to filter that list to just those with names that end with .zip or .ZIP.

filename folder "c:\projects\";
data names(keep=fileno filename);
  fid=dopen('folder');
  if fid=0 then stop;
  do fileno=1 to dnum(fid);
    length filename memname $256;
    memname=dread(fid,fileno);
    filename=catx('\',pathname('folder'),memname);
    output;
  end;
  rc=dclose(fid);
run;

Earlier in this thread I posted code showing how to read the data from multiple ZIP files when starting with a list of the ZIP files.  So try combining those two steps.

dustychair
Pyrite | Level 9
Hmm. I think I am very close but still not there. I am creating all_g5 file and it has only two columns. The first column is fileno and the second one is line. Here is what I have
filename folder "c:\projects\";
data names(keep=fileno filename);
fid=dopen('folder');
if fid=0 then stop;
do fileno=1 to dnum(fid);
length filename memname $256;
memname=dread(fid,fileno);
filename=catx('\',pathname('folder'),memname);
output;
end;
rc=dclose(fid);
run;

data all_g5;
set contents;
infile csv zip filevar=filename member='*' dsd firstobs=2 truncover end=eof;
do while(not eof);
input line $10.;
output;
end;
run;
Tom
Super User Tom
Super User

So it looks like it worked then.

You lost the FILENAME variable because you used it as the FILEVAR= option. SAS always drops such variables.  You can save its value by assigning it to another variable.

To create more variables just flesh out the INPUT statement to match the variables that are actually in your CSV files.

Since you have not shown any examples of what those files contain we cannot help you with that part.

dustychair
Pyrite | Level 9

@Tom Sorry, I did not understand which part to change. in the code Here is an example of the file that I am getting:
fileno line
1 NJnjssf21
1 NJnjssf21
1 NJnjssf21
1 NJnjssf21

Tom
Super User Tom
Super User

That looks part of the output SAS dataset.

What you need to look at is the actual CSV files so you can tell what they have in them. Once you know what they contain you can modify the data step to read the content into meaningful variables in a SAS dataset.

 

For example you could modify that last datastep that instead of trying to create a dataset it just writes the beginning of the files to the SAS log so you can look at what they contain. 

 

data _null_;
  set contents;
  infile csv zip filevar=filename member='*' dsd firstobs=1 truncover end=eof;
  put fileno= filename= ;
  do lineno=1 to 3 while(not eof);
       input ;
       put lineno= _infile_;
  end;
  put;
run;

Then you can figure out if the look like the all have the some information. Or if you need to create a different dataset for every CSV file. 

Questions you should figure out:

does the first line in each file have column headers?

Are the column headers the same in everyfile?  Are they in the same order?

Do the values look the same in the variables across the files?

How many columns are in the file?

What type of data does each column contain?

Tom
Super User Tom
Super User

@dustychair wrote:
My purpose is not to find out the content. I want to read the files in SAS and do some analysis on them. So I want to save them as SAS data files. I know I need two steps, that was my first step but I failed. No idea how to do that

If you have to GUESS what is in the files then you could try to use PROC IMPORT.  But it does not work well with ZIP files.  So to use PROC IMPORT you will first have to expand the ZIP into a real file.

You could try using this macro instead https://github.com/sasutils/macros/blob/master/csv2ds.sas as it can read a zip file directly.

dustychair
Pyrite | Level 9
Thank you for your response @Tom . Yes, the first line in each file has headers. Those headers are not the same in each file. Most of them in the same order. Some columns have numeric some have string values.

Also i am getting this error:

NOTE: The variable filename exists on an input data set, but was also specified in an I/O statement
option. The variable will not be included on any output data set.
ERROR: Open failure for C:\Users\Projects\ABDC_212.csv.zip during attempt to create a local file handle.
Tom
Super User Tom
Super User

So your problem just got 40 times harder.  Instead of reading 40 files with a structure you know you now have to write code to read 40 files with unknown structures.  

 

The first error is nothing, just telling you something you already knew from your earlier runs. SAS does not keep the variable named in the FILEVAR= option.

 

The second might just be an artifact of trying to open all 40 files at once. Hopefully it will not happen again when you try to read each file separately.

 

So the simplest first step is to read each one into its own dataset.  You can then try to figure out if there are some that can be combined.  And if you need to change the GUESSes that the code used to read the CSV made about how to define the variables so that they are consistent across the datasets.  For example for a character variable PROC IMPORT (or other tools for guessing how to read a CSV file) will set the length of the variable long enough to hold the values it sees in that one file.  So if if the same variable, like NAME or ADDRESS, appears in multiple files then each dataset could end up with the variable defined with different lengths.  Which will cause trouble if you need to combine the datasets.

 

You can use your list of files to generate code to read each one.  

You will need to decide what code to generate.  You will need to decide on a name for the dataset for each file.  If you are lucky the names of the files themselves can be used.  So if, like the example name in your error message, the names are less than 32 character, consist only of letter, digits and underscores, and do not start with a digit, then you can use the base name of the file as the dataset name.

 

Since some (all?) of the files are ZIP files it might be better to generate code to use the %CSV2DS() macro.  So first get the macro definition (and the definition of the %PARMV() macro it uses).

filename csv2ds url "https://raw.githubusercontent.com/sasutils/macros/master/csv2ds.sas";
filename parmv url "https://raw.githubusercontent.com/sasutils/macros/master/parmv.sas";
%include csv2ds;
%include parmv;

Now use your dataset to drive generating code.  So if your dataset with the filenames is named CONTENTS and the variable with the filename is named FILENAME then the code to write code to a file might look like this:

filename code temp;
data _null_;
  set contents;
  file code ;
  dsname=scan(filename,-1,'/\');
  dsname=scan(dsname,1,'.');
  put '%csv2ds(' filename :$quote. @;
  if lowcase(scan(filename,-1,'.'))='zip' then put "zip member='*'" @;
  put ',out=' dsname ',replace=1)';
run;

Now you can look at that file and see if it looks ok.

data _null_;
  infile code;
  input;
  put _infile_;
run;

So you should see lines like this:

%csv2ds("C:\Users\Projects\ABDC_212.csv.zip" zip member='*',out=ABCD_212,replace=1)

If so then try running it

%incldue code / source2;

If you want to make permanent datasets instead of WORK dataset then add a libref into the code generated so you get something like:

%csv2ds("C:\Users\Projects\ABDC_212.csv.zip" zip member='*',out=mylib.ABCD_212,replace=1)

where mylib is the name of the libref you created that points to where you want to write the SAS datasets.

 

If you want to use PROC IMPROT to guess how to read the files then you will need to add a step to extract the files from the ZIP files before pointing PROC IMPORT at the extracted file.

 

 

 

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 2594 views
  • 5 likes
  • 3 in conversation