jcinma
Fluorite | Level 6

I have a data file and code to read it in, but my company's servers are structured quite differently from those at the company where the text file and SAS program originated, so I have to adapt the code to work within our system. The data came to me as a gzipped text file ("filename.txt.gz") because it's so enormous. It lives on a Linux server with SAS Grid capabilities, and I connect remotely to that server from my Windows machine to access the data. I can do this interactively in SAS EG, or I can batch submit programs to the SAS Grid/Linux server using bash.

 

Complicating matters, of course, is that the data is so large that it's been broken into several txt.gz files: filename1.txt.gz, filename2.txt.gz, etc.

 

The SAS program I received to access the data says this:

 

FILENAME inmyfile PIPE 'gunzip -c /directory/filename.txt.gz';

DATA work.tempsasdata;
  INFILE inmyfile LRECL=4502 MISSOVER PAD;
  INPUT
      @0001  var1  $char11.
      @0012  var2  $char1.
      @0013  var3  $char1.

      /* (and so on) */

              ;
RUN;

 

When I run a PROC CONTENTS on my resulting dataset, "tempsasdata," I see that it has all the right variables formatted exactly as the code intends, but the data isn't correctly input and there are only 1 or 2 obs, depending on the different adjustments I've tried in troubleshooting this. There should be over 200,000...

 

(The #4502 comes from the original code and is data-specific, so I'm not worried about that piece)

 

Code edits I've tried that haven't worked:

 

(1)

FILENAME inmyfile PIPE 'gunzip -c /linux_path/filename.txt.gz';


(2)

FILENAME inmyfile PIPE 'gunzip -c //linux_path/filename.txt.gz';


(3)

X "cd /linux_path/";
FILENAME inmyfile PIPE 'gunzip -c /filename.txt.gz';   


(4)

X "cd /linux_path/";
FILENAME inmyfile PIPE 'gunzip -c filename.txt.gz';   


(5)

X "cd /linux_path/";
FILENAME inmyfile PIPE 'gunzip -c /directory/filename.txt.gz';

 

 

For all of these, I've been running them interactively in SAS EG, not batch submitting through PuTTY. I know my linux_path is correct because I don't get an error on that line, and I've used the X "cd /linux_path/"; approach successfully with other data on other projects. It's the zipped piece (or possibly the filename delineation: filename1, filename2, filenameX, and so on) that is causing me problems here, and other posts relating to gunzipped files haven't been helpful thus far.

 

(Unfortunately I can't provide a data sample or more detailed filenames/paths because it's all very sensitive data and everything else is proprietary...)

 

Not sure what else to try or how to make this work. All ideas/theories/code appreciated!

1 ACCEPTED SOLUTION

Accepted Solutions
jcinma
Fluorite | Level 6

I should delete this whole entire thread. I tried some additional debugging and found that the filename passed on from the programmer doesn't include the appended suffixes (filename01, filename02, i.e. "01," "02," and so on), and that these are required to make it run. I was under the impression that the code as written was supposed to automatically read into SAS anything in the directory beginning with that filename. I thought maybe that's what PIPE did. But no. So I turned it into a macro and got it to work:

%MACRO inloop;
   %LOCAL i j;
   %DO i=1 %TO 20;
      /* Zero-pad the counter to two digits: 01, 02, ..., 20 */
      %LET j=%SYSFUNC(PUTN(&i, Z2.));

      FILENAME inmyfile PIPE "gunzip -c &directory/filename&j..txt.gz";

      DATA work.tempsasdata&j;
         INFILE inmyfile LRECL=4502 MISSOVER PAD;

         INPUT
            @0001  var1  $char11.
            @0012  var2  $char1.
            @0013  var3  $char1.

            /* (and so on) */

                       ;
      RUN;

   %END;
%MEND inloop;

%INLOOP;

 


9 REPLIES 9
ChrisHemedinger
Community Manager

If you have SAS 9.4 Maint 5, you can get SAS to read the GZ files directly -- no X command necessary.  See:

 

https://blogs.sas.com/content/sasdummy/2017/10/10/reading-writing-gzip-files-sas/

 

That might reduce the "moving parts" and make it easier to work regardless of which grid node you're on.
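A minimal sketch of that direct approach, assuming SAS 9.4M5 or later. The path and the "01" suffix are placeholders carried over from the question, not real file names, and TRUNCOVER is used here in place of the original MISSOVER PAD:

```sas
/* Assumption: SAS 9.4M5+, where the ZIP access method accepts the GZIP option. */
/* The path below is a placeholder taken from the question.                     */
filename inmyfile zip "/directory/filename01.txt.gz" gzip;

data work.tempsasdata01;
  infile inmyfile lrecl=4502 truncover;
  input
    @1   var1  $char11.
    @12  var2  $char1.
    @13  var3  $char1.
    /* (and so on) */
  ;
run;
```

Because SAS decompresses the file itself, nothing depends on gunzip being available to the shell on whichever grid node runs the step.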


ChrisHemedinger
Community Manager

No need to delete! INFILE can process wildcard specs, but only when pointing directly at files on disk, not through the PIPE / gunzip indirection.  Others can learn from your question -- just accept your own resolution as the solution.
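To illustrate the difference, here is a rough sketch of both cases. It assumes a macro variable &directory that holds the folder path (as in the macro posted in this thread). The first step also assumes uncompressed copies of the files exist, which is not the case in the original question. The second step leans on the Linux shell rather than on SAS: the PIPE command is run by the shell, which expands the wildcard, and gunzip -c writes each file to standard output in turn.

```sas
/* Case 1 (assumes uncompressed .txt copies on disk): INFILE expands the wildcard itself. */
data work.all_plain;
  infile "&directory/filename*.txt" lrecl=4502 truncover;
  input
    @1   var1  $char11.
    @12  var2  $char1.
    @13  var3  $char1.
    /* (and so on) */
  ;
run;

/* Case 2: the shell, not SAS, expands the wildcard inside the piped command. */
filename inall pipe "gunzip -c &directory/filename*.txt.gz";

data work.all_gz;
  infile inall lrecl=4502 truncover;
  input
    @1   var1  $char11.
    @12  var2  $char1.
    @13  var3  $char1.
    /* (and so on) */
  ;
run;
```

The second form reads everything as one continuous stream, so the step cannot tell which file a given record came from, and it assumes every file ends with a newline.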

jcinma
Fluorite | Level 6

Am I to understand from your comment that there's a more elegant solution? Because I do have to do this a lot and I'd love to know what it is!

 

Also, thank you.

Tom
Super User

Note that there is no need to use a macro for this. You can read all of the files using a single data step.  You can either control it with a DO loop (instead of a %DO loop) or read the list of filenames and control it with that.

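data all_data ;
  /* Outer read: one file name per data step iteration, from the ls listing */
  infile "ls &directory/filename*.txt.gz" pipe truncover ;
  input filename $200.;
  /* FILEVAR= points the second INFILE at the file named in FILENAME */
  infile txt zip gzip filevar=filename truncover end=eof;
  do while (not eof);
      INPUT
         @0001  var1  $char11.
         @0012  var2  $char1.
         @0013  var3  $char1.
         /* (and so on) */
      ;
      OUTPUT;
  end;
run;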

If you are running an older version of SAS that does not support the GZIP option on the ZIP engine, then you could use the PIPE engine instead.  Replace the second INFILE statement with these two lines.

  filename='gunzip -c  '||quote(trim(filename)) ;
  infile txt pipe filevar=filename truncover end=eof;

 

jcinma
Fluorite | Level 6

What's the "eof" bit about?

Tom
Super User


The way that SAS normally ends a data step is when it reads past the end of an input file with an INPUT or SET/MERGE/UPDATE statement.

 

So you want to prevent SAS from reading past the end of any of the text files to prevent it from ending the data step after reading only one text file. 

 

This data step will end when the top input statement reads past the end of the list of file names.

jcinma
Fluorite | Level 6

I'm really liking the sound of this. But where/how do I write the list of filenames? Also, what if they're all the same but go up to 14 or something? My files are named like this:

 

In one instance I have something relatively basic:

julia01.txt.gz -- julia12.txt.gz

 

In other cases I have something more intricate like this:

jim01.file001.txt.gz -- jim01.file203.txt.gz

jim02.file001.txt.gz -- jim02.file098.txt.gz

--

jim14.file001.txt.gz -- jim14.file185.txt.gz

Tom
Super User

I would just put all of the files in a single directory that contains only those files.  Then it is easy to find all of the files with the ls command.

 

Or you can just put the list as in-line data instead of executing an ls command.

....
infile cards truncover;
....
cards;
julia01.txt.gz
julia02.txt.gz
julia03.txt.gz
julia04.txt.gz
;

Or you could write a program to generate a dataset of filenames.

data files ;
 length filename $200;
 do i=1 to 203 ;
  filename=cats('jim01.file',put(i,z3.),'.txt.gz');
  output;
 end;
 do i=1 to 98 ;
  filename=cats('jim02.file',put(i,z3.),'.txt.gz');
  output;
 end;
run;

And use a SET statement instead of INFILE/INPUT to drive the step that reads the files.
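A sketch of what that SET-driven step might look like, assuming the FILES dataset built above, a macro variable &directory holding the folder path, and SAS 9.4M5 or later for the GZIP option. The variable name PATH is introduced here only for illustration:

```sas
data all_data;
  set files;                               /* one observation per file name */
  length path $300;                        /* hypothetical helper variable */
  path = cats("&directory/", filename);    /* assumes &directory is defined */
  infile txt zip gzip filevar=path truncover end=eof;
  do while (not eof);                      /* stop at the end of the current file */
    input
      @1   var1  $char11.
      @12  var2  $char1.
      @13  var3  $char1.
      /* (and so on) */
    ;
    output;
  end;
  drop i;                                  /* loop counter carried over from FILES */
run;
```

The step ends when SET runs out of observations in FILES. Since FILENAME comes from the SET statement, it stays on every output record and shows which file the record came from, while PATH, as the FILEVAR= variable, is not written to the output dataset.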
