I have a data file and code to read it in, but my company's servers are structured quite differently from those at the company where the text file and SAS program originated, so I have to adapt the code to work within our system. The data came to me as a gzipped text file ("filename.txt.gz") because it's so enormous. It lives on a Linux server with SAS Grid capabilities, and I connect remotely to that server from my Windows machine to access the data. I can do this interactively in SAS EG, or I can batch-submit programs to the SAS Grid/Linux server using bash.
Complicating matters, of course, is that the data is so large that it's been broken into several txt.gz files: filename1.txt.gz, filename2.txt.gz, etc.
The SAS program I received to access the data says this:
FILENAME inmyfile PIPE 'gunzip -c /directory/filename.txt.gz';
DATA work.tempsasdata;
  INFILE inmyfile LRECL=4502 MISSOVER PAD;
  INPUT
    @0001 var1 $char11.
    @0012 var2 $char1.
    @0013 var3 $char1.
    /* (and so on) */
  ;
RUN;
When I run a PROC CONTENTS on my resulting dataset, "tempsasdata," I see that it has all the right variables formatted exactly as the code intends, but the data isn't correctly input and there are only 1 or 2 obs, depending on the different adjustments I've tried in troubleshooting this. There should be over 200,000...
(The #4502 comes from the original code and is data-specific, so I'm not worried about that piece)
Code edits I've tried that haven't worked:
(1)
FILENAME inmyfile PIPE 'gunzip -c /linux_path/filename.txt.gz';
(2)
FILENAME inmyfile PIPE 'gunzip -c //linux_path/filename.txt.gz';
(3)
X "cd /linux_path/";
FILENAME inmyfile PIPE 'gunzip -c /filename.txt.gz';
(4)
X "cd /linux_path/";
FILENAME inmyfile PIPE 'gunzip -c filename.txt.gz';
(5)
X "cd /linux_path/";
FILENAME inmyfile PIPE 'gunzip -c /directory/filename.txt.gz';
For all of these, I've been running them interactively in SAS EG rather than batch submitting through PuTTY. I know my linux_path is correct because I don't get an error on that line, and I've used the X "cd /linux_path/"; approach successfully with other data on other projects. It's the zipped piece (or possibly the filename delineation: filename1, filename2, filenameX, and so on) that is causing me problems here, and other posts relating to gzipped files haven't been helpful thus far.
(Unfortunately I can't provide a data sample or more detailed filenames/paths because it's all very sensitive data and everything else is proprietary...)
Not sure what else to try or how to make this work. All ideas/theories/code appreciated!
Accepted Solutions
I should delete this entire thread. I did some additional debugging and found that the filename passed on from the programmer doesn't include the numeric suffixes (filename01, filename02; i.e., "01," "02," and so on), and that these are required to make it run. I was under the impression that the code as written would automatically read into SAS anything in the directory beginning with that filename. I thought maybe that's what PIPE did. But no. So I turned it into a macro and got it to work:
%MACRO inloop;
  %LOCAL i j;
  %DO i=1 %TO 20;
    /* Zero-pad the counter to two digits (01, 02, ..., 20) */
    %LET j=%SYSFUNC(PUTN(&i, Z2.));
    /* Filerefs are limited to 8 characters, hence inmyfile */
    FILENAME inmyfile PIPE "gunzip -c &directory/filename&j..txt.gz";
    DATA work.tempsasdata&j;
      INFILE inmyfile LRECL=4502 MISSOVER PAD;
      INPUT
        @0001 var1 $char11.
        @0012 var2 $char1.
        @0013 var3 $char1.
        /* (and so on) */
      ;
    RUN;
  %END;
%MEND inloop;
%INLOOP;
If you have SAS 9.4 Maint 5, you can get SAS to read the GZ files directly -- no X command necessary. See:
https://blogs.sas.com/content/sasdummy/2017/10/10/reading-writing-gzip-files-sas/
That might reduce the "moving parts" and make it easier to work regardless of which grid node you're on.
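A minimal sketch of that approach, assuming SAS 9.4M5 or later and one compressed file at a time. The path `/directory/filename01.txt.gz` is a placeholder pieced together from the names used earlier in this thread, and the INPUT layout is the same elided one from the original program.

```sas
/* Assumes SAS 9.4M5+; the path is a placeholder, not a real location */
filename inmyfile zip "/directory/filename01.txt.gz" gzip;

data work.tempsasdata01;
  infile inmyfile lrecl=4502 missover pad;
  input
    @0001 var1 $char11.
    @0012 var2 $char1.
    @0013 var3 $char1.
    /* (and so on) */
  ;
run;
```

Because SAS does the decompression itself, nothing here depends on gunzip being available on whichever grid node runs the step.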
No need to delete! INFILE can process wildcard specs, but only when pointing directly at files on disk, not through the PIPE / gunzip indirection. Others can learn from your question -- just accept your own resolution as the solution.
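For illustration only, here is roughly what a wildcard INFILE looks like when the files are plain text on disk. This sketch assumes uncompressed copies exist under a placeholder directory `/directory`, which is probably impractical for data this large; the variable names `source` and `source_file` are made up for the example.

```sas
/* Assumes uncompressed files; path and names are placeholders */
data work.tempsasdata;
  length source $256;
  /* The wildcard makes SAS read every matching file, one after another */
  infile "/directory/filename*.txt" lrecl=4502 truncover filename=source;
  input
    @0001 var1 $char11.
    @0012 var2 $char1.
    @0013 var3 $char1.
    /* (and so on) */
  ;
  /* FILENAME= variables are not written out, so copy the value to keep it */
  source_file = source;
run;
```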
Am I to understand from your comment that there's a more elegant solution? Because I do have to do this a lot and I'd love to know what it is!
Also, thank you.
Note that there is no need to use a macro for this. You can read all of the files in a single data step. You can control it either with a DO loop (instead of a %DO loop) or by reading the list of filenames and driving the step with that.
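data all_data ;
  /* List the compressed files; each line returned by ls is one path */
  infile "ls &directory/filename*.txt.gz" pipe truncover ;
  input filename $200.;
  /* Open the file named in FILENAME; EOF flags its last record */
  infile txt zip gzip filevar=filename truncover end=eof;
  do while (not eof);
    INPUT
      @0001 var1 $char11.
      @0012 var2 $char1.
      @0013 var3 $char1.
      /* (and so on) */
    ;
    OUTPUT;
  end;
run;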
If you are running an older version of SAS that does not support the GZIP option on the ZIP engine, then you could use the PIPE engine instead. Replace the second INFILE statement (the one with ZIP GZIP) with these two lines.
filename='gunzip -c '||quote(trim(filename)) ;
infile txt pipe filevar=filename truncover end=eof;
What's the "eof" bit about?
@jcinma wrote:
What's the "eof" bit about?
SAS normally ends a data step when it reads past the end of an input file with an INPUT or SET/MERGE/UPDATE statement.
So you need to stop SAS from reading past the end of any individual text file; otherwise the data step would end after reading only the first one.
This data step will end when the top input statement reads past the end of the list of file names.
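A small self-contained sketch of that behavior, using throwaway files written to the WORK directory. The file names (`demo1.txt`, `demo2.txt`, `names.txt`) and the variable `letter` are invented for the demo and have nothing to do with the real data.

```sas
/* Write two tiny data files plus a file that lists them */
%let dir = %sysfunc(pathname(work));

data _null_;
  file "&dir/demo1.txt";
  put 'a' / 'b';
  file "&dir/demo2.txt";
  put 'c' / 'd';
  file "&dir/names.txt";
  put "&dir/demo1.txt" / "&dir/demo2.txt";
run;

data demo;
  /* Top INPUT: one filename per iteration of the data step.
     The step ends when this INPUT reads past the end of names.txt. */
  infile "&dir/names.txt" truncover;
  input filename $260.;

  /* Inner loop: EOF turns on when the last record of the current file
     has been read, so the inner INPUT never reads past the end. */
  infile txt filevar=filename truncover end=eof;
  do while (not eof);
    input letter $1.;
    output;
  end;
run;
```

With the `do while (not eof)` guard in place, DEMO should pick up the records from both files. Swapping the loop for a bare INPUT would be one way to see the step stop early.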
I'm really liking the sound of this. But where/how do I write the list of filenames? Also, what if they're all the same but go up to 14 or something? My files are named like this:
In one instance I have something relatively basic:
julia01.txt.gz -- julia12.txt.gz
In other cases I have something more intricate like this:
jim01.file001.txt.gz -- jim01.file203.txt.gz
jim02.file001.txt.gz -- jim02.file098.txt.gz
--
jim14.file001.txt.gz -- jim14.file185.txt.gz
@jcinma wrote:
I'm really liking the sound of this. But where/how do I write the list of filenames? Also, what if they're all the same but go up to 14 or something? My files are named like this:
In one instance I have something relatively basic:
julia01.txt.gz -- julia12.txt.gz
In other cases I have something more intricate like this:
jim01.file001.txt.gz -- jim01.file203.txt.gz
jim02.file001.txt.gz -- jim02.file098.txt.gz
--
jim14.file001.txt.gz -- jim14.file185.txt.gz
I would just put all of the files in a single directory that contains only those files. Then the ls command can easily find all of them.
Or you can just put the list as in-line data instead of executing an LS command.
....
infile cards truncover;
....
cards;
julia01.txt.gz
julia02.txt.gz
julia03.txt.gz
julia04.txt.gz
;
Or you could write a program to generate a dataset of filenames.
data files ;
length filename $200;
do i=1 to 203 ;
filename=cats('jim01.file',put(i,z3.),'.txt.gz');
output;
end;
do i=1 to 98 ;
filename=cats('jim02.file',put(i,z3.),'.txt.gz');
output;
end;
run;
Then use a SET statement instead of INFILE/INPUT to drive the step that reads the files.
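A sketch of what that SET-driven step might look like, assuming the FILES dataset built above, SAS 9.4M5 or later for the ZIP/GZIP option, and a macro variable `&directory` holding the folder path. The variable `path` is introduced here only to prepend the folder to each filename.

```sas
/* Assumes WORK.FILES exists and &directory is defined (both placeholders) */
data all_data;
  set files;                       /* one filename per iteration */
  length path $400;
  path = catx('/', "&directory", filename);

  /* Same INFILE as the earlier reply, pointed at the full path */
  infile txt zip gzip filevar=path truncover end=eof;
  do while (not eof);
    input
      @0001 var1 $char11.
      @0012 var2 $char1.
      @0013 var3 $char1.
      /* (and so on) */
    ;
    output;
  end;
  drop i;                          /* loop counter carried in from FILES */
run;
```

Here the step ends when SET runs out of observations in FILES, and FILENAME stays on every output row as a record of which file it came from.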