keep every nth record while reading in binary data file

New Contributor
Posts: 4

Suppose I have a very large (multiple GB) binary data file that I want to read into SAS (using 9.3 at the moment). I don't want to read in the entire file; rather, I want to read in every nth record.

 

For simple data files (ASCII), this can be done fairly easily using the # line pointer control. For example, suppose I have a file called test.dat containing 3 data fields/columns (x, y and z). The following reads in every 5th record:

 

filename in 'c:\users\userDesktop\test.dat';

data hold;
   infile in;
   input #5 x y z;
run;
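For reference, here is a self-contained sketch of what the # line pointer does in this situation, using in-stream data instead of test.dat (the DATALINES values are made up for illustration). Because #5 is the highest line pointer in the INPUT statement, SAS treats each group of five lines as one observation and takes x, y and z from the fifth line of each group.

```sas
/* Illustration only: ten made-up input lines, read five lines per observation. */
data hold;
   input #5 x y z;   /* values come from line 5 of each 5-line group */
   datalines;
1 10 100
2 20 200
3 30 300
4 40 400
5 50 500
6 60 600
7 70 700
8 80 800
9 90 900
10 100 1000
;
run;

proc print data=hold;
run;
```

This should leave two observations, taken from the 5th and 10th input lines.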

 

 

Works fine. But, for some reason, if test.dat is a binary file, this approach doesn't seem to work. To read in the particular binary data file, I use something like the following input syntax:

 

input buffer1 RB4. Chain1 RB8. Chain2 RB8.;

 

Works fine. However, 

 

input #5 buffer1 RB4. Chain1 RB8. Chain2 RB8.;

 

doesn't work as expected (or really, at all...). 

 

I know I could probably do this in 2 steps: (i) read in the full binary file, and then (ii) use some 'tricks' to subset the data and keep only every nth record, but the original file is so large that I'm trying to avoid having to read the entire thing in the first place.

 

Suggestions/pointers to the obvious are welcomed. 


All Replies
Super User
Posts: 5,081

Re: keep every nth record while reading in binary data file

Until somebody else comes up with a better idea ... there are other ways to skip over lines when inputting data.  One possibility:

 

input //// buffer1 RB4. Chain1 RB8. Chain2 RB8.;

 

Another:

 

input;
input;
input;
input;
input buffer1 RB4. Chain1 RB8. Chain2 RB8.;

 

While this looks silly, it's easier to adapt if you want to read every 100th line instead of every 5th line.  For example:

 

do i=1 to 99;
   input;
end;
input buffer1 RB4. Chain1 RB8. Chain2 RB8.;

New Contributor
Posts: 4

Re: keep every nth record while reading in binary data file

Neat, except it doesn't work with binary files -- following from the log:

 

The '/' INPUT/PUT statement option is inconsistent with binary mode I/O. The execution of
the DATA STEP is being terminated.

SAS Super FREQ
Posts: 683

Re: keep every nth record while reading in binary data file

Are you using RECFM=N to read the file? If so, there are no records, as the file is just one long string of bytes.

Does the file have just 3 data values, with this group of values repeating until the end of the file, or are there many more different data values? Do you have a record layout for the data values?

If you run the following code (with your file), what does the log look like?

data test;
  infile "yourfile" ;
  input char $1.;
run;
New Contributor
Posts: 4

Re: keep every nth record while reading in binary data file

Here is how I read in this particular data file. &MCMCfile and &recl are macro variables set earlier in the program. Note that I'm using recfm=N, but I explicitly set the linesize, which 'forces' records (i.e., splits the big string into discrete lines):

 

 

data MCMC;
  infile &MCMCfile linesize=&recl recfm=N;
  input buffer1 RB4. Chain1 RB8. Chain2 RB8. Chain3 RB8. buffer2 RB4.;
  drop buffer1 buffer2;
run;

 

Each line (record) contains output from an MCMC sampler at each step (so, same number of 'variables' per line, different values for each variable). 

 

As per my original post, the following 2-step procedure works: (i) pull in the big binary file (into a data set I call MCMC), then (ii) subset it, keeping every nth record. For step (ii), I simply use the following (one of a couple of approaches that would probably work), where thin is a macro variable I set earlier in the program.

 

***************************
* thin the data set....   *
***************************;

data MCMC;
  do point = &thin to nobs by &thin;
    set MCMC point=point nobs=nobs;
    output;
  end;
  stop;
run;

 

This works fine, but as per OP, seems annoyingly inefficient, since I'm basically taking 2 steps for something I'd like to do in 1 step (during the infile stage). 

 

 

Solution
‎06-20-2017 09:37 PM
Super User
Posts: 6,499

Re: keep every nth record while reading in binary data file


Just use a conditional OUTPUT statement. "Reading" every record doesn't really cost much extra, since you are not doing direct access to the file anyway. You could use +4 to skip over the fields you are dropping.

So to keep the 1st, 6th, 11th, ... record you could do this:

 

data MCMC;
  infile &MCMCfile linesize=&recl recfm=N;
  input +4 Chain1 RB8. Chain2 RB8. Chain3 RB8. +4 ;
  if mod(_n_,5) = 1 then output;
run;
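The same step can be tied to the thinning interval instead of a hard-coded 5. The sketch below assumes the macro variables &MCMCfile, &recl and &thin are already defined, as described earlier in the thread; the %let line is only a placeholder value. Testing for a remainder of 0 keeps records thin, 2*thin, 3*thin, ..., which matches the selection made by the POINT= thinning step shown above.

```sas
%let thin = 5;   /* placeholder value; use the interval set earlier in your program */

data MCMC;
  infile &MCMCfile linesize=&recl recfm=N;
  input +4 Chain1 RB8. Chain2 RB8. Chain3 RB8. +4;
  /* keep records thin, 2*thin, 3*thin, ... and discard the rest */
  if mod(_n_, &thin) = 0 then output;
run;
```

Use a remainder of 1 instead of 0 if the kept records should start with the first one.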

 

New Contributor
Posts: 4

Re: keep every nth record while reading in binary data file

This worked perfectly -- I thought I'd tried something like this, but apparently, messed something up in the attempt. Thanks very much.
Super User
Posts: 6,936

Re: keep every nth record while reading in binary data file

If you know that the binary file has a fixed record length, use recfm=f and a proper lrecl. Then you can read every record and only do the output when mod(_n_,5) = 0.
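A minimal sketch of this suggestion, assuming each record is exactly 32 bytes (4 + 8 + 8 + 8 + 4, taken from the INPUT layout posted earlier) and that &MCMCfile and &thin are the macro variables mentioned earlier in the thread. The record length is an assumption; substitute the real value if it differs. The trailing @ holds each record so that the fields are only converted for the records that are kept.

```sas
/* Assumption: fixed 32-byte records = 4-byte buffer + three 8-byte reals + 4-byte buffer. */
data MCMC;
  infile &MCMCfile recfm=f lrecl=32;
  input @;                     /* load the next record and hold it */
  if mod(_n_, &thin) = 0;      /* subsetting IF: only every nth record continues */
  input @5 Chain1 RB8. Chain2 RB8. Chain3 RB8.;   /* start after the leading 4-byte buffer */
run;
```

With RECFM=F each DATA step iteration starts on a new fixed-length record, so the trailing +4 used with RECFM=N is not needed here.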

☑ This topic is SOLVED.
