cooch17
Fluorite | Level 6

Suppose I have a very large (multiple GB) binary data file that I want to read into SAS (using 9.3 at the moment). I don't want to read in the entire file; rather, I want to read in every nth record.

 

For simple data files (ASCII), this can be done fairly easily using the # line pointer. For example, suppose I have some file called test.dat containing 3 data fields/columns (x, y and z). The following reads in every 5th record:

 

filename in 'c:\users\userDesktop\test.dat';

data hold; infile in;
   input #5  x y z;
run;

 

 

Works fine. But, for some reason, if test.dat is a binary file, this approach doesn't seem to work. To read in the particular binary data file, I use something like the following input syntax:

 

input buffer1 RB4. Chain1 RB8. Chain2 RB8.;

 

Works fine. However, 

 

input #5 buffer1 RB4. Chain1 RB8. Chain2 RB8.;

 

doesn't work as expected (or really, at all...). 

 

I know I could probably do this in 2 steps: (i) read in the full binary file, and then (ii) use some 'tricks' to subset the data and keep only every nth record. But the original file is so large that I'm trying to avoid having to read the entire thing in the first place.

 

Suggestions/pointers to the obvious are welcomed. 

Accepted solution: Tom's reply below (conditional OUTPUT statement).

7 REPLIES
Astounding
PROC Star

Until somebody else comes up with a better idea ... there are other ways to skip over lines when inputting data.  One possibility:

 

input //// buffer1 RB4. Chain1 RB8. Chain2 RB8.;

 

Another:

 

input;
input;
input;
input;
input buffer1 RB4. Chain1 RB8. Chain2 RB8.;

 

While this looks silly, it's easier to adapt if you want to read every 100th line instead of every 5th line.  For example:

 

do i=1 to 99;
   input;
end;
input buffer1 RB4. Chain1 RB8. Chain2 RB8.;

cooch17
Fluorite | Level 6

Neat, except it doesn't work with binary files -- following from the log:

 

The '/' INPUT/PUT statement option is inconsistent with binary mode I/O. The execution of
the DATA STEP is being terminated.

BrunoMueller
SAS Super FREQ

Are you using RECFM=N to read the file? If so, there are no records, as the file is just one long stream of bytes.

 

Does the file have just 3 data values, with this group of values repeating until the end of the file, or are there many more different data values? Do you have a record layout for the data values?

 

If you run the following code (with your file), what does the log look like?

data test;
  infile "yourfile" ;
  input char $1.;
run;
cooch17
Fluorite | Level 6

Here is how I read in this particular data file. &MCMCfile and &recl are macro variables set earlier in the program. Note I'm using recfm=N, but I explicitly set the linesize, which 'forces' records (i.e., splits the big string into discrete lines):

 

 

data MCMC;
  infile &MCMCfile linesize=&recl recfm=N;
  input buffer1 RB4. Chain1 RB8. Chain2 RB8. Chain3 RB8. buffer2 RB4.;
  drop buffer1 buffer2;
run;

 

Each line (record) contains output from an MCMC sampler at each step (so, same number of 'variables' per line, different values for each variable). 

 

As per my original post, the following 2-step procedure works: (i) pull in the big binary file (into a data set I call MCMC), then (ii) subset it, keeping every nth record. For step (ii), I simply use the following (one of a couple of approaches that would probably work), where thin is a macro variable I set earlier in the program.

 

***************************
* thin the data set....   *
***************************;

data MCMC;
  do point = &thin to nobs by &thin;
    set MCMC point=point nobs=nobs;
    output;
  end;
  stop;
run;

 

This works fine, but as per OP, seems annoyingly inefficient, since I'm basically taking 2 steps for something I'd like to do in 1 step (during the infile stage). 

 

 

Tom
Super User

Just use a conditional OUTPUT statement.  Reading every record adds little cost, since you are not doing direct access to the file anyway. You could use +4 to skip reading the fields you are dropping.

So to keep the 1st, 6th, 11th, ... records you could do this:

 

data MCMC;
  infile &MCMCfile linesize=&recl recfm=N;
  input +4 Chain1 RB8. Chain2 RB8. Chain3 RB8. +4 ;
  if mod(_n_,5) = 1 then output;
run;
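
Note that mod(_n_,5) = 1 keeps records 1, 6, 11, ..., whereas a POINT= loop that starts at &thin keeps records 5, 10, 15, .... To select the same records as that loop, the step can be parameterized by the thinning interval. A minimal sketch, assuming the macro variables are set as below; the file path is hypothetical, and recl=32 assumes the 4 + 8 + 8 + 8 + 4 byte layout implied by the INPUT statement in this thread:

```sas
%let MCMCfile = 'c:\path\to\mcmc.bin';  /* hypothetical path, not from the thread */
%let recl     = 32;                     /* assumed: 4 + 8 + 8 + 8 + 4 bytes per record */
%let thin     = 5;                      /* keep every 5th record */

data MCMC;
  infile &MCMCfile linesize=&recl recfm=N;
  /* skip the 4-byte buffers on either side of the three chains */
  input +4 Chain1 RB8. Chain2 RB8. Chain3 RB8. +4;
  /* _n_ counts DATA step iterations, i.e. records read so far */
  if mod(_n_, &thin) = 0 then output;
run;
```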

 

cooch17
Fluorite | Level 6
This worked perfectly -- I thought I'd tried something like this, but apparently, messed something up in the attempt. Thanks very much.
Kurt_Bremser
Super User

If you know that the binary file has a fixed record length, use recfm=f and a proper lrecl. Then you can read every record and only do the output when mod(_n_,5) = 0.
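
This suggestion might look something like the following. It is a minimal sketch, assuming a fixed record length of 32 bytes (4 + 8 + 8 + 8 + 4, taken from the INPUT statement earlier in this thread). The fileref mcmc, the generated test data, and the data set name thinned are made up for illustration:

```sas
/* Build a small test file: 20 fixed-length records of 32 bytes each */
filename mcmc temp;               /* hypothetical temporary fileref */

data _null_;
  file mcmc recfm=n;              /* raw byte stream, no record terminators */
  pad = 0;
  do step = 1 to 20;
    c1 = step; c2 = step * 10; c3 = step * 100;
    put pad ib4. c1 rb8. c2 rb8. c3 rb8. pad ib4.;
  end;
run;

%let thin = 5;                    /* keep every 5th record */

data thinned;
  infile mcmc recfm=f lrecl=32;   /* fixed-length records of 32 bytes */
  input +4 Chain1 rb8. Chain2 rb8. Chain3 rb8.;  /* skip leading 4-byte buffer */
  if mod(_n_, &thin) = 0;         /* subsetting IF: output only every &thin-th record */
run;
```

With this test file, the second step should keep records 5, 10, 15 and 20. Every record is still read from disk, but only the selected ones are written to the output data set.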
