Hello everyone,
I have a question about how to revisit rows in a DATA step. Let us assume that the dataset is too large to be read into memory. For an optimization problem, I then need to revisit the observations until convergence is reached. I can actually get the code below to work, so my question here is about efficiency.
data maximize;
   /*---here some initializing code---*/
   dsid=open('bigdata');
   do until (convergence);
      do i=1 to attrn(dsid,'nobs');
         if i=1 then obsid=fetchobs(dsid,1);   /* rewind to the first observation */
         else obsid=fetch(dsid);               /* read the next observation */
         /*here some more code, which calculates the contributions to the
           function to be maximized; not relevant for my question*/
      end;
      /*and here some code to calculate the Newton-Raphson step*/
   end;
   dsid=close(dsid);
run;
In this code I run through the dataset "bigdata" until I have convergence. I realize that reading data with a SET statement is much faster, since it can use the buffers. However, it does not seem possible to use SET here, since I do not know beforehand how many times I need to run through the dataset. Or is there some clever trick for using SET to read the data as many times as necessary?
Jacob
Maybe SET with the POINT= option would suit your needs.
As data _null_ points out, you can also use a SET statement within a DO loop, and you can use POINT= if you need to address observations directly instead of just iterating through the data from top to bottom.
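For example, the whole thing could look roughly like this (untested sketch; the convergence flag and the objective calculation are placeholders for your own code):

data _null_;
   do until (convergence);
      do obs=1 to totobs;
         set bigdata point=obs nobs=totobs;
         /* accumulate the contributions to the objective function here */
      end;
      /* Newton-Raphson step here; set convergence=1 when done */
      convergence=1;   /* placeholder so this sketch terminates */
   end;
   stop;   /* required with POINT= to prevent an endless loop */
run;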
I'm not in front of SAS right now, so can't send any sample code. However, have a look in the doc for SET ... KEY= syntax.
I think you could do something like (totally untested):
data bigdata (index=(obsnum));
   set bigdata;
   obsnum+1;   * need an indexed variable to loop over all observations ;
run;
data maximize;
   /*---here some initializing code---*/
   do until (convergence);
      do obsnum=1 to nobs;
         set bigdata nobs=nobs key=obsnum / unique;   * check if unique is needed or not... ;
         /*here some more code, which calculates the contributions to the
           function to be maximized; not relevant for my question*/
      end;
      /*and here some code to calculate the Newton-Raphson step,
        setting convergence=1 when done*/
   end;
   stop;
run;
I didn't follow all your code, but essentially SET ... KEY= uses the value of the key variable(s) in the PDV as a random lookup into the dataset named in the SET statement. A bit like the hash object, but without the memory constraints.
See the doc for the SET statement. I think it's example 7 or 8.
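For comparison, the hash-object version of the same lookup would look roughly like this (untested; obsnum and x1 assume the indexed dataset created above, and note that unlike KEY= the hash loads the whole table into memory, which is exactly what you want to avoid here):

data _null_;
   if 0 then set bigdata;               /* define the host variables at compile time */
   declare hash h(dataset:'bigdata');   /* loads all of bigdata into memory */
   h.defineKey('obsnum');
   h.defineData(all:'yes');
   h.defineDone();
   do obsnum=1 to 10;                   /* look up a few observations by key */
      rc=h.find();                      /* fills the PDV, much like SET ... KEY= */
      if rc=0 then put obsnum= x1=;
   end;
   stop;
run;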
HTH,
Scott
Thank you for your answers.
I got the SET statement to work.
The comparison in efficiency between the SET approach and my approach above is interesting. Unfortunately, the SET statement does not seem to run through a dataset any faster than the OPEN and FETCH functions. On my laptop the SET approach took 55 seconds on the test dataset simulated below, while the OPEN and FETCH approach took only 15 seconds. However, if I also need all the variables from the source data as named variables, then the two approaches take an equal amount of time (55 seconds in the example below). If speed is not so important, the SET statement is easier to use, as it has many more useful features.
I think I agree with Scott that the KEY= option is also a possibility. However, my experience with the KEY= option is that it is very inefficient if I have to look up many times, although it is efficient for a few lookups in a (possibly large) dataset. Since I have to go through the whole dataset here, I think it is more efficient to read the data sequentially than to use KEY= and indexes.
You all get a star for "helpful answer". I like your answers, even though my original approach seems to be the most efficient.
Here is the comparison between the SET and the OPEN and FETCH approaches:
/* create a test dataset */
data bigdata;
   array x{8};
   do variable=1 to 10000000;
      do j=1 to 8;
         x{j}=ranuni(1);   /* any values will do for the timing test */
      end;
      output;
   end;
run;
/*run through the dataset 10 times with OPEN and FETCH*/
data _NULL_;
   length x1-x8 variable j 8;
   dsid=open('bigdata');
   call set(dsid);   /* copies the fetched values into the PDV variables;
                        commenting this out reduces the run time */
   /*here some code to initialize the optimization*/
   steps=10;
   do until (steps=0);   /*convergence criterion*/
      do i=1 to attrn(dsid,'nobs');
         if i=1 then obsid=fetchobs(dsid,1);
         else obsid=fetch(dsid);
         /*calculate the contributions to the function that should be maximized*/
      end;
      steps=steps-1;
      /*calculate the new value, according to the Newton-Raphson algorithm*/
   end;
   dsid=close(dsid);
run;
*the "set"-approach:;
data _NULL_;
steps=10;
do until(convergence);
do observation=1 to 10000000;
set bigdata point=observation;
end;
steps=steps-1;
convergence=(steps=0);
if (steps=0) then go to finalize;
end;
finalize: put convergence;
stop;
run;
I would examine the Newton-Raphson method itself, as that is likely where most of the clock cycles go, from what I remember of coding one in SAS 6.08.
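For what it is worth, the skeleton of a one-parameter Newton-Raphson fit with one pass over the data per iteration could look something like this (hypothetical, untested sketch; the normal-mean "model" is chosen only because its score and Hessian contributions are trivial):

data _null_;
   theta=0;                                 /* starting value */
   do iter=1 to 50 until (converged);
      grad=0; hess=0;
      do obs=1 to totobs;                   /* one full pass per iteration */
         set bigdata point=obs nobs=totobs;
         grad + (x1 - theta);               /* score contribution */
         hess + (-1);                       /* Hessian contribution */
      end;
      step = -grad/hess;                    /* Newton-Raphson update */
      theta + step;
      converged = (abs(step) < 1e-8);
   end;
   put iter= theta=;
   stop;                                    /* required with POINT= */
run;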