Hello,
I am trying to develop a code that find the maximum value in a range of 1800 rows and subsequently condenses the data down to that one row, and then move on to the next 1800 and do the same.
Overall I have about 30 data sets, each of which has over 10 million rows, so am trying to use PROC SQL or PROC EXPAND as opposed to data step processing.
Can someone provide some help?
Thanks!
- jkurka
PROC SUMMARY has features that make this easy. First you can use a view to add the group variable to your data, virtually speaking. The IDGROUP OUTPUT statement option finds the MAX of age for each BY G(roup) created in the VIEW and the OUT sub-option defines the variable(s) to write to the data set that correspond to the MAX(AGE). The OBS sub-option creates the _OBS_ variable with the observation number.
%let gsize=3;
data rowsv / view=rowsv;
set sashelp.class;
g + mod(_n_,&gsize) eq 1;
run;
proc summary data=rowsv;
by g;
output out=max&gsize idgroup(max(age) obs out(name--weight)=);
run;
Is there some variable that identifies the 1800 rows?
Using PROC SQL and the monotonic () function I created a 'rows' variable, but that is all besides the variable (column) that has the data I want to aggregate.
I could provide an example for visualization. I'm sure that would help!
Say I'm starting with this and want to find the max within a range of each 3 variables:
rows var
1 0
2 2
3 4
4 1
5 4
6 2
7 0
8 5
9 1
I would like my output to show:
row var
1 4 --> max of first set of 3
4 4 --> max of next set of 3
7 5 --> max of next set 3
It's simple in a DATA step:
data want;
keep row maximum;
set have;
retain maximum 0;
maximum = max(maximum, var);
if mod(_n_, 1800) = 0 then do;
row = _n_;
output;
maximum = 0;
end;
run;
Note though that the data is read in the order it is stored. If the stored order changes then so do the maximums.
Won't this take a really long time to go through 10 million rows?
The actual time it takes depends on many factors. It should take about the same for a procedure to read the data as data step. Ideally you want to make one pass of the data.
PROC SUMMARY has features that make this easy. First you can use a view to add the group variable to your data, virtually speaking. The IDGROUP OUTPUT statement option finds the MAX of age for each BY G(roup) created in the VIEW and the OUT sub-option defines the variable(s) to write to the data set that correspond to the MAX(AGE). The OBS sub-option creates the _OBS_ variable with the observation number.
%let gsize=3;
data rowsv / view=rowsv;
set sashelp.class;
g + mod(_n_,&gsize) eq 1;
run;
proc summary data=rowsv;
by g;
output out=max&gsize idgroup(max(age) obs out(name--weight)=);
run;
Thank you data_null__. Yes I'm keeping track of both posts. Wasn't aware of the duplicate until now. I'm giving your solution a try now!
It would be a good idea to show an example of your data. If you have date-time and you want to group accordingly it might be better to use the date-time and a format or other rounding technique instead of observations number.
The data aren't timestamped. I've attached an excel example of a chuck of one file. As you can see it's simple, and just wanting the maximum value of every 1800 rows of the flt_z variable (last column).
I have the time that corresponds to the first row, but won't timestamp the full dataset due to it taking so long.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.