Solved: Re: How to find a max value while restricting number of rows

jkurka · Posted 02-11-2016 10:49 PM

Hello,

I am trying to develop a code that find the maximum value in a range of 1800 rows and subsequently condenses the data down to that one row, and then move on to the next 1800 and do the same.

Overall I have about 30 data sets, each of which has over 10 million rows, so am trying to use PROC SQL or PROC EXPAND as opposed to data step processing.

Can someone provide some help?

Thanks!

- jkurka

data_null__ · Posted 02-12-2016 04:37 AM

PROC SUMMARY has features that make this easy. First you can use a view to add the group variable to your data, virtually speaking. The IDGROUP OUTPUT statement option finds the MAX of age for each BY G(roup) created in the VIEW and the OUT sub-option defines the variable(s) to write to the data set that correspond to the MAX(AGE). The OBS sub-option creates the _OBS_ variable with the observation number.

%let gsize=3;

data rowsv / view=rowsv;
   set sashelp.class;
   g + mod(_n_,&gsize) eq 1;
   run;
proc summary data=rowsv;
   by g;
   output out=max&gsize idgroup(max(age) obs out(name--weight)=);
   run;

View solution in original post

Reeza · Posted 02-11-2016 11:10 PM

Is there some variable that identifies the 1800 rows?

jkurka · Posted 02-11-2016 11:32 PM

Using PROC SQL and the monotonic () function I created a 'rows' variable, but that is all besides the variable (column) that has the data I want to aggregate.

jkurka · Posted 02-11-2016 11:39 PM

I could provide an example for visualization. I'm sure that would help!

Say I'm starting with this and want to find the max within a range of each 3 variables:

rows var

1 0

2 2

3 4

4 1

5 4

6 2

7 0

8 5

9 1

I would like my output to show:

row var

1 4 --> max of first set of 3

4 4 --> max of next set of 3

7 5 --> max of next set 3

SASKiwi · Posted 02-12-2016 01:57 AM

It's simple in a DATA step:

data want;
  keep row maximum;
  set have;
  retain maximum 0;
  maximum = max(maximum, var);
  if mod(_n_, 1800) = 0 then do;
    row = _n_;
    output; 
    maximum = 0;
  end;
run;

Note though that the data is read in the order it is stored. If the stored order changes then so do the maximums.

jkurka · Posted 02-12-2016 09:36 AM

Won't this take a really long time to go through 10 million rows?

data_null__ · Posted 02-12-2016 09:45 AM

The actual time it takes depends on many factors. It should take about the same for a procedure to read the data as data step. Ideally you want to make one pass of the data.

data_null__ · Posted 02-12-2016 04:37 AM

PROC SUMMARY has features that make this easy. First you can use a view to add the group variable to your data, virtually speaking. The IDGROUP OUTPUT statement option finds the MAX of age for each BY G(roup) created in the VIEW and the OUT sub-option defines the variable(s) to write to the data set that correspond to the MAX(AGE). The OBS sub-option creates the _OBS_ variable with the observation number.

%let gsize=3;

data rowsv / view=rowsv;
   set sashelp.class;
   g + mod(_n_,&gsize) eq 1;
   run;
proc summary data=rowsv;
   by g;
   output out=max&gsize idgroup(max(age) obs out(name--weight)=);
   run;

jkurka · Posted 02-12-2016 09:47 AM

Thank you data_null__. Yes I'm keeping track of both posts. Wasn't aware of the duplicate until now. I'm giving your solution a try now!

data_null__ · Posted 02-12-2016 09:54 AM

It would be a good idea to show an example of your data. If you have date-time and you want to group accordingly it might be better to use the date-time and a format or other rounding technique instead of observations number.

jkurka · Posted 02-12-2016 10:08 AM

The data aren't timestamped. I've attached an excel example of a chuck of one file. As you can see it's simple, and just wanting the maximum value of every 1800 rows of the flt_z variable (last column).

I have the time that corresponds to the first row, but won't timestamp the full dataset due to it taking so long.

Registration is open