jkurka
Fluorite | Level 6

Hello,

I am trying to develop code that finds the maximum value in a range of 1,800 rows, condenses the data down to that one row, and then moves on to the next 1,800 rows and does the same.

 

Overall I have about 30 data sets, each of which has over 10 million rows, so I am trying to use PROC SQL or PROC EXPAND as opposed to data step processing.

 For example:

proc expand factor=(1:1800);
  convert z=a / transformout=(max);
run;

 

Can someone provide some help? Using SAS EG 9.4

Thanks!

- jkurka

12 REPLIES
LinusH
Tourmaline | Level 20
I'm not familiar with PROC EXPAND, so I can't evaluate that option.
But for chunking the data into 1,800-observation pieces, it sounds like a data step is the optimal way to do it: it requires only one pass of the data. You could wrap this logic in a macro if you need to apply it to many data sets.
May I ask how the outcome will be used?
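For instance, a minimal macro sketch along those lines (the data set names and the variable VAR here are placeholders, not from the original post):

```sas
%macro condense(in, out, k=1800);
  /* one pass: take the max of VAR over each block of &k rows.       */
  /* VAR and the data set names are assumptions for this sketch.     */
  data &out;
    do i = 1 to &k until (done);
      set &in end=done;
      max_var = max(max_var, var);  /* max_var resets to missing each block */
    end;
    keep max_var;
  run;
%mend;

/* one call per data set, e.g.: */
%condense(mydata1, mydata1_max)
```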
Data never sleeps
jkurka
Fluorite | Level 6

A data step is going to take too long to process over 10 million rows across at least 30 datasets, won't it? The output variable will be matched to a time variable and then plotted.

RW9
Diamond | Level 26

A data step is likely to be faster than SQL.  However, looking at your response, even taking the max of every 1,800 records is still going to give you roughly 16,666 points to plot. Do you think it's a good idea, or even possible, to plot that amount of data on one graph?  What is this data? Can you subdivide it into smaller groupings: subject level, visit level, test, or something similar?  Why 1,800 records, for instance?  And why the max? If the max isn't essential, can you not take means of your data by subgroup and plot the result?

jkurka
Fluorite | Level 6

This data is a person's movement. Movement is collected 30 times per second, and it's typical to compress that data into 1-minute epochs (hence 1800 = 30/sec * 60 sec) using the peak (maximum) value of each 1-minute epoch. Seven days' worth of data per person results in over 10 million rows. I have the graphing syntax prepared; the plots are typically segmented into multiple graphs, each with only 48 hours' worth of data.

RW9
Diamond | Level 26

OK, I'm assuming some things about your data, but if it's just blocking the data out into time segments, what about:

/* Create some test data: 2 subjects, 30 readings per second */
data have;
  call streaminit(123);
  do pt=1 to 2;
    do time="01JAN2015T00:00"dt to "05JAN2015T00:00"dt;  /* one step per second */
      do i=1 to 30;                                      /* 30 readings per second */
        res=rand('normal');
        output;
      end;
    end;
  end;
  format time datetime.;
run;

data want (drop=end_time res i);
  set have;
  by pt;
  retain max_res end_time;
  if first.pt then do;
    max_res=.;                  /* start from missing so negative values work */
    end_time=time+"00:01"t;     /* "00:01"t = 60 seconds */
  end;
  max_res=max(max_res, res);
  if time >= end_time or last.pt then do;
    output;
    max_res=.;                  /* reset for the next 1-minute block */
    end_time=time+"00:01"t;
  end;
run;

This gets the max value per 1-minute block.

RW9
Diamond | Level 26

I would agree with @LinusH here; it sounds like a data step would be the best way to go.  You could also do it in a couple of steps, like:

data want;
  set have;
  group=floor((_n_-1)/1800);
run;

proc sort data=want;
  by group descending value;
run;

data want;
  set want;
  by group;
  if first.group;
run;

However, what if there is more than one row with the same value?

data_null__
Jade | Level 19

@RW9 wrote:

I would agree with @LinusH here, sounds like datastep would be the best way to go.  You may also do it in a couple of steps such like:

data want;
  set have;
  group=floor((_n_-1)/1800);
run;

proc sort data=want;
  by group descending value;
run;

data want;
  set want;
  by group;
  if first.group;
run;

However, what if there is more than one row with the same value?


You don't take into account the 10 million rows; you are suggesting a lot of reading and writing.

This question has been asked twice, with the second posting mentioning the desire to have the observation number.

jkurka
Fluorite | Level 6

Yes, I posted under a different subsection of the SAS community. This is an example of what I'd like to do; as data_null__ wrote, I need the max of every 1,800 rows over 10 million rows per dataset, for at least 30 datasets. I don't need the observation number itself. If I could get by with processing every k rows, that would be great!

rows  var
1     0
2     2
3     4
4     1
5     4
6     2
7     0
8     5
9     1

I would like my output to show:

row   var
1     4    --> max of first set of 3
4     4    --> max of next set of 3
7     5    --> max of next set of 3
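For the k=3 example above, a single-pass data step along these lines should reproduce that output (a sketch; it assumes the input data set is named HAVE with variable VAR):

```sas
data want;
  row = 1 + (_n_ - 1) * 3;        /* first row number of each 3-row block */
  do i = 1 to 3 until (done);     /* read 3 rows (fewer at end of file)   */
    set have end=done;
    max_var = max(max_var, var);  /* max_var resets to missing each block */
  end;
  keep row max_var;
run;
```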

data_null__
Jade | Level 19

How are the 30 data sets to be processed?  Separately, or as one?  Did you see my reply to your duplicate question?

jkurka
Fluorite | Level 6

Thank you for your post on the other thread!

%let epoch=1800 ;
data DATA_SVM / view=DATA_SVM ;
    set data_nostamps ;
    minute + (mod(_n_,&epoch) eq 1) ;
run;
proc summary data=DATA_SVM ;
    by minute ;
    output out=DATA_1minEPOCHS_&id idgroup(max(flt_z) obs out(x--flt_z)=);
run;

 

This worked great! And was faster than I expected. Much appreciated!

johnsville
Obsidian | Level 7


/*
* Breaking down a SQL solution to this problem...
* just for fun!
*
* You have to have a sequence number on the data for this to work
* (even if you don't want one!),
* which may mean you have to pass the data once to prepare it,
* unless you already have a sequence number.
*
* You compute a panel: the sequence number divided by your batch
* size (1800) and rounded up, so each panel is one contiguous block.
*/

data new_copy ;
set old_copy ;
seq=_n_ ;
panel=ceil(_n_/1800) ;
run;

/*
*
* Then proc sql and group by that panel.
*
*/
proc sql;
select panel ,max(value) as max_value
from new_copy
group by panel
order by 1,2
;
quit ;

/*
* When you need to get the sequence number of the actual row that has the max value,
* add the sequence number into the aggregate function as
* in the second PROC SQL below.
* In the presence of duplication, this will end up choosing the LAST occurrence of that max value...
*/
proc sql;
create table get_these as
select panel ,max(CATX(',',put(value,Z9.),put(seq,Z9.))) as max_value
from new_copy
group by panel
order by 1,2
;
quit ;

/*
* then you can match that data set back to your original data
* to get the whole row. Up to you!
*
* but whatever process you choose, most likely the data - however large -
* must either be passed multiple times, or else held in memory in total...
*
*/
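For completeness, one way to do that match-back (a sketch; it parses the seq half back out of the CATX value built above, relying on the Z9. widths used there):

```sas
proc sql;
create table max_rows as
select n.*
from new_copy n
inner join get_these g
  on n.seq = input(scan(g.max_value, 2, ','), 9.)
;
quit ;
```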

Astounding
PROC Star

Are we making too much of a relatively easy problem?  Wouldn't this do a sufficient job (and be blazingly fast)?

 

data want;
  key = _n_;
  do i=1 to 1800 until (done);
    set have end=done;
    maxvar = max(var, maxvar);
  end;
  drop i;
run;

 

 


Discussion stats
  • 12 replies
  • 1421 views
  • 0 likes
  • 6 in conversation