BookmarkSubscribeRSS Feed
jkurka
Fluorite | Level 6

Hello,

I am trying to develop a code that find the maximum value in a range of 1800 rows and subsequently condenses the data down to that one row, and then move on to the next 1800 and do the same.

 

Overall I have about 30 data sets, each of which has over 10 million rows, so am trying to use PROC SQL or PROC EXPAND as opposed to data step processing.

 For example:

Proc expand factor (1:1800);

convert z=a / transformout ( max ) ;

run;

 

Can someone provide some help? Using SAS EG 9.4

Thanks!

- jkurka

12 REPLIES 12
LinusH
Tourmaline | Level 20
I'm not familiar with proc expand, so I can't value that option.
But by chunking up the data into 1800 Obs pieces, it sounds that a data step is most optimal way to do it. It requires only one pass of the data. You could embed this logic on macro if need to apply this logic to many data sets.
May I ask how the outcome will be used?
Data never sleeps
jkurka
Fluorite | Level 6

A data step is going to take too long to process this over 10 million rows and at least 30 datasets won't it? The output variable will be matched to a time variable then plotted.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Datastep is likely to be faster than SQL.  However looking at your response, even taking the max of 1800 records is still going to give you 16666.xy points to plot, do you think its a good idea/even possible to plot that amount of data out to a graph?  What is this data, can you subdivide it into smaller groupings, subject level, visit level, test or something?  Why the 1800 records for instance?  why the max, if that is the case can you not means your data by the subgroups and plot the result from that?

jkurka
Fluorite | Level 6

This data is a person's movement. Movement is collected 30x per second and it's typical to compress that data into 1-min epochs (hence the 1800 = 30x/sec * 60 sec) and using the peak (maximum) value of that 1-min epoch. The 7 days worth of data per person results in over 10 million rows. I have the syntax prepared, and they are typically segemented into multiple graphs, each with only 48 hours worth of data.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Ok, well I assume some things with your data, but if it just blocking out into time segments what about:

/* Create some test data */
data have;
  call streaminit(123);
  do pt=1 to 2;
    do time="01JAN2015T00:00"dt to "05JAN2015T00:00"dt;
      do i=1 to 30;
        res=rand('normal');
        output;
      end;
    end;
  end;
  format time datetime.;
run;

data want (drop=end_time res i);
  set have;
  retain max_res end_time;
  by pt;
  if first.pt then do;
    max_res=0;
    end_time=time+"00:01"t;
  end;
  if res > max_res then max_res=res;
  if time = end_time or last.pt then do;
    output;
    end_time=time+"00:01"t;
  end;
run;

This gets the max value per 1 minute blocks, 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

I would agree with @LinusH here, sounds like datastep would be the best way to go.  You may also do it in a couple of steps such like:

data want;
  set have;
  group=floor(_n_/1600);
run;

proc sort data=want;
  by group descending value;
run;

data want;
  set want;
  by group;
  if first.group;
run;

However, what if there is more than one row with the same value?

data_null__
Jade | Level 19

@RW9 wrote:

I would agree with @LinusH here, sounds like datastep would be the best way to go.  You may also do it in a couple of steps such like:

data want;
  set have;
  group=floor(_n_/1600);
run;

proc sort data=want;
  by group descending value;
run;

data want;
  set want;
  by group;
  if first.group;
run;

However, what if there is more than one row with the same value?


You don't take into account the 10 million rows, you are suggesting a lot of reading and writing.

This question has been asked twice with the second asking mentioning the desire to have the observation number.

jkurka
Fluorite | Level 6

Yes I posted under another different subsection of the SAS community. This is an example of what I'd like to do, but as data_null__ wrote, I need the max of 1800 rows for over 10 million rows per dataset, with at least 30 datasets. I don't need the observation number itself. If I could get by with processing every k rows that would be great!

rows         var

1              0

2              2

3              4

4              1

5              4

6              2

7              0

8              5

9              1

 

I would like my output to show:

row          var

1              4    --> max of first set of 3

4              4    --> max of next set of 3

7              5    --> max of next set 3

data_null__
Jade | Level 19

How are the 30 data sets to be processed?  Separately, as one?.  Did you see my reply to your duplicate question?  

jkurka
Fluorite | Level 6

Thank you for your post on the other thread!

%let epoch=1800 ;
DATA DATA_SVM / view=DATA_SVM ;
    set data_nostamps ;
    minute + mod(_n_,&epoch) eq 1 ;
RUN;
PROC SUMMARY data= DATA_SVM ;
    by minute ;
    output out=DATA_1minEPOCHS_&id idgroup(max(flt_z) obs out(x--flt_z)=);

RUN;

 

This worked great! And was faster than I expected. Much appreciated!

johnsville
Obsidian | Level 7


/*
* Breaking down a sql solution to this problem...
* just for fun... !!!
*
* You have to have a sequence number on the data for this to work

* (even if you don't want one!)
* which maybe means you have to pass the data once to prepare
* unless you already have a sequence number
*
* You compute a panel, which is the sequence number modulo your batch size: 1800
*/


data new_copy ;
set old_copy ;
seq=_n_ ;
panel=MOD(_n_,10) ;
run;

/*
*
* Then proc sql and group by that panel.
*
*/
proc sql;
select panel ,max(value) as max_value
from new_copy
group by panel
order by 1,2
;
quit ;

run;

/*
* When you need to get the sequence number of the actual row that has the max value,
* add the sequence number into the aggregate function as
* in the second proc sql below.
* In the presence of duplicatoin, this will end up choosing that LAST occurance of that max value...
*/
proc sql;
create table get_these as
select panel ,max(CATX(',',put(value,Z9.),put(seq,Z9.))) as max_value
from new_copy
group by panel
order by 1,2
;
quit ;

run;

/*
* then you can match that data set back to your original data
* to get the whole row. Up to you!
*
* but whatever process you choose, most likely the data - however large -
* must either be passed multiple times, or else held in memory in total...
*
*/

Astounding
PROC Star

Are we making too much of a relatively easy problem?  Wouldn't this do a sufficient job (and be blazingly fast)?

 

data want;

key = _n_;

do i=1 to 1800 until (done);

   set have end=done;

   maxvar = max(var, maxvar);

end;

drop i;

run;

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 1980 views
  • 0 likes
  • 6 in conversation