BookmarkSubscribeRSS Feed
gmorla2016
Calcite | Level 5

hi, i have a table in which I need to calculate the max among the previous 30 days for every record , is there a way in hive with sub queries?, see below for sample data

 

FIELD1FIELD2FIELD3DATEVARMAX VAR
USATXRTB201609301
USATXRTB201610011
USATXRTB201610021
USATXRTB201610031
USATXRTB201610041
USATXRTB201610052
USATXRTB201610062
USATXRTB201610072
USATXRTB201610082
USATXRTB201610091
USATXRTB201610102
USATXRTB201610112
USATXRTB201610122
USATXRTB201610132
USATXRTB201610142
USATXRTB201610152
USATXRTB201610162
USATXRTB201610172
USATXRTB201610182
USATXRTB201610192
USATXRTB201610202
USATXRTB201610212
USATXRTB201610222
USATXRTB201610232
USATXRTB201610242
USATXRTB201610253
USATXRTB201610262
USATXRTB201610272
USATXRTB201610282
USATXRTB201610292
USATXRTB2016103023
USATXRTB2016103123
USATXRTB2016110123
USATXRTB2016110223
USATXRTB2016110323
USATXRTB2016110423
USATXRTB2016110523
USATXRTB2016110623
USATXRTB2016110723
USATXRTB2016110823
USATXRTB2016110923
USATXRTB2016111023
USATXRTB2016111123
USATXRTB2016111223
USATXRTB2016111323
USATXRTB2016111423
USATXRTB2016111523
USATXRTB2016111623
USATXRTB2016111723
USATXRTB2016111823
USATXRTB2016111923
USATXRTB2016112023
USATXRTB2016112123
USATXRTB2016112223
USATXRTB2016112323
USATXRTB2016112423
USATXRTB2016112522
USATXRTB2016112622
USATXRTB2016112722
USATXRTB2016112822
USATXRTB2016112922
UKLONDONRTB201609301
UKLONDONRTB201610011
UKLONDONRTB201610021
UKLONDONRTB201610031
UKLONDONRTB201610041
UKLONDONRTB201610051
UKLONDONRTB201610061
UKLONDONRTB201610071
UKLONDONRTB201610081
UKLONDONRTB201610091
UKLONDONRTB201610101
UKLONDONRTB201610111
UKLONDONRTB201610121
UKLONDONRTB201610131
UKLONDONRTB201610141
UKLONDONRTB201610151
UKLONDONRTB201610161
UKLONDONRTB201610171
UKLONDONRTB201610181
UKLONDONRTB201610191
UKLONDONRTB201610201
UKLONDONRTB201610211
UKLONDONRTB201610221
UKLONDONRTB201610231
UKLONDONRTB201610241
UKLONDONRTB201610254
UKLONDONRTB201610262
UKLONDONRTB201610272
UKLONDONRTB201610282
UKLONDONRTB201610292
UKLONDONRTB2016103024
UKLONDONRTB2016103124
UKLONDONRTB2016110124
UKLONDONRTB2016110224
UKLONDONRTB2016110324
UKLONDONRTB2016110424
UKLONDONRTB2016110524
UKLONDONRTB2016110624
UKLONDONRTB2016110724
UKLONDONRTB2016110824
UKLONDONRTB2016110924
UKLONDONRTB2016111024
UKLONDONRTB2016111124
UKLONDONRTB2016111224
UKLONDONRTB2016111324
UKLONDONRTB2016111424
UKLONDONRTB2016111524
UKLONDONRTB2016111624
UKLONDONRTB2016111724
UKLONDONRTB2016111824
UKLONDONRTB2016111924
UKLONDONRTB2016112024
UKLONDONRTB2016112124
UKLONDONRTB2016112224
UKLONDONRTB2016112324
UKLONDONRTB2016112424
UKLONDONRTB2016112522
UKLONDONRTB2016112622
UKLONDONRTB2016112722
UKLONDONRTB2016112822
UKLONDONRTB2016112922
7 REPLIES 7
Reeza
Super User

Are you grouping by the first three variables? Also, is your date a SAS date, numeric with a date format?

 

A self join is what's typically used to compute these types of calculations. If you have SAS/ETS licensed, PROC EXPAND would be a straightforward option.

gmorla2016
Calcite | Level 5

Reeza, thanks for the reply.

 

yes i am grouping by first three variables and my sasdate is in dateformat. i dont have sas ets, can you guide me in writing the self join.

Reeza
Super User

Proc SQL;

create table want as select a.*, max(b.var) as maxvar

from have as a

join have as b

on a.var1=b.var1

..../*first thee vars go here*/

 

AND b.date between a.date and a.date+30 /*plus or minus depending on looking forward or back*/

group by a.var1, a.var2, a.var3, a.date, a.var /*all variables from table a*/;

quit;

mkeintz
PROC Star

Is your source table already ordered by date within the field1/field2/field3 groups?  You mention "hive" so maybe not.  But if it is, a DATA step is often more efficient than PROC SQL:

 

Starting with the 31st record per group the process gets maximum value of the PRIOR_30 array.

 

The PRIOR_30 array has these properties:

  1. It's _TEMPORARY_ which means it's automatically retained, but not written to the resulting data set.

  2. Array bounds are 0 through 29.  That's so it can continually hold the rolling 30 preceding days, where the position in the array is determined by dividing variable I by 30 and using the remainder [=mod(I,30)] to specify which array element.  The result is that each value of VAR replaces the 30th preceding value  (e.g. record 31 replaces record 1 in element 1 of the array, since each has a remainder of 1 when divided by 30). Note records 30,60 etc go into element 0.

Untested program:

 

data want ;
  array prior_30{0:29} _temporary_   ;
  set have;
  by field1 field2 field3 notsorted;
  I+1;
  if first.field3 then do;
    call missing(of prior_30{*});
    I=1;
  end;
  if I>30 then maxvar=max(of prior_30{*});
  prior_30{mod(I,30)}=var;
run;

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Reeza
Super User

@mkeintz solution Also assumes that you have no gaps in your data so you need a record for each date. There's nothing wrong with this, just something to consider. 

mkeintz
PROC Star

I agree on the issue of "holes" in the data series.  Here's a modificatioon that accomodates gaps in the date sequence.  In effect it acts as if the missing  dates were present but VAR1 had a missing  value:

 

data want (drop=_:);
  array prior_30{0:29} _temporary_   ;

  set have;
  by field1 field2 field3 notsorted;

  if first.field3 then do;
    call missing(of prior_30{*});
    _first_date=date;
  end;
  retain _first_date; 

  /* See if there are holes between current and preceding dates  */
  /* If so, assign missing values to corresponding array elements */
  _date_increment=dif(date);
  if _date_increment>1 then do _d=1 to _date_increment-1;
    prior_30{mod(date-_d,30)}=.;
  end;

  if date>=_first_date+30 then maxvar=max(of prior_30{*});
  prior_30{mod(date,30)}=var;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
If you don't have big table, try SQL.


data have;
infile cards truncover expandtabs;
input FIELD1 $ FIELD2 $ FIELD3 $ DATE :yymmdd10. VAR;
format date yymmdd10.;
cards;
USA	TX	RTB	20160930	1
USA	TX	RTB	20161001	1
USA	TX	RTB	20161002	1
USA	TX	RTB	20161003	1
USA	TX	RTB	20161004	1
USA	TX	RTB	20161005	2
USA	TX	RTB	20161006	2
USA	TX	RTB	20161007	2
USA	TX	RTB	20161008	2
USA	TX	RTB	20161009	1
;
run;
proc sql;
select *,(select max(var) from have 
where field1=a.field1 and field2=a.field2 and field3=field3
and date between a.date-30 and a.date) as max_val
 from have as a;
quit;


sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2087 views
  • 0 likes
  • 4 in conversation