Re: max value of previous 30 days for every record of a table

gmorla2016 · Posted 11-29-2016 06:29 PM

hi, i have a table in which I need to calculate the max among the previous 30 days for every record , is there a way in hive with sub queries?, see below for sample data

FIELD1	FIELD2	FIELD3	DATE	VAR	MAX VAR
USA	TX	RTB	20160930	1
USA	TX	RTB	20161001	1
USA	TX	RTB	20161002	1
USA	TX	RTB	20161003	1
USA	TX	RTB	20161004	1
USA	TX	RTB	20161005	2
USA	TX	RTB	20161006	2
USA	TX	RTB	20161007	2
USA	TX	RTB	20161008	2
USA	TX	RTB	20161009	1
USA	TX	RTB	20161010	2
USA	TX	RTB	20161011	2
USA	TX	RTB	20161012	2
USA	TX	RTB	20161013	2
USA	TX	RTB	20161014	2
USA	TX	RTB	20161015	2
USA	TX	RTB	20161016	2
USA	TX	RTB	20161017	2
USA	TX	RTB	20161018	2
USA	TX	RTB	20161019	2
USA	TX	RTB	20161020	2
USA	TX	RTB	20161021	2
USA	TX	RTB	20161022	2
USA	TX	RTB	20161023	2
USA	TX	RTB	20161024	2
USA	TX	RTB	20161025	3
USA	TX	RTB	20161026	2
USA	TX	RTB	20161027	2
USA	TX	RTB	20161028	2
USA	TX	RTB	20161029	2
USA	TX	RTB	20161030	2	3
USA	TX	RTB	20161031	2	3
USA	TX	RTB	20161101	2	3
USA	TX	RTB	20161102	2	3
USA	TX	RTB	20161103	2	3
USA	TX	RTB	20161104	2	3
USA	TX	RTB	20161105	2	3
USA	TX	RTB	20161106	2	3
USA	TX	RTB	20161107	2	3
USA	TX	RTB	20161108	2	3
USA	TX	RTB	20161109	2	3
USA	TX	RTB	20161110	2	3
USA	TX	RTB	20161111	2	3
USA	TX	RTB	20161112	2	3
USA	TX	RTB	20161113	2	3
USA	TX	RTB	20161114	2	3
USA	TX	RTB	20161115	2	3
USA	TX	RTB	20161116	2	3
USA	TX	RTB	20161117	2	3
USA	TX	RTB	20161118	2	3
USA	TX	RTB	20161119	2	3
USA	TX	RTB	20161120	2	3
USA	TX	RTB	20161121	2	3
USA	TX	RTB	20161122	2	3
USA	TX	RTB	20161123	2	3
USA	TX	RTB	20161124	2	3
USA	TX	RTB	20161125	2	2
USA	TX	RTB	20161126	2	2
USA	TX	RTB	20161127	2	2
USA	TX	RTB	20161128	2	2
USA	TX	RTB	20161129	2	2
UK	LONDON	RTB	20160930	1
UK	LONDON	RTB	20161001	1
UK	LONDON	RTB	20161002	1
UK	LONDON	RTB	20161003	1
UK	LONDON	RTB	20161004	1
UK	LONDON	RTB	20161005	1
UK	LONDON	RTB	20161006	1
UK	LONDON	RTB	20161007	1
UK	LONDON	RTB	20161008	1
UK	LONDON	RTB	20161009	1
UK	LONDON	RTB	20161010	1
UK	LONDON	RTB	20161011	1
UK	LONDON	RTB	20161012	1
UK	LONDON	RTB	20161013	1
UK	LONDON	RTB	20161014	1
UK	LONDON	RTB	20161015	1
UK	LONDON	RTB	20161016	1
UK	LONDON	RTB	20161017	1
UK	LONDON	RTB	20161018	1
UK	LONDON	RTB	20161019	1
UK	LONDON	RTB	20161020	1
UK	LONDON	RTB	20161021	1
UK	LONDON	RTB	20161022	1
UK	LONDON	RTB	20161023	1
UK	LONDON	RTB	20161024	1
UK	LONDON	RTB	20161025	4
UK	LONDON	RTB	20161026	2
UK	LONDON	RTB	20161027	2
UK	LONDON	RTB	20161028	2
UK	LONDON	RTB	20161029	2
UK	LONDON	RTB	20161030	2	4
UK	LONDON	RTB	20161031	2	4
UK	LONDON	RTB	20161101	2	4
UK	LONDON	RTB	20161102	2	4
UK	LONDON	RTB	20161103	2	4
UK	LONDON	RTB	20161104	2	4
UK	LONDON	RTB	20161105	2	4
UK	LONDON	RTB	20161106	2	4
UK	LONDON	RTB	20161107	2	4
UK	LONDON	RTB	20161108	2	4
UK	LONDON	RTB	20161109	2	4
UK	LONDON	RTB	20161110	2	4
UK	LONDON	RTB	20161111	2	4
UK	LONDON	RTB	20161112	2	4
UK	LONDON	RTB	20161113	2	4
UK	LONDON	RTB	20161114	2	4
UK	LONDON	RTB	20161115	2	4
UK	LONDON	RTB	20161116	2	4
UK	LONDON	RTB	20161117	2	4
UK	LONDON	RTB	20161118	2	4
UK	LONDON	RTB	20161119	2	4
UK	LONDON	RTB	20161120	2	4
UK	LONDON	RTB	20161121	2	4
UK	LONDON	RTB	20161122	2	4
UK	LONDON	RTB	20161123	2	4
UK	LONDON	RTB	20161124	2	4
UK	LONDON	RTB	20161125	2	2
UK	LONDON	RTB	20161126	2	2
UK	LONDON	RTB	20161127	2	2
UK	LONDON	RTB	20161128	2	2
UK	LONDON	RTB	20161129	2	2

Reeza · Posted 11-29-2016 06:31 PM

Are you grouping by the first three variables? Also, is your date a SAS date, numeric with a date format?

A self join is what's typically used to compute these types of calculations. If you have SAS/ETS licensed, PROC EXPAND would be a straightforward option.

gmorla2016 · Posted 11-29-2016 06:42 PM

Reeza, thanks for the reply.

yes i am grouping by first three variables and my sasdate is in dateformat. i dont have sas ets, can you guide me in writing the self join.

Reeza · Posted 11-29-2016 07:03 PM

Proc SQL;

create table want as select a.*, max(b.var) as maxvar

from have as a

join have as b

on a.var1=b.var1

..../*first thee vars go here*/

AND b.date between a.date and a.date+30 /*plus or minus depending on looking forward or back*/

group by a.var1, a.var2, a.var3, a.date, a.var /*all variables from table a*/;

quit;

mkeintz · Posted 11-29-2016 09:53 PM

Is your source table already ordered by date within the field1/field2/field3 groups? You mention "hive" so maybe not. But if it is, a DATA step is often more efficient than PROC SQL:

Starting with the 31st record per group the process gets maximum value of the PRIOR_30 array.

The PRIOR_30 array has these properties:

It's _TEMPORARY_ which means it's automatically retained, but not written to the resulting data set.
Array bounds are 0 through 29. That's so it can continually hold the rolling 30 preceding days, where the position in the array is determined by dividing variable I by 30 and using the remainder [=mod(I,30)] to specify which array element. The result is that each value of VAR replaces the 30th preceding value (e.g. record 31 replaces record 1 in element 1 of the array, since each has a remainder of 1 when divided by 30). Note records 30,60 etc go into element 0.

Untested program:

data want ;
  array prior_30{0:29} _temporary_   ;
  set have;
  by field1 field2 field3 notsorted;
  I+1;
  if first.field3 then do;
    call missing(of prior_30{*});
    I=1;
  end;
  if I>30 then maxvar=max(of prior_30{*});
  prior_30{mod(I,30)}=var;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Reeza · Posted 11-29-2016 11:13 PM

@mkeintz solution Also assumes that you have no gaps in your data so you need a record for each date. There's nothing wrong with this, just something to consider.

mkeintz · Posted 11-30-2016 10:19 AM

I agree on the issue of "holes" in the data series. Here's a modificatioon that accomodates gaps in the date sequence. In effect it acts as if the missing dates were present but VAR1 had a missing value:

data want (drop=_:);
  array prior_30{0:29} _temporary_   ;

  set have;
  by field1 field2 field3 notsorted;

  if first.field3 then do;
    call missing(of prior_30{*});
    _first_date=date;
  end;
  retain _first_date; 

  /* See if there are holes between current and preceding dates  */
  /* If so, assign missing values to corresponding array elements */
  _date_increment=dif(date);
  if _date_increment>1 then do _d=1 to _date_increment-1;
    prior_30{mod(date-_d,30)}=.;
  end;

  if date>=_first_date+30 then maxvar=max(of prior_30{*});
  prior_30{mod(date,30)}=var;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 11-30-2016 04:50 AM

If you don't have big table, try SQL.


data have;
infile cards truncover expandtabs;
input FIELD1 $ FIELD2 $ FIELD3 $ DATE :yymmdd10. VAR;
format date yymmdd10.;
cards;
USA	TX	RTB	20160930	1
USA	TX	RTB	20161001	1
USA	TX	RTB	20161002	1
USA	TX	RTB	20161003	1
USA	TX	RTB	20161004	1
USA	TX	RTB	20161005	2
USA	TX	RTB	20161006	2
USA	TX	RTB	20161007	2
USA	TX	RTB	20161008	2
USA	TX	RTB	20161009	1
;
run;
proc sql;
select *,(select max(var) from have 
where field1=a.field1 and field2=a.field2 and field3=field3
and date between a.date-30 and a.date) as max_val
 from have as a;
quit;

Classroom Training Available!