Solved: Re: Defining a variable within a date interval

Bright · Posted 06-19-2021 12:43 PM

Hello,

I have a dataset looks like this:

Location	Date	Number of warnings (N)	Number of correct warnings (T)
A	06/19/2020	3	2
A	06/29/2020	5	3
A	06/30/2020	1	1
A	07/1/2020	4	3
A	07/05/2020	7	1
A	07/15/2020	2	0
A	07/19/2020	3	1
A	….	…	…
B	05/12/2020	4	4
B	05/24/2020	1	0
B	05/27/2020	3	2
B	…	…	….

For each specific “Date” I want to calculate R=Sum(T)/Sum(N) during the last month. For example, if we consider Location “A” and Date “07/19/2020”, I want to have R=(0+1+3+1+3+2)/(2+7+4+1+5+3).

Any help is appreciated!

Thanks

mkeintz · Posted 06-19-2021 05:04 PM

Make 2 arrays, NVALS and TVALS, indexed by date. Set the array bounds to be the lowest and highest expected dates. Then for each new observation, sum the preceding 30 elements of the two arrays, and get the ration of those sums:

data want (drop=_:);
  set have;
  by location;
  array tvals{%sysevalf("01jan2019"d):%sysevalf("31dec2020"d)} _temporary_;
  array nvals{%sysevalf("01jan2019"d):%sysevalf("31dec2020"d)} _temporary_;
  if first.location=1 then call missing(of tvals{*},of nvals{*},tsum,nsum);
  tvals{date}=t;
  nvals{date}=n;
  do _d=1 to 30;
    tsum=sum(0,tsum,tvals{date-_d});
    nsum=sum(0,nsum,nvals{date-_d});
  end;
  ratio=divide(tsum,nsum);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

sbxkoenk · Posted 06-19-2021 01:40 PM

Hello,

Here you are.

A hash table look-up would be much faster though.

Solution published here is fine unless you have Gigabytes of data.

PROC DATASETS library=WORK memtype=DATA NoList;
 delete have0-have2 want ; run;
QUIT;

data have0;
infile cards delimiter='|';
input Location $ Date $ Nbr_warnings_N Nbr_correct_warnings_T;
datedate=input(date,mmddyy10.);
datedateMinus1Month=INTNX('MONTH',datedate,-1,'SAMEDAY');
format datedate datedateMinus1Month date9.; 
cards;
A|06/19/2020|3|2
A|06/29/2020|5|3
A|06/30/2020|1|1
A|07/01/2020|4|3
A|07/05/2020|7|1
A|07/15/2020|2|0
A|07/19/2020|3|1
B|05/12/2020|4|4
B|05/24/2020|1|0
B|05/27/2020|3|2
;
run;

PROC SQL noprint;
 create table work.have1 as
 select   t1.Location , t1.DateDate , t1.datedateMinus1Month 
        , t2.DateDate as DateDateCountsQ 
        , t2.Nbr_warnings_N as NcountsQ , t2.Nbr_correct_warnings_T as TcountsQ
 from
   work.have0(drop=Date) t1
 , work.have0            t2
 where     t1.Location = t2.Location
       AND t1.DateDateMinus1Month <= t2.DateDate < t1.DateDate
 order by t1.Location, t1.DateDate  
;
QUIT;

PROC MEANS data=work.have1 SUM noprint NWAY;
 CLASS Location DateDate;
 var NcountsQ TcountsQ;
 output out=have2 sum= / autoname;
run;

data want;
 set have2(DROP=_:);
 R=TcountsQ_Sum/NcountsQ_Sum;
 if Location='A' And DateDate='19JUL2020'd then R_check=(0+1+3+1+3+2)/(2+7+4+1+5+3);
 else R_check=.;
run;
/* end of program */

Cheers,

Koen

Bright · Posted 06-19-2021 03:56 PM

Thanks Koen, I have a large dataset ...

mkeintz · Posted 06-19-2021 04:14 PM

What is your definition of "last month". In your example, you apparently mean all the dates including the date exactly one month prior (i.e. for 7/19/2020 totals, include 6/19/2020). Is that correct?

On the date 1/31/2020 you want not only all of January, but also the last date of December, right?

If so, then for October 30, 2020 you would want Sep 30-Oct30. But consider October 31, 2020. There is no Sep 31, so would it be Oct 1- Oct31, or Sep 30-Oct31? In such cases, what is your definition of "the date exactly one month prior"?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Bright · Posted 06-19-2021 04:19 PM

I want 30 days back, excluding the day itself.

mkeintz · Posted 06-19-2021 05:04 PM

Make 2 arrays, NVALS and TVALS, indexed by date. Set the array bounds to be the lowest and highest expected dates. Then for each new observation, sum the preceding 30 elements of the two arrays, and get the ration of those sums:

data want (drop=_:);
  set have;
  by location;
  array tvals{%sysevalf("01jan2019"d):%sysevalf("31dec2020"d)} _temporary_;
  array nvals{%sysevalf("01jan2019"d):%sysevalf("31dec2020"d)} _temporary_;
  if first.location=1 then call missing(of tvals{*},of nvals{*},tsum,nsum);
  tvals{date}=t;
  nvals{date}=n;
  do _d=1 to 30;
    tsum=sum(0,tsum,tvals{date-_d});
    nsum=sum(0,nsum,nvals{date-_d});
  end;
  ratio=divide(tsum,nsum);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Patrick · Posted 06-20-2021 01:26 AM

Below should work without consuming too much resources.

data have;
  infile datalines truncover;
  input Location $ Date:mmddyy10. N_warnings N_correct_warnings;
  format date date9.;
  datalines;
A 06/19/2020 3 2
A 06/29/2020 5 3
A 06/30/2020 1 1
A 7/01/2020 4 3
A 7/05/2020 7 1
A 07/15/2020 2 0
A 07/19/2020 3 1
A 07/21/2020 0 0
A 07/22/2020 0 0
A 08/01/2020 0 0
A 08/05/2020 0 0
A 08/15/2020 0 0
B 5/12/2020 4 4
B 05/24/2020 1 0
B 05/27/2020 3 2
;

/* &lb: look_back days */
%let lb=30;

data want(drop=_:);
  set have;
  by location date;

  /* arrays to store values of last &lb days */
  array nw  {&lb} 8 _temporary_;
  array ncw {&lb} 8 _temporary_;
  array dt  {&lb} 8 _temporary_;

  if first.location then 
    do;
      call missing(of nw[*], of ncw[*], of dt[*]);
    end;

  else
  if not first.location then
    do;
      /* delete values for array elements not meeting date range condition */
      dt[mod(date,&lb)+1]=date;
      do _i=1 to dim(dt);
        if dt[_i]=date or .<dt[_i]<date-&lb then
          do;
            _ind=mod(dt[_i],&lb)+1;
            call missing(nw[_ind]);
            call missing(ncw[_ind]);
            call missing(dt[_ind]);
          end; 
      end;
    end;
  
  /* create desired sums and ratios */
  nw_sum  =sum(of nw[*]);
  ncw_sum =sum(of ncw[*]);
  ratio=divide(sum(of ncw[*]),sum(of nw[*]));

  /* prep for next row: populate array elements with current row values */
  /* - only at the end to not include into calculations for current row */
  _ind=mod(date,&lb)+1;
  nw[_ind]  =N_warnings;
  ncw[_ind] =N_correct_warnings;
  dt[_ind]  =date;

run;

PaigeMiller · Posted 06-20-2021 06:50 AM

A PROC SUMMARY solution (because PROC SUMMARY is my favorite PROC 😁 ) with a MULTILABEL format. I didn't do the division at the end, but you can add that in yourself.

data have0;
infile cards delimiter='|' truncover;
input Location $ Date :mmddyy10. Nbr_warnings_N Nbr_correct_warnings_T;
cards;
A|06/19/2020|3|2
A|06/29/2020|5|3
A|06/30/2020|1|1
A|07/01/2020|4|3
A|07/05/2020|7|1
A|07/15/2020|2|0
A|07/19/2020|3|1
B|05/12/2020|4|4
B|05/24/2020|1|0
B|05/27/2020|3|2
;
data _null_;
    set have0 end=eof;
    if _n_=1 then call execute('proc format; value df (multilabel)');
    call execute(cats(date-31,'-',date-1,'=',date));
    if eof then call execute ('; run;');
run;
proc summary data=have0 nway;
    class location;
    class date / mlf;
    var nbr_warnings_n nbr_correct_warnings_t;
    output out=_sums_ sum=/autoname;
    format date df.;
run;
data _sums1_;
    set _sums_;
    ndate=input(left(date),7.);
    format ndate mmddyy10.;
run;

--
Paige Miller

Ksharp · Posted 06-20-2021 07:52 AM

Assuming there were no tie for DATE .

data have;
  infile datalines truncover;
  input Location $ Date:mmddyy10. N T;
  format date date9.;
  datalines;
A 06/19/2020 3 2
A 06/29/2020 5 3
A 06/30/2020 1 1
A 7/01/2020 4 3
A 7/05/2020 7 1
A 07/15/2020 2 0
A 07/19/2020 3 1
A 07/21/2020 0 0
A 07/22/2020 0 0
A 08/01/2020 0 0
A 08/05/2020 0 0
A 08/15/2020 0 0
B 5/12/2020 4 4
B 05/24/2020 1 0
B 05/27/2020 3 2
;

data want;
 if _n_=1 then do;
  if 0 then set have;
  declare hash h();
  h.definekey('date');
  h.definedata('N','T');
  h.definedone();
 end;
set have;
by Location;
if first.Location then h.clear();
h.add();
sum_n=0;sum_t=0;
do i=date-1 to intnx('month',date,-1,'s') by -1;
  if h.find(key:i)=0 then do;sum_n+n;sum_t+t;end;
end;
want=divide(sum_t,sum_n);
drop i sum_: ;
run;

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!