Re: Issues with Arrays and Dates

Kaushansky · Posted 10-08-2023 07:41 PM

Hello I am trying to compare 2 dates using arrays and create a counter when the dates are within 30 days of each other. It does not seem to work. Attached is my SAS dataset and below is my code:

data master_utp_3;

set sas.master_utp2;

by luid;

if first.luid then do;

*-- Array for TRANS DATE to compare against impressions and clicks for time difference --*;

array trans_dt {*} trans_event_dt;

*-- Array for DISPLAY DATE to compare against impressions and clicks for time difference --*;

array dis_dt [*] display_imps_event_dt;

end;

*-- Checking to ensure the impression or click was 30-days prior to the transaction --*;

do i=1 to dim(trans_dt) until(_check_pre_trans_display>0);

_check_pre_trans_display = (intck('day',display_imps_event_dt,trans_dt(i)))<=30

and

(intck('day',display_imps_event_dt,trans_dt(i)))>=0;

end;

if _check_pre_trans_display>=1 then do;

pre_display_imps+1;

end;

run;

Tom · Posted 10-08-2023 07:59 PM

Why would you make an array with only one variable in it? Why not just reference the variable directly instead of going to the trouble of accessing it indirectly via an index into an array with only one element?

Kaushansky · Posted 10-08-2023 09:40 PM

I cannot reference it directly, I need column display_imps_event_dt to check whether it is within 30 days of any of the values in column trans_event_dt and the rows are different. This is just one case (ID), other IDs will have different values and rows, etc.

Tom · Posted 10-08-2023 10:25 PM

Arrays have nothing to do with OBSERVATIONS. They are used to reference multiple variables of the observation you are currently processing. Do you have multiple variables in your dataset? or do you want to transpose the data from multiple observations into multiple variables?

I cannot see your data since you posted a FILE instead of CODE. Also please provide a verbal description of what you have as I am not sure what an "impression" and a "click" are. Nor how you have coded those in your dataset.

Patrick · Posted 10-08-2023 08:08 PM

The ARRAY statement is used to associate a name with a list of variables.

An array statement requires the data to be on a single row in multiple variables. That's not your data structure and though using an array statement won't work.

Given your data structure can you please show how the desired result should look like?

For example for row 10 would you want a count of any trans_event_date that's within 30 days of 2022-04-04? Or something else?

Kaushansky · Posted 10-08-2023 09:37 PM

Yes, exactly row 10 (2022-04-04) should be within 30 days of any of the column trans_event_dt (dates). Doesn't an Array capture all of the row instances in a set for comparison?

Patrick · Posted 10-08-2023 11:08 PM

@Kaushansky wrote:

Yes, exactly row 10 (2022-04-04) should be within 30 days of any of the column trans_event_dt (dates). Doesn't an Array capture all of the row instances in a set for comparison?

An array works within a single row (observation) for multiple variables but not over multiple observations for a single variable.

You haven't provided the desired result which is why below code provides options based on guessing.

Spoiler

data work.have;
  infile datalines dsd truncover;
  input luid:BEST12. trans_event_dt:YYMMDD10. display_imps_event_dt:YYMMDD10.;
  format luid BEST12. trans_event_dt YYMMDD10. display_imps_event_dt YYMMDD10.;
datalines4;
1019715261,2023-01-20,
1019715261,2023-01-24,
1019715261,2023-02-03,
1019715261,2023-03-01,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-29,
1019715261,,2022-04-04
1019715261,,2022-08-13
1019715261,,2022-09-01
1019715261,,2022-09-27
1019715261,,2022-09-27
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-03-03
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-13
1019715261,,2023-03-17
1019715261,,2023-03-17
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-21
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,
1019715261,,
;;;;

data work.want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'work.have(where=(not missing(trans_event_dt)))', multidata:'y');
      h1.defineKey('luid');
      h1.defineData('trans_event_dt');
      h1.defineDone();
    end;

  set have;
  by luid;

  if first.luid then count1=0;
  count2=0;
  h1.reset_dup();
  do while(h1.do_over() eq 0);
    if trans_event_dt-30 <= display_imps_event_dt <= trans_event_dt then
      do;
        count1+1;
        count2=sum(count2,1);
      end;
  end;

  if first.luid then count3=0;
  count4=0;
  h1.reset_dup();
  do while(h1.do_over() eq 0);
    if trans_event_dt-30 <= display_imps_event_dt <= trans_event_dt then
      do;
        count3+1;
        count4=sum(count4,1);
        leave;
      end;
  end;

run;

data work.have; infile datalines dsd truncover; input luid:BEST12. trans_event_dt:YYMMDD10. display_imps_event_dt:YYMMDD10.; format luid BEST12. trans_event_dt YYMMDD10. display_imps_event_dt YYMMDD10.; datalines4; 1019715261,2023-01-20, 1019715261,2023-01-24, 1019715261,2023-02-03, 1019715261,2023-03-01, 1019715261,2023-03-25, 1019715261,2023-03-25, 1019715261,2023-03-25, 1019715261,2023-03-25, 1019715261,2023-03-29, 1019715261,,2022-04-04 1019715261,,2022-08-13 1019715261,,2022-09-01 1019715261,,2022-09-27 1019715261,,2022-09-27 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-02-13 1019715261,,2023-03-03 1019715261,,2023-03-11 1019715261,,2023-03-11 1019715261,,2023-03-11 1019715261,,2023-03-11 1019715261,,2023-03-11 1019715261,,2023-03-13 1019715261,,2023-03-17 1019715261,,2023-03-17 1019715261,,2023-03-18 1019715261,,2023-03-18 1019715261,,2023-03-18 1019715261,,2023-03-18 1019715261,,2023-03-18 1019715261,,2023-03-18 1019715261,,2023-03-21 1019715261,,2023-03-25 1019715261,,2023-03-25 1019715261,,2023-03-25 1019715261,,2023-03-25 1019715261,,2023-03-25 1019715261,, 1019715261,, ;;;; data work.want; if _n_=1 then do; dcl hash h1(dataset:'work.have(where=(not missing(trans_event_dt)))', multidata:'y'); h1.defineKey('luid'); h1.defineData('trans_event_dt'); h1.defineDone(); end; set have; by luid; if first.luid then count1=0; count2=0; h1.reset_dup(); do while(h1.do_over() eq 0); if trans_event_dt-30 <= display_imps_event_dt <= trans_event_dt then do; count1+1; count2=sum(count2,1); end; end; if first.luid then count3=0; count4=0; h1.reset_dup(); do while(h1.do_over() eq 0); if trans_event_dt-30 <= display_imps_event_dt <= trans_event_dt then do; count3+1; count4=sum(count4,1); leave; end; end; run;

Kaushansky · Posted 10-09-2023 11:11 AM

Thank you!

I think the HASH structure works, albeit I am not familiar with it as it is new. Could you explain what it is actually doing?

Tom · Posted 10-08-2023 11:02 PM

create a counter when the dates are within 30 days of each other

Counter of what? The number of "trans" within 30 days of the "imp"? Or the other way around?

Is that 30 days before? 30 days after? Is it 15 before and 15 after?

What about the same day? 29 days before plus today? 15 before and 14 after? 14 before and 15 after? 29 days after plus today?

Or do you want to count DAYS? Perhaps: How many of the 30 preceding days had the other type of activity?

Tom · Posted 10-08-2023 11:58 PM

So I did download that dataset and look at it. You appear to have two datasets mashed into one as no observations have a value in both date fields.

So it is probably easier to understand what you are trying to do it we separate them. Let's use easier to type variable names and dataset names also.

So you have TRANS (transactions or orders perhaps) dataset and an IMPRESSIONS (site visits?) dataset that each have an ID variable and a DATE.

data trans;
  input id :$12. date :yymmdd. ;
  format date yymmdd10.;
cards;
1019715261 2023-01-20 
1019715261 2023-01-24 
1019715261 2023-02-03 
1019715261 2023-03-01 
1019715261 2023-03-25 
1019715261 2023-03-25  
1019715261 2023-03-25 
1019715261 2023-03-25 
1019715261 2023-03-29 

data impressions;
  input id :$12. date :yymmdd.;
  format date yymmdd10.;
cards;
1019715261 2022-04-04
1019715261 2022-08-13
1019715261 2022-09-01
1019715261 2022-09-27
1019715261 2022-09-27
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-03-03
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-13
1019715261 2023-03-17
1019715261 2023-03-17
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-21
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
;

So SQL is a good language for describing set operations like number of x within some criteria.

Here is a query to count the total number of IMPRESSIONS and the number of distinct days they occur on that fall within the date range that starts 30 before the TRANS date and includes the TRANS date.

proc sql;
create table count1 as
  select a.id,a.date,count(b.id) as n_imps,count(distinct b.date) as imp_days
  from trans a
    left join impressions b
    on a.id = b.id and a.date-30 <= b.date <= a.date
  group by 1,2
;
quit;

Result:

Obs        id              date    n_imps    imp_days

 1     1019715261    2023-01-20       0          0
 2     1019715261    2023-01-24       0          0
 3     1019715261    2023-02-03       0          0
 4     1019715261    2023-03-01      15          1
 5     1019715261    2023-03-25      84          7
 6     1019715261    2023-03-29      21          7

That is getting confused by the repetitions of TRANS records on the same DATE. Let's collapse those first.

proc sql;
create table count1 as
  select a.id,a.date,a.n_trans,count(b.id) as n_imps,count(distinct b.date) as imp_days
  from (select id,date,count(*) as n_trans from trans group by id,date) a
    left join impressions b
    on a.id = b.id and a.date-30 <= b.date <= a.date
  group by 1,2,3
;
quit;

Result

Obs        id              date    n_trans    n_imps    imp_days

 1     1019715261    2023-01-20       1          0          0
 2     1019715261    2023-01-24       1          0          0
 3     1019715261    2023-02-03       1          0          0
 4     1019715261    2023-03-01       1         15          1
 5     1019715261    2023-03-25       4         21          7
 6     1019715261    2023-03-29       1         21          7

Is that something like what you are looking for?

Is your data small enough that the SQL query runs quickly enough? or do you still need help in crafting something using data step logic that might run faster?

Kaushansky · Posted 10-09-2023 11:01 AM

Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).

Tom · Posted 10-09-2023 11:04 AM

@Kaushansky wrote:

Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).

To use the SQL query with a single dataset as input then just use the dataset as the inputs for both halves of the join, with appropriate renaming for variable references and filtering out the missing values.

Tom · Posted 10-09-2023 12:08 PM

@Kaushansky wrote:

Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).

You can brute force it with a temporary array that uses the DATE as the index.

Just make the array large enough for the range of dates you expect (or pre calculate the range you have).

data have;
  input id :$12. trans :yymmdd. imps :yymmdd.;
  format trans imps yymmdd10.;
cards;
1019715261 2023-01-20 .
1019715261 2023-01-24 .
1019715261 2023-02-03 .
1019715261 2023-03-01 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-29 .
1019715261 . 2022-04-04
1019715261 . 2022-08-13
1019715261 . 2022-09-01
1019715261 . 2022-09-27
1019715261 . 2022-09-27
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-03-03
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-13
1019715261 . 2023-03-17
1019715261 . 2023-03-17
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-21
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . .
1019715261 . .
;

%let minday=%sysfunc(mdy(1,1,2022));
%let maxday=%sysfunc(mdy(12,31,2023));

data want;
  array tranday [&minday:&maxday] _temporary_;
  array impday [&minday:&maxday] _temporary_;
  call missing(of tranday[*] impday[*]);
  do until(last.id);
    set have ;
    by id;
    min_trans=min(min_trans,trans);
    max_trans=max(max_trans,trans);
    if trans then tranday[trans]+1;
    if imps then impday[imps]+1;
  end;
  do date=min_trans to max_trans ;
    n_trans=tranday[date];
    n_imps=0;
    n_impdays=0;
    if n_trans then do date2=date-30 to date;
      n_imps + impday[date2];
      n_impdays + (impday[date2]>0);
    end;
    if n_trans then output;
  end;
  format date yymmdd10.;
  keep id date n_: ;
run;

Result

Obs        id              date    n_trans    n_imps    n_impdays

 1     1019715261    2023-01-20       1          0          0
 2     1019715261    2023-01-24       1          0          0
 3     1019715261    2023-02-03       1          0          0
 4     1019715261    2023-03-01       1         15          1
 5     1019715261    2023-03-25       4         21          7
 6     1019715261    2023-03-29       1         21          7

Classroom Training Available!