BookmarkSubscribeRSS Feed
Kaushansky
Obsidian | Level 7

Hello I am trying to compare 2 dates using arrays and create a counter when the dates are within 30 days of each other. It does not seem to work. Attached is my SAS  dataset and below is my code:

 

data master_utp_3; 
set sas.master_utp2;
by luid;
 
if first.luid then do;
*-- Array for TRANS DATE to compare against impressions and clicks for time difference --*;
  array trans_dt {*} trans_event_dt;
*-- Array for DISPLAY DATE to compare against impressions and clicks for time difference --*;
  array dis_dt [*] display_imps_event_dt;
end;
 
*-- Checking to ensure the impression or click was 30-days prior to the transaction --*;
do i=1 to dim(trans_dt) until(_check_pre_trans_display>0);
 
_check_pre_trans_display = (intck('day',display_imps_event_dt,trans_dt(i)))<=30
and
(intck('day',display_imps_event_dt,trans_dt(i)))>=0;
end;
 
if _check_pre_trans_display>=1 then do;
pre_display_imps+1; 
end;
run;
12 REPLIES 12
Tom
Super User Tom
Super User

Why would you make an array with only one variable in it?  Why not just reference the variable directly instead of going to the trouble of accessing it indirectly via an index into an array with only one element?

Kaushansky
Obsidian | Level 7

I cannot reference it directly, I need column display_imps_event_dt to check whether it is within 30 days of any of the values in column trans_event_dt and the rows are different. This is just one case (ID), other IDs will have different values and rows, etc.

Tom
Super User Tom
Super User

Arrays have nothing to do with OBSERVATIONS.   They are used to reference multiple variables of the observation you are currently processing.  Do you have multiple variables in your dataset?  or do you want to transpose the data from multiple observations into multiple variables?

 

I cannot see your data since you posted a FILE instead of CODE.  Also please provide a verbal description of what you have as I am not sure what an "impression" and a "click" are.  Nor how you have coded those in your dataset.

Patrick
Opal | Level 21

The ARRAY statement is used to associate a name with a list of variables

 

An array statement requires the data to be on a single row in multiple variables. That's not your data structure and though using an array statement won't work.

Patrick_0-1696809789569.png

 

Given your data structure can you please show how the desired result should look like?

For example for row 10 would you want a count of any trans_event_date that's within 30 days of 2022-04-04? Or something else? 

Kaushansky
Obsidian | Level 7

Yes, exactly row 10 (2022-04-04) should be within 30 days of any of the column trans_event_dt (dates). Doesn't an Array capture all of the row instances in a set for comparison?

Patrick
Opal | Level 21

@Kaushansky wrote:

Yes, exactly row 10 (2022-04-04) should be within 30 days of any of the column trans_event_dt (dates). Doesn't an Array capture all of the row instances in a set for comparison?


An array works within a single row (observation) for multiple variables but not over multiple observations for a single variable. 

You haven't provided the desired result which is why below code provides options based on guessing.

Spoiler
data work.have;
  infile datalines dsd truncover;
  input luid:BEST12. trans_event_dt:YYMMDD10. display_imps_event_dt:YYMMDD10.;
  format luid BEST12. trans_event_dt YYMMDD10. display_imps_event_dt YYMMDD10.;
datalines4;
1019715261,2023-01-20,
1019715261,2023-01-24,
1019715261,2023-02-03,
1019715261,2023-03-01,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-29,
1019715261,,2022-04-04
1019715261,,2022-08-13
1019715261,,2022-09-01
1019715261,,2022-09-27
1019715261,,2022-09-27
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-03-03
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-13
1019715261,,2023-03-17
1019715261,,2023-03-17
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-21
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,
1019715261,,
;;;;

data work.want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'work.have(where=(not missing(trans_event_dt)))', multidata:'y');
      h1.defineKey('luid');
      h1.defineData('trans_event_dt');
      h1.defineDone();
    end;

  set have;
  by luid;

  if first.luid then count1=0;
  count2=0;
  h1.reset_dup();
  do while(h1.do_over() eq 0);
    if trans_event_dt-30 <= display_imps_event_dt <= trans_event_dt then
      do;
        count1+1;
        count2=sum(count2,1);
      end;
  end;

  if first.luid then count3=0;
  count4=0;
  h1.reset_dup();
  do while(h1.do_over() eq 0);
    if trans_event_dt-30 <= display_imps_event_dt <= trans_event_dt then
      do;
        count3+1;
        count4=sum(count4,1);
        leave;
      end;
  end;

run;
Kaushansky
Obsidian | Level 7

Thank you!

I think the HASH structure works, albeit I am not familiar with it as it is new. Could you explain what it is actually doing?

Tom
Super User Tom
Super User

create a counter when the dates are within 30 days of each other

Counter of what?  The number of "trans" within 30 days of the "imp"?  Or the other way around?

Is that 30 days before?  30 days after?  Is it 15 before and 15 after?

 

What about the same day?   29 days before plus today? 15 before and 14 after?  14 before and 15 after? 29 days after plus today?

 

Or do you want to count DAYS?  Perhaps: How many of the 30 preceding days had the other type of activity?

 

Tom
Super User Tom
Super User

So I did download that dataset and look at it.  You appear to have two datasets mashed into one as no observations have a value in both date fields.

 

So it is probably easier to understand what you are trying to do it we separate them.  Let's use easier to type variable names and dataset names also.

 

So you have TRANS (transactions or orders perhaps) dataset and an IMPRESSIONS (site visits?) dataset that each have an ID variable and a DATE.

data trans;
  input id :$12. date :yymmdd. ;
  format date yymmdd10.;
cards;
1019715261 2023-01-20 
1019715261 2023-01-24 
1019715261 2023-02-03 
1019715261 2023-03-01 
1019715261 2023-03-25 
1019715261 2023-03-25  
1019715261 2023-03-25 
1019715261 2023-03-25 
1019715261 2023-03-29 

data impressions;
  input id :$12. date :yymmdd.;
  format date yymmdd10.;
cards;
1019715261 2022-04-04
1019715261 2022-08-13
1019715261 2022-09-01
1019715261 2022-09-27
1019715261 2022-09-27
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-03-03
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-13
1019715261 2023-03-17
1019715261 2023-03-17
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-21
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
;

So SQL is a good language for describing set operations like number of x within some criteria.

Here is a query to count the total number of IMPRESSIONS and the number of distinct days they occur on that fall within the date range that starts 30 before the TRANS date and includes the TRANS date.

proc sql;
create table count1 as
  select a.id,a.date,count(b.id) as n_imps,count(distinct b.date) as imp_days
  from trans a
    left join impressions b
    on a.id = b.id and a.date-30 <= b.date <= a.date
  group by 1,2
;
quit;

Result:

Obs        id              date    n_imps    imp_days

 1     1019715261    2023-01-20       0          0
 2     1019715261    2023-01-24       0          0
 3     1019715261    2023-02-03       0          0
 4     1019715261    2023-03-01      15          1
 5     1019715261    2023-03-25      84          7
 6     1019715261    2023-03-29      21          7

That is getting confused by the repetitions of TRANS records on the same DATE.  Let's collapse those first.

proc sql;
create table count1 as
  select a.id,a.date,a.n_trans,count(b.id) as n_imps,count(distinct b.date) as imp_days
  from (select id,date,count(*) as n_trans from trans group by id,date) a
    left join impressions b
    on a.id = b.id and a.date-30 <= b.date <= a.date
  group by 1,2,3
;
quit;

Result

Obs        id              date    n_trans    n_imps    imp_days

 1     1019715261    2023-01-20       1          0          0
 2     1019715261    2023-01-24       1          0          0
 3     1019715261    2023-02-03       1          0          0
 4     1019715261    2023-03-01       1         15          1
 5     1019715261    2023-03-25       4         21          7
 6     1019715261    2023-03-29       1         21          7

 

Is that something like what you are looking for?

Is your data small enough that the SQL query runs quickly enough?  or do you still need help in crafting something using data step logic that might run faster?

Kaushansky
Obsidian | Level 7

Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).

 

 

Tom
Super User Tom
Super User

@Kaushansky wrote:

Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).

 

 


To use the SQL query with a single dataset as input then just use the dataset as the inputs for both halves of the join, with appropriate renaming for variable references and filtering out the missing values.

Tom
Super User Tom
Super User

@Kaushansky wrote:

Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).

 

 


You can brute force it with a temporary array that uses the DATE as the index.

Just make the array large enough for the range of dates you expect (or pre calculate the range you have).

data have;
  input id :$12. trans :yymmdd. imps :yymmdd.;
  format trans imps yymmdd10.;
cards;
1019715261 2023-01-20 .
1019715261 2023-01-24 .
1019715261 2023-02-03 .
1019715261 2023-03-01 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-29 .
1019715261 . 2022-04-04
1019715261 . 2022-08-13
1019715261 . 2022-09-01
1019715261 . 2022-09-27
1019715261 . 2022-09-27
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-03-03
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-13
1019715261 . 2023-03-17
1019715261 . 2023-03-17
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-21
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . .
1019715261 . .
;

%let minday=%sysfunc(mdy(1,1,2022));
%let maxday=%sysfunc(mdy(12,31,2023));

data want;
  array tranday [&minday:&maxday] _temporary_;
  array impday [&minday:&maxday] _temporary_;
  call missing(of tranday[*] impday[*]);
  do until(last.id);
    set have ;
    by id;
    min_trans=min(min_trans,trans);
    max_trans=max(max_trans,trans);
    if trans then tranday[trans]+1;
    if imps then impday[imps]+1;
  end;
  do date=min_trans to max_trans ;
    n_trans=tranday[date];
    n_imps=0;
    n_impdays=0;
    if n_trans then do date2=date-30 to date;
      n_imps + impday[date2];
      n_impdays + (impday[date2]>0);
    end;
    if n_trans then output;
  end;
  format date yymmdd10.;
  keep id date n_: ;
run;

Result

Obs        id              date    n_trans    n_imps    n_impdays

 1     1019715261    2023-01-20       1          0          0
 2     1019715261    2023-01-24       1          0          0
 3     1019715261    2023-02-03       1          0          0
 4     1019715261    2023-03-01       1         15          1
 5     1019715261    2023-03-25       4         21          7
 6     1019715261    2023-03-29       1         21          7

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1443 views
  • 0 likes
  • 3 in conversation