Hello I am trying to compare 2 dates using arrays and create a counter when the dates are within 30 days of each other. It does not seem to work. Attached is my SAS dataset and below is my code:
Why would you make an array with only one variable in it? Why not just reference the variable directly instead of going to the trouble of accessing it indirectly via an index into an array with only one element?
I cannot reference it directly, I need column display_imps_event_dt to check whether it is within 30 days of any of the values in column trans_event_dt and the rows are different. This is just one case (ID), other IDs will have different values and rows, etc.
Arrays have nothing to do with OBSERVATIONS. They are used to reference multiple variables of the observation you are currently processing. Do you have multiple variables in your dataset? or do you want to transpose the data from multiple observations into multiple variables?
I cannot see your data since you posted a FILE instead of CODE. Also please provide a verbal description of what you have as I am not sure what an "impression" and a "click" are. Nor how you have coded those in your dataset.
The ARRAY statement is used to associate a name with a list of variables.
An array statement requires the data to be on a single row in multiple variables. That's not your data structure and though using an array statement won't work.
Given your data structure can you please show how the desired result should look like?
For example for row 10 would you want a count of any trans_event_date that's within 30 days of 2022-04-04? Or something else?
Yes, exactly row 10 (2022-04-04) should be within 30 days of any of the column trans_event_dt (dates). Doesn't an Array capture all of the row instances in a set for comparison?
@Kaushansky wrote:
Yes, exactly row 10 (2022-04-04) should be within 30 days of any of the column trans_event_dt (dates). Doesn't an Array capture all of the row instances in a set for comparison?
An array works within a single row (observation) for multiple variables but not over multiple observations for a single variable.
You haven't provided the desired result which is why below code provides options based on guessing.
data work.have;
infile datalines dsd truncover;
input luid:BEST12. trans_event_dt:YYMMDD10. display_imps_event_dt:YYMMDD10.;
format luid BEST12. trans_event_dt YYMMDD10. display_imps_event_dt YYMMDD10.;
datalines4;
1019715261,2023-01-20,
1019715261,2023-01-24,
1019715261,2023-02-03,
1019715261,2023-03-01,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-25,
1019715261,2023-03-29,
1019715261,,2022-04-04
1019715261,,2022-08-13
1019715261,,2022-09-01
1019715261,,2022-09-27
1019715261,,2022-09-27
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-02-13
1019715261,,2023-03-03
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-11
1019715261,,2023-03-13
1019715261,,2023-03-17
1019715261,,2023-03-17
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-18
1019715261,,2023-03-21
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,2023-03-25
1019715261,,
1019715261,,
;;;;
data work.want;
if _n_=1 then
do;
dcl hash h1(dataset:'work.have(where=(not missing(trans_event_dt)))', multidata:'y');
h1.defineKey('luid');
h1.defineData('trans_event_dt');
h1.defineDone();
end;
set have;
by luid;
if first.luid then count1=0;
count2=0;
h1.reset_dup();
do while(h1.do_over() eq 0);
if trans_event_dt-30 <= display_imps_event_dt <= trans_event_dt then
do;
count1+1;
count2=sum(count2,1);
end;
end;
if first.luid then count3=0;
count4=0;
h1.reset_dup();
do while(h1.do_over() eq 0);
if trans_event_dt-30 <= display_imps_event_dt <= trans_event_dt then
do;
count3+1;
count4=sum(count4,1);
leave;
end;
end;
run;
Thank you!
I think the HASH structure works, albeit I am not familiar with it as it is new. Could you explain what it is actually doing?
create a counter when the dates are within 30 days of each other
Counter of what? The number of "trans" within 30 days of the "imp"? Or the other way around?
Is that 30 days before? 30 days after? Is it 15 before and 15 after?
What about the same day? 29 days before plus today? 15 before and 14 after? 14 before and 15 after? 29 days after plus today?
Or do you want to count DAYS? Perhaps: How many of the 30 preceding days had the other type of activity?
So I did download that dataset and look at it. You appear to have two datasets mashed into one as no observations have a value in both date fields.
So it is probably easier to understand what you are trying to do it we separate them. Let's use easier to type variable names and dataset names also.
So you have TRANS (transactions or orders perhaps) dataset and an IMPRESSIONS (site visits?) dataset that each have an ID variable and a DATE.
data trans;
input id :$12. date :yymmdd. ;
format date yymmdd10.;
cards;
1019715261 2023-01-20
1019715261 2023-01-24
1019715261 2023-02-03
1019715261 2023-03-01
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-29
data impressions;
input id :$12. date :yymmdd.;
format date yymmdd10.;
cards;
1019715261 2022-04-04
1019715261 2022-08-13
1019715261 2022-09-01
1019715261 2022-09-27
1019715261 2022-09-27
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-02-13
1019715261 2023-03-03
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-11
1019715261 2023-03-13
1019715261 2023-03-17
1019715261 2023-03-17
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-18
1019715261 2023-03-21
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
1019715261 2023-03-25
;
So SQL is a good language for describing set operations like number of x within some criteria.
Here is a query to count the total number of IMPRESSIONS and the number of distinct days they occur on that fall within the date range that starts 30 before the TRANS date and includes the TRANS date.
proc sql;
create table count1 as
select a.id,a.date,count(b.id) as n_imps,count(distinct b.date) as imp_days
from trans a
left join impressions b
on a.id = b.id and a.date-30 <= b.date <= a.date
group by 1,2
;
quit;
Result:
Obs id date n_imps imp_days 1 1019715261 2023-01-20 0 0 2 1019715261 2023-01-24 0 0 3 1019715261 2023-02-03 0 0 4 1019715261 2023-03-01 15 1 5 1019715261 2023-03-25 84 7 6 1019715261 2023-03-29 21 7
That is getting confused by the repetitions of TRANS records on the same DATE. Let's collapse those first.
proc sql;
create table count1 as
select a.id,a.date,a.n_trans,count(b.id) as n_imps,count(distinct b.date) as imp_days
from (select id,date,count(*) as n_trans from trans group by id,date) a
left join impressions b
on a.id = b.id and a.date-30 <= b.date <= a.date
group by 1,2,3
;
quit;
Result
Obs id date n_trans n_imps imp_days 1 1019715261 2023-01-20 1 0 0 2 1019715261 2023-01-24 1 0 0 3 1019715261 2023-02-03 1 0 0 4 1019715261 2023-03-01 1 15 1 5 1019715261 2023-03-25 4 21 7 6 1019715261 2023-03-29 1 21 7
Is that something like what you are looking for?
Is your data small enough that the SQL query runs quickly enough? or do you still need help in crafting something using data step logic that might run faster?
Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).
@Kaushansky wrote:
Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).
To use the SQL query with a single dataset as input then just use the dataset as the inputs for both halves of the join, with appropriate renaming for variable references and filtering out the missing values.
@Kaushansky wrote:
Not exactly. The SQL is good to compare when the columns of dates are aligned, but....The dataset by design, has one set of dates (columns) in a separate column from another and does not align by row. Is there a way, perhaps using the LAG() function or a new function HASH() to store column (trans_event_dt) and (display_imps_event_dt) then compare the ensure that column (display_imps_event_dt) occurs 30-days prior to (trans_event_dt).
You can brute force it with a temporary array that uses the DATE as the index.
Just make the array large enough for the range of dates you expect (or pre calculate the range you have).
data have;
input id :$12. trans :yymmdd. imps :yymmdd.;
format trans imps yymmdd10.;
cards;
1019715261 2023-01-20 .
1019715261 2023-01-24 .
1019715261 2023-02-03 .
1019715261 2023-03-01 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-25 .
1019715261 2023-03-29 .
1019715261 . 2022-04-04
1019715261 . 2022-08-13
1019715261 . 2022-09-01
1019715261 . 2022-09-27
1019715261 . 2022-09-27
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-02-13
1019715261 . 2023-03-03
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-11
1019715261 . 2023-03-13
1019715261 . 2023-03-17
1019715261 . 2023-03-17
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-18
1019715261 . 2023-03-21
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . 2023-03-25
1019715261 . .
1019715261 . .
;
%let minday=%sysfunc(mdy(1,1,2022));
%let maxday=%sysfunc(mdy(12,31,2023));
data want;
array tranday [&minday:&maxday] _temporary_;
array impday [&minday:&maxday] _temporary_;
call missing(of tranday[*] impday[*]);
do until(last.id);
set have ;
by id;
min_trans=min(min_trans,trans);
max_trans=max(max_trans,trans);
if trans then tranday[trans]+1;
if imps then impday[imps]+1;
end;
do date=min_trans to max_trans ;
n_trans=tranday[date];
n_imps=0;
n_impdays=0;
if n_trans then do date2=date-30 to date;
n_imps + impday[date2];
n_impdays + (impday[date2]>0);
end;
if n_trans then output;
end;
format date yymmdd10.;
keep id date n_: ;
run;
Result
Obs id date n_trans n_imps n_impdays 1 1019715261 2023-01-20 1 0 0 2 1019715261 2023-01-24 1 0 0 3 1019715261 2023-02-03 1 0 0 4 1019715261 2023-03-01 1 15 1 5 1019715261 2023-03-25 4 21 7 6 1019715261 2023-03-29 1 21 7
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.