A Proc REPORT variable DEFINED with feature / GROUP will cause summary statistics to be computed. The default statistic is SUM, perfect for 'counting' a 0/1 flag variable.
Example:
data have_apriori_summary(keep=hotel_id guest_id is_repeated_guest);
call streaminit (123);
do hotel_id = 1 to 10;
do index = 1 to 100;
guest_id + 1;
is_repeated_guest = rand ('uniform') < 0.40;
output;
end;
end;
run;
proc report data=have_apriori_summary;
title "Guest retention results from apriori per guest summary";
column hotel_id is_repeated_guest;
define hotel_id / group;
define is_repeated_guest / 'Repeated Guests';
where hotel_id in (1,3,4);
run;
Output
Consider a slightly different scenario:
Suppose the data set contains transactional information, such as CHECK_IN_DATE, ROOM_ID, and GUEST_ID and the REPEAT flag is a marker for some customer service action at the desk. Taking a sum of the flag values over the transactions will NOT give the number of DISTINCT guests, the SUM will be the number of repeated guest visits. You will need to reduce the transaction data to a summary in order to count the number of distinct guests.
Example:
data have_checkin_log(keep=hotel_id room_id date guest_id is_repeated_guest);
call streaminit (123);
array tracker(7500) _temporary_;
do index = 1 to dim(tracker);
tracker(index) = rand('uniform') < 0.15;
end;
do hotel_id = 1 to 10;
do room_id = 1 to 50;
do date = '01jan2020'd to '31jan2020'd;
guest_id = rand('integer',1,7500);
is_repeated_guest = tracker(guest_id);
output;
end;
end;
end;
format date yymmdd10.;
run;
proc sql;
create table repeated_guests_summary as
select
hotel_id,
guest_id,
is_repeated_guest,
count(*) as repeated_visits_count
from have_checkin_log
where is_repeated_guest
group by hotel_id, guest_id
;
quit;
proc report data=repeated_guests_summary;
title "Repeated Guests and Visits";
column hotel_id is_repeated_guest repeated_visits_count;
define hotel_id / group;
define is_repeated_guest / 'Repeated Guests';
where hotel_id in (1,3,4);
run;
Output
... View more