I have a huge dataset. Trying something like below.
I have 10 customers who gets repeated in 3 months. Trying to capture the repeated customers in each month and new customers.
For Jan (Cust 1,cust3, cust4 repeats in other months)--- Overlap
For Jan( Cust 2 and cust 10 are new)
Given Dataset:
Month Cust
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
Output want:
Month Total Overlap New
jan 5 3 2
feb 3 3 0
mar 3 2 1
Thanks for checking
Hi @Kalai2008 Pretty straight forward SQL
data have;
input month $ cust $;
cards;
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
;
proc sql;
create table want as
select a.month,count(distinct b.cust) as overlap,(count(distinct a.cust)-calculated overlap) as New
from have a left join have b
on a.cust=b.cust and a.month ne b.month
group by a.month
order by a.month;
quit;
Hi @Kalai2008 Pretty straight forward SQL
data have;
input month $ cust $;
cards;
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
;
proc sql;
create table want as
select a.month,count(distinct b.cust) as overlap,(count(distinct a.cust)-calculated overlap) as New
from have a left join have b
on a.cust=b.cust and a.month ne b.month
group by a.month
order by a.month;
quit;
I agree with you @Reeza . SQL is ready meals and convenience at least for this solution. But since we are doing an equi and non equi filter in the sub-setting pattern makes the SQL optimizer choose the SORT MERGE(Magic=102) JOIN algorithm and optimizes.
Of course, I am certain a more programming solution will likely give you better performance but for a long and narrow dataset, I am holding faith that this approach should suffice, though your point is well taken 🙂
Hi @Kalai2008 Should you have enough memory , you could try HASH
data want ;
if _n_=1 then do;
if 0 then set have have(rename=(month=_month));
dcl hash H (dataset:'have(rename=(month=_month))',multidata:'y') ;
h.definekey ("cust") ;
h.definedata ("_month") ;
h.definedone () ;
end;
do Total=1 by 1 until(last.month);
set have;
by month notsorted;
do rc=h.find() by 0 while(rc=0);
if month ne _month then do; Overlap=sum(Overlap,1);leave;end;
rc=h.find_next();
end;
end;
New=Total-Overlap;
drop rc _month cust;
run;
Assuming the data has been sorted by month.
data have;
input month $ cust $;
cards;
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
;
proc sort data=have;by month;run;
proc sql;
create table cust as
select distinct cust from have group by cust having count(distinct month)=1;
create table month as
select month,count(distinct cust) as total from have group by month;
quit;
data temp;
if _n_=1 then do;
if 0 then set month;
declare hash h(dataset:'cust',hashexp:20);
h.definekey('cust');
h.definedone();
end;
set have;
by month;
if first.month then new=0;
if h.check()=0 then new+1;
if last.month;
keep month new;
run;
data want;
merge month temp;
by month ;
overlap=total-new;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.