I have a huge dataset. Trying something like below.
I have 10 customers who gets repeated in 3 months. Trying to capture the repeated customers in each month and new customers.
For Jan (Cust 1,cust3, cust4 repeats in other months)--- Overlap
For Jan( Cust 2 and cust 10 are new)
Given Dataset:
Month Cust
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
Output want:
Month Total Overlap New
jan 5 3 2
feb 3 3 0
mar 3 2 1
Thanks for checking
Hi @Kalai2008 Pretty straight forward SQL
data have;
input month $ cust $;
cards;
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
;
proc sql;
create table want as
select a.month,count(distinct b.cust) as overlap,(count(distinct a.cust)-calculated overlap) as New
from have a left join have b
on a.cust=b.cust and a.month ne b.month
group by a.month
order by a.month;
quit;
Hi @Kalai2008 Pretty straight forward SQL
data have;
input month $ cust $;
cards;
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
;
proc sql;
create table want as
select a.month,count(distinct b.cust) as overlap,(count(distinct a.cust)-calculated overlap) as New
from have a left join have b
on a.cust=b.cust and a.month ne b.month
group by a.month
order by a.month;
quit;
I agree with you @Reeza . SQL is ready meals and convenience at least for this solution. But since we are doing an equi and non equi filter in the sub-setting pattern makes the SQL optimizer choose the SORT MERGE(Magic=102) JOIN algorithm and optimizes.
Of course, I am certain a more programming solution will likely give you better performance but for a long and narrow dataset, I am holding faith that this approach should suffice, though your point is well taken 🙂
Hi @Kalai2008 Should you have enough memory , you could try HASH
data want ;
if _n_=1 then do;
if 0 then set have have(rename=(month=_month));
dcl hash H (dataset:'have(rename=(month=_month))',multidata:'y') ;
h.definekey ("cust") ;
h.definedata ("_month") ;
h.definedone () ;
end;
do Total=1 by 1 until(last.month);
set have;
by month notsorted;
do rc=h.find() by 0 while(rc=0);
if month ne _month then do; Overlap=sum(Overlap,1);leave;end;
rc=h.find_next();
end;
end;
New=Total-Overlap;
drop rc _month cust;
run;
Assuming the data has been sorted by month.
data have;
input month $ cust $;
cards;
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
;
proc sort data=have;by month;run;
proc sql;
create table cust as
select distinct cust from have group by cust having count(distinct month)=1;
create table month as
select month,count(distinct cust) as total from have group by month;
quit;
data temp;
if _n_=1 then do;
if 0 then set month;
declare hash h(dataset:'cust',hashexp:20);
h.definekey('cust');
h.definedone();
end;
set have;
by month;
if first.month then new=0;
if h.check()=0 then new+1;
if last.month;
keep month new;
run;
data want;
merge month temp;
by month ;
overlap=total-new;
run;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.