Solved: Re: Help in getting unique values

Kalai2008 · Posted 01-07-2020 01:00 PM

I have a huge dataset and am trying something like the example below.

I have 10 customers who are repeated across 3 months. I am trying to count, for each month, the repeated customers and the new customers.

For Jan, cust1, cust3 and cust4 repeat in other months (Overlap).

For Jan, cust2 and cust5 are new.

Given dataset:

Month  Cust
jan    cust1
jan    cust2
jan    cust3
jan    cust4
jan    cust5
feb    cust1
feb    cust3
feb    cust4
mar    cust3
mar    cust4
mar    cust10

Output wanted:

Month  Total  Overlap  New
jan    5      3        2
feb    3      3        0
mar    3      2        1

Thanks for checking

Reeza · Posted 01-07-2020 01:24 PM

How do you overlap with future dates? It makes sense to look backwards for overlap, but how does looking forward help here? In January you have 5 unique individuals; 3 purchased again later on, but as of January they hadn't. And in March you look backwards, not forwards.

Are you sure this is what you want?

Kalai2008 · Posted 01-07-2020 01:26 PM

Yes, since this is historical data.

Reeza · Posted 01-07-2020 01:28 PM

So you want to look both forward and backward in time for overlap?

Kalai2008 · Posted 01-07-2020 01:29 PM

Yes, correct.

novinosrin · Posted 01-07-2020 02:03 PM

Hi @Kalai2008, this is pretty straightforward in SQL:

data have;
input month $ cust $;
cards;
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
;
proc sql;
create table want as
select a.month,
       count(distinct a.cust) as Total,
       /* b.cust is non-missing only when the customer also appears in another month */
       count(distinct b.cust) as Overlap,
       (calculated Total - calculated Overlap) as New
from have a left join have b
on a.cust=b.cust and a.month ne b.month
group by a.month
order by a.month;
quit;

Reeza · Posted 01-07-2020 02:14 PM

My concern with a full SQL approach with self joins is that it won't scale well with a huge data set....
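
A minimal sketch of one way to avoid the self join, assuming one extra pass to count distinct months per customer is affordable. A customer overlaps exactly when it appears in more than one month, so the per-customer month count can be computed once and joined back on cust alone. The table name cust_months and the column n_months are made up for illustration, and this has not been benchmarked against the self join.

```sas
proc sql;
  /* Step 1: one row per customer with the number of distinct months it appears in */
  create table cust_months as
  select cust, count(distinct month) as n_months
  from have
  group by cust;

  /* Step 2: join back on cust only and aggregate per month.          */
  /* A blank character value is missing, so COUNT ignores the ELSE branch. */
  create table want as
  select h.month,
         count(distinct h.cust) as Total,
         count(distinct case when c.n_months > 1 then h.cust else ' ' end) as Overlap,
         count(distinct case when c.n_months = 1 then h.cust else ' ' end) as New
  from have as h
       inner join cust_months as c
       on h.cust = c.cust
  group by h.month;
quit;
```

The join here is a plain equi-join against a table with one row per customer, so it cannot multiply rows the way the month-inequality self join can.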

novinosrin · Posted 01-07-2020 02:24 PM

I agree with you, @Reeza. SQL is the ready-meal, convenience option, at least for this solution. But because the join condition combines an equality filter (a.cust=b.cust) with an inequality filter (a.month ne b.month), the SQL optimizer should choose the sort-merge join algorithm (MAGIC=102) and optimize accordingly.

Of course, I am certain a more programmatic solution will likely give you better performance, but for a long and narrow dataset I am holding faith that this approach should suffice, though your point is well taken 🙂
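
One way to check which join algorithm the optimizer actually picked is the PROC SQL _method option, which writes the execution plan to the log. This is only a sketch, reusing the have table and the self-join query. In the plan, sqxjm should indicate a sort-merge join and sqxjhsh a hash join. If a different method shows up, adding the undocumented magic=102 option to the PROC SQL statement is the usual way to force a sort-merge join.

```sas
/* _method prints the chosen execution plan (join methods included) to the log */
proc sql _method;
  create table want as
  select a.month,
         count(distinct a.cust) as Total,
         count(distinct b.cust) as Overlap,
         (calculated Total - calculated Overlap) as New
  from have a left join have b
       on a.cust = b.cust and a.month ne b.month
  group by a.month
  order by a.month;
quit;
```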

Kalai2008 · Posted 01-07-2020 03:13 PM

Thank you, but performance is very slow and the code is still running.

novinosrin · Posted 01-07-2020 04:10 PM

Hi @Kalai2008, should you have enough memory, you could try a hash object:

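data want ;
 if _n_=1 then do;
   /* Define _month in the PDV at compile time; no rows are read here */
   if 0 then set have have(rename=(month=_month));
   /* Load every (cust, _month) pair; multidata keeps all months per customer */
   dcl hash H (dataset:'have(rename=(month=_month))',multidata:'y') ;
   h.definekey  ("cust") ;
   h.definedata ("_month") ;
   h.definedone () ;
 end;
 /* DoW loop: one month group per DATA step iteration. Total counts rows, */
 /* so this assumes one row per customer per month.                       */
 do Total=1 by 1 until(last.month);
  set have;
  by month notsorted;
  /* Walk this customer's months and stop at the first one that differs */
  do rc=h.find() by 0 while(rc=0);
   if month ne _month then do; Overlap=sum(Overlap,1);leave;end;
   rc=h.find_next();
  end;
 end;
 /* Overlap stays missing when nobody in the month repeats; treat that as 0 */
 Overlap=sum(Overlap,0);
 New=Total-Overlap;
 drop rc _month cust;
run;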

Kalai2008 · Posted 01-08-2020 08:58 AM

Awesome ....Thank you. It worked..

Ksharp · Posted 01-08-2020 07:00 AM

Assuming the data has been sorted by month:

data have;
input month $ cust $;
cards;
jan cust1
jan cust2
jan cust3
jan cust4
jan cust5
feb cust1
feb cust3
feb cust4
mar cust3
mar cust4
mar cust10
;
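proc sort data=have;by month;run;
proc sql;
/* Customers seen in exactly one month: these are the "new" ones */
create table cust as
 select distinct cust from have group by cust having count(distinct month)=1;
/* Distinct customers per month */
create table month as
 select month,count(distinct cust) as total from have group by month;
quit;

data temp;
 if _n_=1 then do;
  if 0 then set month;
  /* Lookup table of single-month customers */
  declare hash h(dataset:'cust',hashexp:20);
  h.definekey('cust');
  h.definedone();
 end;
 set have;
 by month;
 if first.month then new=0;
 /* check() returns 0 when cust is found in the hash */
 if h.check()=0 then new+1;
 /* Output one row per month */
 if last.month;
 keep month new;
run;

data want;
 merge month temp;
 by month ;
 /* Every customer who is not new appears in more than one month */
 overlap=total-new;
run;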

Kalai2008 · Posted 01-08-2020 08:57 AM

Thank you.
