Solved: Need Help with Summing Distinct in a Database

jackie32 · Posted 09-18-2014 01:56 PM

Hi SAS Community,

I had a question for you- I tried to look through to forums to see if something similar has been answered before but couldn't find anything- I apologize if it has! I'm completely new to SAS and have just been figuring things out as I go along and have run into a bit of a road-bump. I've posted the details below.

So I have a dataset that looks like the following:

Customer ID	Year	Supplier
12014	2001	88888
12014	2001	88888
22222	2002	88888
11111	2001	99999
41243	2003	99999
11111	2004	99999
41531	2004	88888

What I want to do is sum the number of unique customers that each supplier has by year, to get a database that looks like the following:

Supplier	Year	# of Unique Customers
88888	2001	1
88888	2002	1
88888	2003	0
88888	2004	1
99999	2001	1
99999	2002	0
99999	2003	1
99999	2004	1

If a customer comes one year and again the next, they would count as unique. If they make multiple purchases in one year, they are not considered unique. I have really no idea how to get from the data I have to the data I want. Any advice would be much appreciated! Thank you.

ballardw · Posted 09-18-2014 02:31 PM

You don't say if you want a dataset or output. This will create output, prefix with a "create table <name> as" before the first select. "Have" represents the name of your current dataset.

proc sql;

select supplier, year, count(*) as customers

from (select distinct supplier,year, customerid from have)

group by supplier, year;

quit;

View solution in original post

ballardw · Posted 09-18-2014 02:31 PM

You don't say if you want a dataset or output. This will create output, prefix with a "create table <name> as" before the first select. "Have" represents the name of your current dataset.

proc sql;

select supplier, year, count(*) as customers

from (select distinct supplier,year, customerid from have)

group by supplier, year;

quit;

stat_sas · Posted 09-18-2014 02:33 PM

data have;
input CustomerID Year Supplier;
datalines;
12014 2001 88888
12014 2001 88888
22222 2002 88888
11111 2001 99999
41243 2003 99999
11111 2004 99999
41531 2004 88888
;

data base;
do supplier=88888 to 99999 by 11111;
do year=2001 to 2004;
output;
end;
end;
run;

proc sql;
create table want as
select a.supplier,a.year,count(distinct b.customerid) as unique_customer from base a
left join have b
on a.supplier=b.supplier
and a.year=b.year
group by a.supplier,a.year;
quit;

PGStats · Posted 09-18-2014 02:53 PM

The only problem for solving this with SQL is getting zeros for suppliers/years without data. It can be done this way:

proc sql;

create table want as

select s.supplier, y.year, coalesce(c.nbCust, 0) as nbCust

from

(select distinct supplier from have) as s cross join

(select distinct year from have) as y left join

(select supplier, year, count(distinct customerid) as nbCust

from have

group by supplier, year) as c

on s.supplier=c.supplier and y.year=c.year;

select * from want;

quit;

PG

Haikuo · Posted 09-18-2014 02:58 PM

Ok, the more, the merrier. Freshly learned from ,

data have;

input Customer_ID Year Supplier;

cards;

12014 2001 88888

22222 2002 88888

11111 2001 99999

41243 2003 99999

11111 2004 99999

41531 2004 88888

;

proc sql;

create table want as

select a.supplier, a.year, count(distinct b.customer_id) as ct from

(select distinct * from have (keep=supplier) , have (keep=year) ) a

natural left join have b

group by a.supplier,a.year

order by a.supplier,a.year

;

quit;

Haikuo

PGStats · Posted 09-18-2014 03:09 PM

Hi , the crossproduct join

select distinct * from have (keep=supplier) , have (keep=year)

might generate a large table before being shrunk by the distinct clause (unless it is somehow optimized out). I prefer to apply distinct before doing the crossproduct.

PG

Haikuo · Posted 09-18-2014 03:12 PM

Good Point, PG!

Update: if anyone is interested, PG is saying the above statement is better off replace by the following:

Select * from (select distinct supplier from have), (select distinct year from have)

Reeza · Posted 09-18-2014 03:28 PM

I've always been a fan of the double proc freq, just because its easy to understand

proc freq data=have noprint;

table supplier*year*customer_id/out=temp;

run;

proc freq data=temp noprint;

table supplier*year/out=want sparse;

run;

Haikuo · Posted 09-18-2014 03:56 PM

Nice!

Thanks for sharing, Reeza!

jackie32 · Posted 09-18-2014 04:11 PM

Thank you all so much- all your input has been incredibly helpful and I greatly appreciate it!

Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

Re: Need Help with Summing Distinct in a Database

SAS Innovate 2025: Call for Content

Classroom Training Available!