Hi all,
I have 3 datasets:
1. STORE INFO
2. STORE TRANSACTIONS
3. CUSTOMER INFO
I'm attempting to combine all datasets to produce a final dataset at the store-level that combines all relevant information to show:
STORE|ZIP CODE|OPEN_YEAR|CLOSE_YEAR|FIRST_transaction_date|LAST_transaction_date
proc sql;
create view store_open_close as
select
l.store_id,
l.zip_cd,
min(l.year) as open_year,
max(l.year) as close_year,
min(r.transaction_date) as min_transaction_date format=date9.,
max(r.transaction_date) as max_transaction_date format=date9.
from
have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
group by l.store_id, l.zip_cd
;
quit;
Using the three datasets below:
/*STORE INFORMATION*/
data have0;
infile cards truncover expandtabs;
input STORE_ID $ zip_cd year;
format zip_cd z5.;
cards;
A0001 99993 2010
A0001 99993 2011
A0001 99994 2012
A0001 99994 2013
A0001 99994 2014
A0001 99995 2015
A0002 33999 2011
A0002 33999 2012
A0002 33999 2013
A0002 33995 2014
A0002 33993 2015
A0003 33999 2010
A0003 33999 2011
A0003 33999 2012
A0003 33999 2013
A0003 33999 2014
A0003 33999 2015
A0004 33999 2014
A0004 33999 2015
A0005 33999 2011
A0005 33999 2012
A0005 33995 2013
;
/*TRANSACTION INFORMATION*/
data have1;
infile cards truncover expandtabs;
input STORE_ID $ customer_id $ transaction_id $ transaction_date :date9.;
format transaction_date date8.;
cards;
A0001 aa 1 22jan2010
A0001 aa 2 22jan2010
A0001 ay 3 22jan2010
A0001 ab 4 11Oct2010
A0001 cc 5 16jun2011
A0001 dd 6 19apr2012
A0001 ff 7 04dec2012
A0001 ee 8 10feb2013
A0001 ee 9 10feb2013
A0001 ef 10 09nov2014
A0001 ff 11 04dec2014
A0001 ff 12 07dec2014
A0001 gg 13 22Oct2015
A0001 ds 14 30nov2015
A0002 ad 1 11jun2011
A0002 aq 2 11apr2014
A0002 qr 3 11feb2015
A0003 er 1 09nov2010
A0003 hh 2 04dec2011
A0003 hh 3 22jan2012
A0003 hd 4 11Oct2012
A0003 sd 5 16jun2013
A0003 cv 6 19apr2014
A0003 nb 7 10feb2015
A0003 wr 8 09nov2015
A0003 yt 9 04dec2015
A0003 ui 10 04dec2015
A0004 ol 1 09nov2014
A0004 rr 2 04dec2014
A0004 gh 3 04dec2015
;
/*CUSTOMER INFORMATION*/
data have2;
infile cards truncover expandtabs;
input customer_id $ customer_race $ ;
cards;
aa white
ay white
ab black
cc latino
dd other
ee white
ef white
ff white
gg black
ds black
ad black
aq latino
qr latino
er latino
hh other
hd latino
sd other
cv other
nb other
wr white
yt white
ui white
ol white
rr white
gh black
;
I want to create a final dataset that has 1 line per STORE_ID, ZIP_CODE that contains extra information such as:
The issue I encounter is two-fold using the code below:
1. I get multiple rows, 1 for each year sums instead of 1 row for the entire period they were open.
2. To get distinct customers and by race, I cannot employ this method because the same customer will show up in two+ years potentially (e.g., once in 2012 and again in 2013).
proc sql;
create table long as
select
l.*,
count(distinct r.transaction_id) as no_transactions,
count(distinct r.customer_id) as no_customers_overall,
count(distinct case when (t.customer_race='white' and t.customer_id ne '') then t.customer_id end) as no_white,
count(case when (t.customer_race='black') then t.customer_race end) as no_black,
count(case when (t.customer_race='latino') then t.customer_race end) as no_latino,
count(case when (t.customer_race='other') then t.customer_race end) as no_other,
s.open_year,
s.close_year,
s.min_transaction_date,
s.max_transaction_date
from have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
left join
have2 t
on r.customer_id=t.customer_id
left join
store_open_close s
on l.store_id=s.store_id and l.zip_cd=s.zip_cd
group by
l.store_id, l.zip_cd, l.year, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date
;
quit;
If someone could help troubleshoot my code or point out where I messed up, it would be helpful. Thank you!
Does below return your desired result?.
Change to your SQL: Year removed from the group by clause and the wild card in the select clause replaced with an explicit list of variables.
With Group By: In the select clause don't use a wild card and only use variables that are either part of the group by clause or calculated using some aggregation function like count(), sum(), etc.
proc sql;
create table long as
select
/* l.*,*/
l.store_id,
l.zip_cd,
count(distinct r.transaction_id) as no_transactions,
count(distinct r.customer_id) as no_customers_overall,
count(distinct case when (t.customer_race='white' and t.customer_id ne '') then t.customer_id end) as no_white,
count(case when (t.customer_race='black') then t.customer_race end) as no_black,
count(case when (t.customer_race='latino') then t.customer_race end) as no_latino,
count(case when (t.customer_race='other') then t.customer_race end) as no_other,
s.open_year,
s.close_year,
s.min_transaction_date,
s.max_transaction_date
from have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
left join
have2 t
on r.customer_id=t.customer_id
left join
store_open_close s
on l.store_id=s.store_id and l.zip_cd=s.zip_cd
group by
l.store_id,
l.zip_cd,
/* l.year, */
s.open_year,
s.close_year,
s.min_transaction_date,
s.max_transaction_date
;
quit;
The simplest way to do this is expand all the data first, then summarize as needed.
I would first merge have1 and have2 to 'have12' to pull in the ethnicity, and during that step also create year variable based on the transaction date. Then you can merge have0 and have12 by store_id and year to get a full list of all transactions including all necessary variables. Then it's just a matter of using proc freq or proc means to summarize the data over your desired categories.
Does below return your desired result?.
Change to your SQL: Year removed from the group by clause and the wild card in the select clause replaced with an explicit list of variables.
With Group By: In the select clause don't use a wild card and only use variables that are either part of the group by clause or calculated using some aggregation function like count(), sum(), etc.
proc sql;
create table long as
select
/* l.*,*/
l.store_id,
l.zip_cd,
count(distinct r.transaction_id) as no_transactions,
count(distinct r.customer_id) as no_customers_overall,
count(distinct case when (t.customer_race='white' and t.customer_id ne '') then t.customer_id end) as no_white,
count(case when (t.customer_race='black') then t.customer_race end) as no_black,
count(case when (t.customer_race='latino') then t.customer_race end) as no_latino,
count(case when (t.customer_race='other') then t.customer_race end) as no_other,
s.open_year,
s.close_year,
s.min_transaction_date,
s.max_transaction_date
from have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
left join
have2 t
on r.customer_id=t.customer_id
left join
store_open_close s
on l.store_id=s.store_id and l.zip_cd=s.zip_cd
group by
l.store_id,
l.zip_cd,
/* l.year, */
s.open_year,
s.close_year,
s.min_transaction_date,
s.max_transaction_date
;
quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.