Solved: Re: Count instances between two dates with criteria imposed

A_Swoosh · Posted 10-22-2021 03:34 PM

Hi all,

I have 3 datasets:

1. STORE INFO

2. STORE TRANSACTIONS

3. CUSTOMER INFO

I'm attempting to combine all datasets to produce a final dataset at the store-level that combines all relevant information to show:

proc sql;
  create view store_open_close as
  select 
    l.store_id,
    l.zip_cd,
    min(l.year) as open_year,
    max(l.year) as close_year,
    min(r.transaction_date) as min_transaction_date format=date9.,
    max(r.transaction_date) as max_transaction_date format=date9.
  from 
    have0 l
    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)
  group by l.store_id, l.zip_cd
  ;
quit;

Using the three datasets below:

/*STORE INFORMATION*/
data have0;
  infile cards truncover expandtabs;
  input STORE_ID $ zip_cd year;
  format zip_cd z5.;
  cards;
A0001 99993 2010 
A0001 99993 2011 
A0001 99994 2012 
A0001 99994 2013 
A0001 99994 2014 
A0001 99995 2015 
A0002 33999 2011 
A0002 33999 2012 
A0002 33999 2013 
A0002 33995 2014 
A0002 33993 2015 
A0003 33999 2010 
A0003 33999 2011 
A0003 33999 2012 
A0003 33999 2013 
A0003 33999 2014 
A0003 33999 2015 
A0004 33999 2014 
A0004 33999 2015 
A0005 33999 2011 
A0005 33999 2012 
A0005 33995 2013 
;
/*TRANSACTION INFORMATION*/
data have1;
  infile cards truncover expandtabs;
  input STORE_ID $ customer_id $ transaction_id $ transaction_date :date9.;
  format transaction_date date8.;
  cards;
A0001 aa 1 22jan2010  
A0001 aa 2 22jan2010  
A0001 ay 3 22jan2010  
A0001 ab 4 11Oct2010  
A0001 cc 5 16jun2011  
A0001 dd 6 19apr2012  
A0001 ff 7 04dec2012  
A0001 ee 8 10feb2013  
A0001 ee 9 10feb2013  
A0001 ef 10 09nov2014  
A0001 ff 11 04dec2014  
A0001 ff 12 07dec2014  
A0001 gg 13 22Oct2015  
A0001 ds 14 30nov2015  
A0002 ad 1 11jun2011  
A0002 aq 2 11apr2014  
A0002 qr 3 11feb2015  
A0003 er 1 09nov2010  
A0003 hh 2 04dec2011  
A0003 hh 3 22jan2012  
A0003 hd 4 11Oct2012  
A0003 sd 5 16jun2013  
A0003 cv 6 19apr2014  
A0003 nb 7 10feb2015  
A0003 wr 8 09nov2015  
A0003 yt 9 04dec2015  
A0003 ui 10 04dec2015  
A0004 ol 1 09nov2014  
A0004 rr 2 04dec2014  
A0004 gh 3 04dec2015  
;
/*CUSTOMER INFORMATION*/
data have2;
  infile cards truncover expandtabs;
  input customer_id $ customer_race $ ;
  cards;
aa white
ay white
ab black
cc latino
dd other
ee white
ef white
ff white
gg black
ds black
ad black 
aq latino
qr latino
er latino
hh other
hd latino
sd other
cv other
nb other
wr white
yt white
ui white
ol white
rr white
gh black
;

I want to create a final dataset that has 1 line per STORE_ID, ZIP_CODE that contains extra information such as:

number of transactions that took place while open (between min/max transaction dates)
number of distinct customers overall while open (between min/max transaction dates)
number of distinct customers by race categories while open (between min/max transaction dates)

The issue I encounter is two-fold using the code below:

1. I get multiple rows, 1 for each year sums instead of 1 row for the entire period they were open.

2. To get distinct customers and by race, I cannot employ this method because the same customer will show up in two+ years potentially (e.g., once in 2012 and again in 2013).

proc sql;
	create table long as
	select
	l.*,
	count(distinct r.transaction_id) as no_transactions,  
	count(distinct r.customer_id) as no_customers_overall,
	count(distinct case when (t.customer_race='white' and t.customer_id ne '') then t.customer_id end) as no_white,
	count(case when (t.customer_race='black') then t.customer_race end) as no_black,
	count(case when (t.customer_race='latino') then t.customer_race end) as no_latino,
	count(case when (t.customer_race='other') then t.customer_race end) as no_other,
	s.open_year,
	s.close_year,
	s.min_transaction_date,
	s.max_transaction_date
	from have0 l

	left join
	have1 r
	on l.store_id=r.store_id and l.year=year(r.transaction_date)

	left join
	have2 t
	on r.customer_id=t.customer_id 

	left join 
	store_open_close s
	on l.store_id=s.store_id and l.zip_cd=s.zip_cd

	group by 
	l.store_id, l.zip_cd, l.year, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date
	;
quit;

If someone could help troubleshoot my code or point out where I messed up, it would be helpful. Thank you!

Patrick · Posted 10-22-2021 09:55 PM

Does below return your desired result?.

Change to your SQL: Year removed from the group by clause and the wild card in the select clause replaced with an explicit list of variables.

With Group By: In the select clause don't use a wild card and only use variables that are either part of the group by clause or calculated using some aggregation function like count(), sum(), etc.

proc sql;
  create table long as
  select 
/*  l.*,*/
  l.store_id,
  l.zip_cd,
  count(distinct r.transaction_id) as no_transactions,  
  count(distinct r.customer_id) as no_customers_overall,
  count(distinct case when (t.customer_race='white' and t.customer_id ne '') then t.customer_id end) as no_white,
  count(case when (t.customer_race='black') then t.customer_race end) as no_black,
  count(case when (t.customer_race='latino') then t.customer_race end) as no_latino,
  count(case when (t.customer_race='other') then t.customer_race end) as no_other,
  s.open_year,
  s.close_year,
  s.min_transaction_date,
  s.max_transaction_date
  from have0 l

  left join
  have1 r
  on l.store_id=r.store_id and l.year=year(r.transaction_date)

  left join
  have2 t
  on r.customer_id=t.customer_id 

  left join 
  store_open_close s
  on l.store_id=s.store_id and l.zip_cd=s.zip_cd

  group by 
    l.store_id, 
    l.zip_cd, 
/*    l.year, */
    s.open_year, 
    s.close_year, 
    s.min_transaction_date, 
    s.max_transaction_date
  ;
quit;

View solution in original post

qatman28 · Posted 10-22-2021 06:00 PM

The simplest way to do this is expand all the data first, then summarize as needed.

I would first merge have1 and have2 to 'have12' to pull in the ethnicity, and during that step also create year variable based on the transaction date. Then you can merge have0 and have12 by store_id and year to get a full list of all transactions including all necessary variables. Then it's just a matter of using proc freq or proc means to summarize the data over your desired categories.

A_Swoosh · Posted 10-22-2021 07:32 PM

To that end, the proc sql code above accomplishes that. It's just the proc freq/means to summarize the data following the proc sql is where we differ, and to that point would not do what I am intended to accomplish in terms of the two issues I encounter.

A_Swoosh · Posted 10-23-2021 08:39 PM

Thanks @Patrick and @qatman28 .

Both the solutions worked.

Patrick · Posted 10-22-2021 09:55 PM

Does below return your desired result?.

Change to your SQL: Year removed from the group by clause and the wild card in the select clause replaced with an explicit list of variables.

With Group By: In the select clause don't use a wild card and only use variables that are either part of the group by clause or calculated using some aggregation function like count(), sum(), etc.

proc sql;
  create table long as
  select 
/*  l.*,*/
  l.store_id,
  l.zip_cd,
  count(distinct r.transaction_id) as no_transactions,  
  count(distinct r.customer_id) as no_customers_overall,
  count(distinct case when (t.customer_race='white' and t.customer_id ne '') then t.customer_id end) as no_white,
  count(case when (t.customer_race='black') then t.customer_race end) as no_black,
  count(case when (t.customer_race='latino') then t.customer_race end) as no_latino,
  count(case when (t.customer_race='other') then t.customer_race end) as no_other,
  s.open_year,
  s.close_year,
  s.min_transaction_date,
  s.max_transaction_date
  from have0 l

  left join
  have1 r
  on l.store_id=r.store_id and l.year=year(r.transaction_date)

  left join
  have2 t
  on r.customer_id=t.customer_id 

  left join 
  store_open_close s
  on l.store_id=s.store_id and l.zip_cd=s.zip_cd

  group by 
    l.store_id, 
    l.zip_cd, 
/*    l.year, */
    s.open_year, 
    s.close_year, 
    s.min_transaction_date, 
    s.max_transaction_date
  ;
quit;

Count instances between two dates with criteria imposed

Re: Count instances between two dates with criteria imposed

Re: Count instances between two dates with criteria imposed

Re: Count instances between two dates with criteria imposed

Re: Count instances between two dates with criteria imposed

Re: Count instances between two dates with criteria imposed

Ready to join fellow brilliant minds for the SAS Hackathon?

Click image to register for webinar

Classroom Training Available!