Hi all,
I'm trying to create a dataset that combines information from 2 datasets.The first dataset contains the information for the store (store number, location zip code, and year in operation). In this dataset, I encounter a situation that I'm not quite sure how to resolve. For example, I have a store that is operating in a zip code then it moves location the following year, and so forth, possibly moving upwards of 5 times over the 5 years of data I have.
I am trying to:
1. flag each instance where the store moves/closes by the year (e.g., open_2011, close_2012)
2. Create a unique id for each store_id that moves (e.g., A0001_A, A0001_B, and so on...)
STORE INFO DATASET EXAMPLE:
data have0;
infile cards truncover expandtabs;
input STORE_ID $ zip_cd year ;
format zip_cd z5. ;
cards;
A0001 99993 2010
A0001 99993 2011
A0001 99994 2012
A0001 99994 2013
A0001 99994 2014
A0001 99995 2015
A0002 33999 2011
A0002 33999 2012
A0002 33999 2013
A0002 33995 2014
A0002 33993 2015
A0003 33999 2010
A0003 33999 2011
A0003 33999 2012
A0003 33999 2013
A0003 33999 2014
A0003 33999 2015
A0004 33999 2014
A0004 33999 2015
A0005 33999 2011
A0005 33999 2012
A0005 33995 2013
;
run;
STORE TRANSACTION DATASET:
data have1;
infile cards truncover expandtabs;
input STORE_ID $ transaction_id $ transaction_date :date9. ;
format transaction_date date8. ;
cards;
A0001 1 22jan2010
A0001 2 11Oct2010
A0001 3 16jun2011
A0001 4 19apr2012
A0001 5 10feb2013
A0001 6 09nov2014
A0001 7 04dec2014
A0001 8 07dec2014
A0001 9 22Oct2015
A0001 10 30nov2015
A0002 1 11jun2011
A0002 2 11apr2014
A0002 3 11feb2015
A0003 1 09nov2010
A0003 2 04dec2011
A0003 3 22jan2012
A0003 4 11Oct2012
A0003 5 16jun2013
A0003 6 19apr2014
A0003 7 10feb2015
A0003 8 09nov2015
A0003 9 04dec2015
A0003 10 04dec2015
A0004 1 09nov2014
A0004 2 04dec2014
A0004 3 04dec2015
;
run;
3. Merge back to the store transactions dataset to count the number of transactions that occurred at that store while it was open and before it closed/moved.
And get something like this:
data want;
infile cards truncover expandtabs;
input STORE_ID_unique $ zip_cd no_transactions open_2010 open_2011 open_2012 open_2013 open_2014 open_2015 close_2010 close_2011 close_2012 close_2013 close_2014 close_2015 ;
format zip_cd z5.;
cards;
A0001_A 99993 3 . 0 0 0 0 0 0 1 0 0 0 0
A0001_B 99994 5 0 0 1 0 0 0 0 0 0 0 1 0
A0001_C 99995 2 0 0 0 0 0 1 0 0 0 0 0 .
A0002_A 33999 1 0 1 0 0 0 0 0 0 0 1 0 0
A0002_B 33995 1 0 0 0 0 1 0 0 0 0 0 1 0
A0002_C 33993 1 0 0 0 0 0 1 0 0 0 0 0 .
A0003 33999 10 . 0 0 0 0 0 0 0 0 0 0 .
A0004 33999 3 0 0 0 0 1 0 0 0 0 0 0 .
A0005_A 33999 0 0 1 0 0 0 0 0 0 1 0 0 0
A0005_B 33995 0 0 0 0 1 0 0 0 0 0 0 0 .
;
run;
I'm a little stuck on the correct procedures to accomplish this. If I'm not mistaken I should create a counter variable and/or concatenate a unique ID based on the store_id and counter_variable then create a conditional logic statement to consider each of these cases?
Any help would be appreciated.
Below should work.
proc sql;
create view store_open_close as
select
l.store_id,
l.zip_cd,
min(l.year) as open_year,
max(l.year) as close_year,
min(r.transaction_date) as min_transaction_date format=date9.,
max(r.transaction_date) as max_transaction_date format=date9.
from
have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
group by l.store_id, l.zip_cd
;
quit;
proc sql;
create table long as
select
l.*,
count(*) as n_transactions,
s.open_year,
s.close_year,
s.min_transaction_date,
s.max_transaction_date
from
have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
left join
store_open_close s
on l.store_id=s.store_id and l.zip_cd=s.zip_cd
group by
l.store_id, l.zip_cd, l.year, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date
;
quit;
proc transpose data=long out=wide(drop=_name_) prefix=year_;
by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
id year;
var n_transactions;
run;
Btw: If your source tables reside in a database and you're dealing with high volume data then above code needs some change in order for SAS to push all of the processing to the database and only transfer the result set back to SAS.
Below close to what you've asked for. Have a look if this serves your purpose.
As a comment: It's often easier to work with a long data structure instead of a wide one.
data have0;
infile cards truncover expandtabs;
input STORE_ID $ zip_cd year;
format zip_cd z5.;
cards;
A0001 99993 2010
A0001 99993 2011
A0001 99994 2012
A0001 99994 2013
A0001 99994 2014
A0001 99995 2015
A0002 33999 2011
A0002 33999 2012
A0002 33999 2013
A0002 33995 2014
A0002 33993 2015
A0003 33999 2010
A0003 33999 2011
A0003 33999 2012
A0003 33999 2013
A0003 33999 2014
A0003 33999 2015
A0004 33999 2014
A0004 33999 2015
A0005 33999 2011
A0005 33999 2012
A0005 33995 2013
;
data have1;
infile cards truncover expandtabs;
input STORE_ID $ transaction_id $ transaction_date :date9.;
format transaction_date date8.;
cards;
A0001 1 22jan2010
A0001 2 11Oct2010
A0001 3 16jun2011
A0001 4 19apr2012
A0001 5 10feb2013
A0001 6 09nov2014
A0001 7 04dec2014
A0001 8 07dec2014
A0001 9 22Oct2015
A0001 10 30nov2015
A0002 1 11jun2011
A0002 2 11apr2014
A0002 3 11feb2015
A0003 1 09nov2010
A0003 2 04dec2011
A0003 3 22jan2012
A0003 4 11Oct2012
A0003 5 16jun2013
A0003 6 19apr2014
A0003 7 10feb2015
A0003 8 09nov2015
A0003 9 04dec2015
A0003 10 04dec2015
A0004 1 09nov2014
A0004 2 04dec2014
A0004 3 04dec2015
;
proc sql;
create view store_open_close as
select
store_id,
zip_cd,
min(year) as open_year,
max(year) as close_year
from have0
group by store_id, zip_cd
;
quit;
proc sql;
create table long as
select
l.*,
count(*) as n_transactions,
s.open_year,
s.close_year
from
have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
left join
store_open_close s
on l.store_id=s.store_id and l.zip_cd=s.zip_cd
group by
l.store_id, l.zip_cd, l.year, s.open_year, s.close_year
;
quit;
proc transpose data=long out=wide(drop=_name_) prefix=year_;
by store_id zip_cd open_year close_year;
id year;
var n_transactions;
run;
Below should work.
proc sql;
create view store_open_close as
select
l.store_id,
l.zip_cd,
min(l.year) as open_year,
max(l.year) as close_year,
min(r.transaction_date) as min_transaction_date format=date9.,
max(r.transaction_date) as max_transaction_date format=date9.
from
have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
group by l.store_id, l.zip_cd
;
quit;
proc sql;
create table long as
select
l.*,
count(*) as n_transactions,
s.open_year,
s.close_year,
s.min_transaction_date,
s.max_transaction_date
from
have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
left join
store_open_close s
on l.store_id=s.store_id and l.zip_cd=s.zip_cd
group by
l.store_id, l.zip_cd, l.year, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date
;
quit;
proc transpose data=long out=wide(drop=_name_) prefix=year_;
by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
id year;
var n_transactions;
run;
Btw: If your source tables reside in a database and you're dealing with high volume data then above code needs some change in order for SAS to push all of the processing to the database and only transfer the result set back to SAS.
Actually, for the final store 'A0005' the number of transactions should be 0 instead?
And, I encountered an error on the last step that said "ERROR: the id value "year_2015" occurs twice in the same BY group. I suppose it's for a case when the open/close date fall under the same year.
@A_Swoosh wrote:
Actually, for the final store 'A0005' the number of transactions should be 0 instead?
And, I encountered an error on the last step that said "ERROR: the id value "year_2015" occurs twice in the same BY group. I suppose it's for a case when the open/close date fall under the same year.
@A_Swoosh Try below.
proc sql;
create view store_open_close as
select
l.store_id,
l.zip_cd,
min(l.year) as open_year,
max(l.year) as close_year,
min(r.transaction_date) as min_transaction_date format=date9.,
max(r.transaction_date) as max_transaction_date format=date9.
from
have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
group by l.store_id, l.zip_cd
;
quit;
proc sql;
create table long as
select
l.store_id,
l.zip_cd,
l.year,
sum(case when missing(r.transaction_date) then 0 else 1 end) as n_transactions,
s.open_year,
s.close_year,
s.min_transaction_date,
s.max_transaction_date
from
have0 l
left join
have1 r
on l.store_id=r.store_id and l.year=year(r.transaction_date)
left join
store_open_close s
on l.store_id=s.store_id and l.zip_cd=s.zip_cd
group by
l.store_id, l.zip_cd, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date,
l.year
;
quit;
proc transpose data=long out=wide(drop=_name_) prefix=year_;
by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
id year;
var n_transactions;
run;
that didn't seem to fix it. I still get the same error "the id value year_2015' occurs twice in the same BY group. Additionally, I'm getting duplicate rows for each of these store_id and zip_cd.
I'm fine keeping it as a long dataset then I can just proc sql by store_id and zip_cd to count the number of transactions, and create any other indicators I need to distinguish these changes.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.