BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
A_Swoosh
Quartz | Level 8

Hi all,

 

I'm trying to create a dataset that combines information from 2 datasets.The first dataset contains the information for the store (store number, location zip code, and year in operation). In this dataset, I encounter a situation that I'm not quite sure how to resolve. For example, I have a store that is operating in a zip code then it moves location the following year, and so forth, possibly moving upwards of 5 times over the 5 years of data I have.

 

I am trying to:

1. flag each instance where the store moves/closes by the year (e.g., open_2011, close_2012)

  • My data spans from 2010-2015 and I use a 1-month period to determine if open/closed before 2010 and 2015 based on the transaction dataset since I don't have data to tell when it opened/closed before the study period (e.g., if max(transaction_date) le '30-NOV-2015'd; if min(transaction_date) ge '01-FEB-2010'd)

2. Create a unique id for each store_id that moves (e.g., A0001_A, A0001_B, and so on...)

 

STORE INFO DATASET EXAMPLE:

data have0;
 infile cards truncover expandtabs;
 input STORE_ID $ zip_cd year ;
 format zip_cd z5. ;
 cards;
A0001 99993 2010 
A0001 99993 2011 
A0001 99994 2012 
A0001 99994 2013 
A0001 99994 2014 
A0001 99995 2015 
A0002 33999 2011 
A0002 33999 2012 
A0002 33999 2013 
A0002 33995 2014 
A0002 33993 2015 
A0003 33999 2010 
A0003 33999 2011 
A0003 33999 2012 
A0003 33999 2013 
A0003 33999 2014 
A0003 33999 2015 
A0004 33999 2014 
A0004 33999 2015 
A0005 33999 2011 
A0005 33999 2012 
A0005 33995 2013 
;
run;

STORE TRANSACTION DATASET:

data have1;
 infile cards truncover expandtabs;
 input STORE_ID $ transaction_id $ transaction_date :date9. ;
 format transaction_date date8. ;
 cards;
A0001 1 22jan2010  
A0001 2 11Oct2010  
A0001 3 16jun2011  
A0001 4 19apr2012  
A0001 5 10feb2013  
A0001 6 09nov2014  
A0001 7 04dec2014  
A0001 8 07dec2014  
A0001 9 22Oct2015  
A0001 10 30nov2015  
A0002 1 11jun2011  
A0002 2 11apr2014  
A0002 3 11feb2015  
A0003 1 09nov2010  
A0003 2 04dec2011  
A0003 3 22jan2012  
A0003 4 11Oct2012  
A0003 5 16jun2013  
A0003 6 19apr2014  
A0003 7 10feb2015  
A0003 8 09nov2015  
A0003 9 04dec2015  
A0003 10 04dec2015  
A0004 1 09nov2014  
A0004 2 04dec2014  
A0004 3 04dec2015  
;
run;

3. Merge back to the store transactions dataset to count the number of transactions that occurred at that store while it was open and before it closed/moved.

And get something like this:

data want;
 infile cards truncover expandtabs;
 input STORE_ID_unique $ zip_cd no_transactions open_2010 open_2011 open_2012 open_2013 open_2014 open_2015 close_2010 close_2011 close_2012 close_2013 close_2014 close_2015 ;
 format  zip_cd z5.;
 cards;
A0001_A 99993 3 . 0 0 0 0 0 0 1 0 0 0 0
A0001_B 99994 5 0 0 1 0 0 0 0 0 0 0 1 0
A0001_C 99995 2 0 0 0 0 0 1 0 0 0 0 0 .
A0002_A 33999 1 0 1 0 0 0 0 0 0 0 1 0 0
A0002_B 33995 1 0 0 0 0 1 0 0 0 0 0 1 0
A0002_C 33993 1 0 0 0 0 0 1 0 0 0 0 0 .
A0003 33999 10 . 0 0 0 0 0 0 0 0 0 0 .
A0004 33999 3 0 0 0 0 1 0 0 0 0 0 0 .
A0005_A 33999 0 0 1 0 0 0 0 0 0 1 0 0 0
A0005_B 33995 0 0 0 0 1 0 0 0 0 0 0 0 .
;
run;

I'm a little stuck on the correct procedures to accomplish this. If I'm not mistaken I should create a counter variable and/or concatenate a unique ID based on the store_id and counter_variable then create a conditional logic statement to consider each of these cases?

 

Any help would be appreciated.

 

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

Below should work.


proc sql;
  create view store_open_close as
  select 
    l.store_id,
    l.zip_cd,
    min(l.year) as open_year,
    max(l.year) as close_year,
    min(r.transaction_date) as min_transaction_date format=date9.,
    max(r.transaction_date) as max_transaction_date format=date9.
  from 
    have0 l
    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)
  group by l.store_id, l.zip_cd
  ;
quit;

proc sql;
  create table long as
  select
    l.*,
    count(*) as n_transactions,
    s.open_year,
    s.close_year,
    s.min_transaction_date,
    s.max_transaction_date
  from 
    have0 l

    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)

    left join
    store_open_close s
      on l.store_id=s.store_id and l.zip_cd=s.zip_cd
    
  group by
    l.store_id, l.zip_cd, l.year, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date
  ;
quit;

proc transpose data=long out=wide(drop=_name_) prefix=year_;
  by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
  id year;
  var n_transactions;
run;

Btw: If your source tables reside in a database and you're dealing with high volume data then above code needs some change in order for SAS to push all of the processing to the database and only transfer the result set back to SAS.

View solution in original post

7 REPLIES 7
Patrick
Opal | Level 21

Below close to what you've asked for. Have a look if this serves your purpose.

As a comment: It's often easier to work with a long data structure instead of a wide one.

data have0;
  infile cards truncover expandtabs;
  input STORE_ID $ zip_cd year;
  format zip_cd z5.;
  cards;
A0001 99993 2010 
A0001 99993 2011 
A0001 99994 2012 
A0001 99994 2013 
A0001 99994 2014 
A0001 99995 2015 
A0002 33999 2011 
A0002 33999 2012 
A0002 33999 2013 
A0002 33995 2014 
A0002 33993 2015 
A0003 33999 2010 
A0003 33999 2011 
A0003 33999 2012 
A0003 33999 2013 
A0003 33999 2014 
A0003 33999 2015 
A0004 33999 2014 
A0004 33999 2015 
A0005 33999 2011 
A0005 33999 2012 
A0005 33995 2013 
;

data have1;
  infile cards truncover expandtabs;
  input STORE_ID $ transaction_id $ transaction_date :date9.;
  format transaction_date date8.;
  cards;
A0001 1 22jan2010  
A0001 2 11Oct2010  
A0001 3 16jun2011  
A0001 4 19apr2012  
A0001 5 10feb2013  
A0001 6 09nov2014  
A0001 7 04dec2014  
A0001 8 07dec2014  
A0001 9 22Oct2015  
A0001 10 30nov2015  
A0002 1 11jun2011  
A0002 2 11apr2014  
A0002 3 11feb2015  
A0003 1 09nov2010  
A0003 2 04dec2011  
A0003 3 22jan2012  
A0003 4 11Oct2012  
A0003 5 16jun2013  
A0003 6 19apr2014  
A0003 7 10feb2015  
A0003 8 09nov2015  
A0003 9 04dec2015  
A0003 10 04dec2015  
A0004 1 09nov2014  
A0004 2 04dec2014  
A0004 3 04dec2015  
;

proc sql;
  create view store_open_close as
  select 
    store_id,
    zip_cd,
    min(year) as open_year,
    max(year) as close_year
  from have0
  group by store_id, zip_cd
  ;
quit;

proc sql;
  create table long as
  select
    l.*,
    count(*) as n_transactions,
    s.open_year,
    s.close_year
  from 
    have0 l

    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)

    left join
    store_open_close s
      on l.store_id=s.store_id and l.zip_cd=s.zip_cd
    
  group by
    l.store_id, l.zip_cd, l.year, s.open_year, s.close_year
  ;
quit;

proc transpose data=long out=wide(drop=_name_) prefix=year_;
  by store_id zip_cd open_year close_year;
  id year;
  var n_transactions;
run;
A_Swoosh
Quartz | Level 8
Hi, thank you for the reply. This gets close to what I'm looking for. Is there also a way to specify for each row in the wide dataset to create a variable that indicates the min(transaction_date) as the open date and max(transaction_date) as final date?
Patrick
Opal | Level 21

Below should work.


proc sql;
  create view store_open_close as
  select 
    l.store_id,
    l.zip_cd,
    min(l.year) as open_year,
    max(l.year) as close_year,
    min(r.transaction_date) as min_transaction_date format=date9.,
    max(r.transaction_date) as max_transaction_date format=date9.
  from 
    have0 l
    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)
  group by l.store_id, l.zip_cd
  ;
quit;

proc sql;
  create table long as
  select
    l.*,
    count(*) as n_transactions,
    s.open_year,
    s.close_year,
    s.min_transaction_date,
    s.max_transaction_date
  from 
    have0 l

    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)

    left join
    store_open_close s
      on l.store_id=s.store_id and l.zip_cd=s.zip_cd
    
  group by
    l.store_id, l.zip_cd, l.year, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date
  ;
quit;

proc transpose data=long out=wide(drop=_name_) prefix=year_;
  by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
  id year;
  var n_transactions;
run;

Btw: If your source tables reside in a database and you're dealing with high volume data then above code needs some change in order for SAS to push all of the processing to the database and only transfer the result set back to SAS.

A_Swoosh
Quartz | Level 8
Thank you for the reply. This is exactly what I'm trying to accomplish. As for reference, I am dealing with high volume data (n= tens of millions) for the transaction data but I'm not sure what would need fixing?
A_Swoosh
Quartz | Level 8

Actually, for the final store 'A0005' the number of transactions should be 0 instead?

And, I encountered an error on the last step that said "ERROR: the id value "year_2015" occurs twice in the same BY group. I suppose it's for a case when the open/close date fall under the same year.

Patrick
Opal | Level 21

@A_Swoosh wrote:

Actually, for the final store 'A0005' the number of transactions should be 0 instead?

And, I encountered an error on the last step that said "ERROR: the id value "year_2015" occurs twice in the same BY group. I suppose it's for a case when the open/close date fall under the same year.


@A_Swoosh Try below.


proc sql;
  create view store_open_close as
  select 
    l.store_id,
    l.zip_cd,
    min(l.year) as open_year,
    max(l.year) as close_year,
    min(r.transaction_date) as min_transaction_date format=date9.,
    max(r.transaction_date) as max_transaction_date format=date9.
  from 
    have0 l
    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)
  group by l.store_id, l.zip_cd
  ;
quit;

proc sql;
  create table long as
  select 
    l.store_id,
    l.zip_cd,
    l.year,
    sum(case when missing(r.transaction_date) then 0 else 1 end) as n_transactions,
    s.open_year,
    s.close_year,
    s.min_transaction_date,
    s.max_transaction_date
  from 
    have0 l

    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)

    left join
    store_open_close s
      on l.store_id=s.store_id and l.zip_cd=s.zip_cd
    
  group by
    l.store_id, l.zip_cd, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date,
    l.year
  ;
quit;

proc transpose data=long out=wide(drop=_name_) prefix=year_;
  by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
  id year;
  var n_transactions;
run;
A_Swoosh
Quartz | Level 8

@Patrick 

that didn't seem to fix it. I still get the same error "the id value year_2015' occurs twice in the same BY group. Additionally, I'm getting duplicate rows for each of these store_id and zip_cd.

 

I'm fine keeping it as a long dataset then I can just proc sql by store_id and zip_cd to count the number of transactions, and create any other indicators I need to distinguish these changes.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2249 views
  • 2 likes
  • 2 in conversation