Solved: Re: count logic counter variable

A_Swoosh · Posted 10-18-2021 02:33 PM

Hi all,

I'm trying to create a dataset that combines information from 2 datasets.The first dataset contains the information for the store (store number, location zip code, and year in operation). In this dataset, I encounter a situation that I'm not quite sure how to resolve. For example, I have a store that is operating in a zip code then it moves location the following year, and so forth, possibly moving upwards of 5 times over the 5 years of data I have.

I am trying to:

1. flag each instance where the store moves/closes by the year (e.g., open_2011, close_2012)

My data spans from 2010-2015 and I use a 1-month period to determine if open/closed before 2010 and 2015 based on the transaction dataset since I don't have data to tell when it opened/closed before the study period (e.g., if max(transaction_date) le '30-NOV-2015'd; if min(transaction_date) ge '01-FEB-2010'd)

2. Create a unique id for each store_id that moves (e.g., A0001_A, A0001_B, and so on...)

STORE INFO DATASET EXAMPLE:

data have0;
 infile cards truncover expandtabs;
 input STORE_ID $ zip_cd year ;
 format zip_cd z5. ;
 cards;
A0001 99993 2010 
A0001 99993 2011 
A0001 99994 2012 
A0001 99994 2013 
A0001 99994 2014 
A0001 99995 2015 
A0002 33999 2011 
A0002 33999 2012 
A0002 33999 2013 
A0002 33995 2014 
A0002 33993 2015 
A0003 33999 2010 
A0003 33999 2011 
A0003 33999 2012 
A0003 33999 2013 
A0003 33999 2014 
A0003 33999 2015 
A0004 33999 2014 
A0004 33999 2015 
A0005 33999 2011 
A0005 33999 2012 
A0005 33995 2013 
;
run;

STORE TRANSACTION DATASET:

data have1;
 infile cards truncover expandtabs;
 input STORE_ID $ transaction_id $ transaction_date :date9. ;
 format transaction_date date8. ;
 cards;
A0001 1 22jan2010  
A0001 2 11Oct2010  
A0001 3 16jun2011  
A0001 4 19apr2012  
A0001 5 10feb2013  
A0001 6 09nov2014  
A0001 7 04dec2014  
A0001 8 07dec2014  
A0001 9 22Oct2015  
A0001 10 30nov2015  
A0002 1 11jun2011  
A0002 2 11apr2014  
A0002 3 11feb2015  
A0003 1 09nov2010  
A0003 2 04dec2011  
A0003 3 22jan2012  
A0003 4 11Oct2012  
A0003 5 16jun2013  
A0003 6 19apr2014  
A0003 7 10feb2015  
A0003 8 09nov2015  
A0003 9 04dec2015  
A0003 10 04dec2015  
A0004 1 09nov2014  
A0004 2 04dec2014  
A0004 3 04dec2015  
;
run;

3. Merge back to the store transactions dataset to count the number of transactions that occurred at that store while it was open and before it closed/moved.

And get something like this:

data want;
 infile cards truncover expandtabs;
 input STORE_ID_unique $ zip_cd no_transactions open_2010 open_2011 open_2012 open_2013 open_2014 open_2015 close_2010 close_2011 close_2012 close_2013 close_2014 close_2015 ;
 format  zip_cd z5.;
 cards;
A0001_A 99993 3 . 0 0 0 0 0 0 1 0 0 0 0
A0001_B 99994 5 0 0 1 0 0 0 0 0 0 0 1 0
A0001_C 99995 2 0 0 0 0 0 1 0 0 0 0 0 .
A0002_A 33999 1 0 1 0 0 0 0 0 0 0 1 0 0
A0002_B 33995 1 0 0 0 0 1 0 0 0 0 0 1 0
A0002_C 33993 1 0 0 0 0 0 1 0 0 0 0 0 .
A0003 33999 10 . 0 0 0 0 0 0 0 0 0 0 .
A0004 33999 3 0 0 0 0 1 0 0 0 0 0 0 .
A0005_A 33999 0 0 1 0 0 0 0 0 0 1 0 0 0
A0005_B 33995 0 0 0 0 1 0 0 0 0 0 0 0 .
;
run;

I'm a little stuck on the correct procedures to accomplish this. If I'm not mistaken I should create a counter variable and/or concatenate a unique ID based on the store_id and counter_variable then create a conditional logic statement to consider each of these cases?

Any help would be appreciated.

Patrick · Posted 10-18-2021 06:16 PM

Below should work.


proc sql;
  create view store_open_close as
  select 
    l.store_id,
    l.zip_cd,
    min(l.year) as open_year,
    max(l.year) as close_year,
    min(r.transaction_date) as min_transaction_date format=date9.,
    max(r.transaction_date) as max_transaction_date format=date9.
  from 
    have0 l
    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)
  group by l.store_id, l.zip_cd
  ;
quit;

proc sql;
  create table long as
  select
    l.*,
    count(*) as n_transactions,
    s.open_year,
    s.close_year,
    s.min_transaction_date,
    s.max_transaction_date
  from 
    have0 l

    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)

    left join
    store_open_close s
      on l.store_id=s.store_id and l.zip_cd=s.zip_cd
    
  group by
    l.store_id, l.zip_cd, l.year, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date
  ;
quit;

proc transpose data=long out=wide(drop=_name_) prefix=year_;
  by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
  id year;
  var n_transactions;
run;

Btw: If your source tables reside in a database and you're dealing with high volume data then above code needs some change in order for SAS to push all of the processing to the database and only transfer the result set back to SAS.

View solution in original post

Patrick · Posted 10-18-2021 03:25 PM

Below close to what you've asked for. Have a look if this serves your purpose.

As a comment: It's often easier to work with a long data structure instead of a wide one.

data have0;
  infile cards truncover expandtabs;
  input STORE_ID $ zip_cd year;
  format zip_cd z5.;
  cards;
A0001 99993 2010 
A0001 99993 2011 
A0001 99994 2012 
A0001 99994 2013 
A0001 99994 2014 
A0001 99995 2015 
A0002 33999 2011 
A0002 33999 2012 
A0002 33999 2013 
A0002 33995 2014 
A0002 33993 2015 
A0003 33999 2010 
A0003 33999 2011 
A0003 33999 2012 
A0003 33999 2013 
A0003 33999 2014 
A0003 33999 2015 
A0004 33999 2014 
A0004 33999 2015 
A0005 33999 2011 
A0005 33999 2012 
A0005 33995 2013 
;

data have1;
  infile cards truncover expandtabs;
  input STORE_ID $ transaction_id $ transaction_date :date9.;
  format transaction_date date8.;
  cards;
A0001 1 22jan2010  
A0001 2 11Oct2010  
A0001 3 16jun2011  
A0001 4 19apr2012  
A0001 5 10feb2013  
A0001 6 09nov2014  
A0001 7 04dec2014  
A0001 8 07dec2014  
A0001 9 22Oct2015  
A0001 10 30nov2015  
A0002 1 11jun2011  
A0002 2 11apr2014  
A0002 3 11feb2015  
A0003 1 09nov2010  
A0003 2 04dec2011  
A0003 3 22jan2012  
A0003 4 11Oct2012  
A0003 5 16jun2013  
A0003 6 19apr2014  
A0003 7 10feb2015  
A0003 8 09nov2015  
A0003 9 04dec2015  
A0003 10 04dec2015  
A0004 1 09nov2014  
A0004 2 04dec2014  
A0004 3 04dec2015  
;

proc sql;
  create view store_open_close as
  select 
    store_id,
    zip_cd,
    min(year) as open_year,
    max(year) as close_year
  from have0
  group by store_id, zip_cd
  ;
quit;

proc sql;
  create table long as
  select
    l.*,
    count(*) as n_transactions,
    s.open_year,
    s.close_year
  from 
    have0 l

    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)

    left join
    store_open_close s
      on l.store_id=s.store_id and l.zip_cd=s.zip_cd
    
  group by
    l.store_id, l.zip_cd, l.year, s.open_year, s.close_year
  ;
quit;

proc transpose data=long out=wide(drop=_name_) prefix=year_;
  by store_id zip_cd open_year close_year;
  id year;
  var n_transactions;
run;

A_Swoosh · Posted 10-18-2021 04:56 PM

Hi, thank you for the reply. This gets close to what I'm looking for. Is there also a way to specify for each row in the wide dataset to create a variable that indicates the min(transaction_date) as the open date and max(transaction_date) as final date?

Patrick · Posted 10-18-2021 06:16 PM

Below should work.


proc sql;
  create view store_open_close as
  select 
    l.store_id,
    l.zip_cd,
    min(l.year) as open_year,
    max(l.year) as close_year,
    min(r.transaction_date) as min_transaction_date format=date9.,
    max(r.transaction_date) as max_transaction_date format=date9.
  from 
    have0 l
    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)
  group by l.store_id, l.zip_cd
  ;
quit;

proc sql;
  create table long as
  select
    l.*,
    count(*) as n_transactions,
    s.open_year,
    s.close_year,
    s.min_transaction_date,
    s.max_transaction_date
  from 
    have0 l

    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)

    left join
    store_open_close s
      on l.store_id=s.store_id and l.zip_cd=s.zip_cd
    
  group by
    l.store_id, l.zip_cd, l.year, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date
  ;
quit;

proc transpose data=long out=wide(drop=_name_) prefix=year_;
  by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
  id year;
  var n_transactions;
run;

Btw: If your source tables reside in a database and you're dealing with high volume data then above code needs some change in order for SAS to push all of the processing to the database and only transfer the result set back to SAS.

A_Swoosh · Posted 10-18-2021 06:49 PM

Thank you for the reply. This is exactly what I'm trying to accomplish. As for reference, I am dealing with high volume data (n= tens of millions) for the transaction data but I'm not sure what would need fixing?

A_Swoosh · Posted 10-18-2021 09:11 PM

Actually, for the final store 'A0005' the number of transactions should be 0 instead?

And, I encountered an error on the last step that said "ERROR: the id value "year_2015" occurs twice in the same BY group. I suppose it's for a case when the open/close date fall under the same year.

Patrick · Posted 10-18-2021 10:23 PM

@A_Swoosh wrote:

Actually, for the final store 'A0005' the number of transactions should be 0 instead?

And, I encountered an error on the last step that said "ERROR: the id value "year_2015" occurs twice in the same BY group. I suppose it's for a case when the open/close date fall under the same year.

@A_Swoosh Try below.


proc sql;
  create view store_open_close as
  select 
    l.store_id,
    l.zip_cd,
    min(l.year) as open_year,
    max(l.year) as close_year,
    min(r.transaction_date) as min_transaction_date format=date9.,
    max(r.transaction_date) as max_transaction_date format=date9.
  from 
    have0 l
    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)
  group by l.store_id, l.zip_cd
  ;
quit;

proc sql;
  create table long as
  select 
    l.store_id,
    l.zip_cd,
    l.year,
    sum(case when missing(r.transaction_date) then 0 else 1 end) as n_transactions,
    s.open_year,
    s.close_year,
    s.min_transaction_date,
    s.max_transaction_date
  from 
    have0 l

    left join
    have1 r
      on l.store_id=r.store_id and l.year=year(r.transaction_date)

    left join
    store_open_close s
      on l.store_id=s.store_id and l.zip_cd=s.zip_cd
    
  group by
    l.store_id, l.zip_cd, s.open_year, s.close_year, s.min_transaction_date, s.max_transaction_date,
    l.year
  ;
quit;

proc transpose data=long out=wide(drop=_name_) prefix=year_;
  by store_id zip_cd open_year close_year min_transaction_date max_transaction_date;
  id year;
  var n_transactions;
run;

A_Swoosh · Posted 10-19-2021 02:39 PM

@Patrick

that didn't seem to fix it. I still get the same error "the id value year_2015' occurs twice in the same BY group. Additionally, I'm getting duplicate rows for each of these store_id and zip_cd.

I'm fine keeping it as a long dataset then I can just proc sql by store_id and zip_cd to count the number of transactions, and create any other indicators I need to distinguish these changes.

Classroom Training Available!