pulpfiction123
Calcite | Level 5

Hello everyone,

 

I have a transposed dataset with the IDs as columns; each ID has an additional 'counter' column, and there is a Date column. Here's a sample of my dataset:

 

Data have;
Input DATE2 :ddmmyy10. ID_1 ID_2 Counter_ID_1 Counter_ID_2;
Format DATE2 ddmmyy10.;
datalines;
01/03/2019 630 . . .
01/04/2019 90 . . .
01/05/2019 2 . . .
01/05/2019 112 . . .
01/01/2023 . 20 . .
01/01/2023 . 100 . .
01/01/2023 . 2 . .
01/01/2023 . 10 . .
01/01/2023 . 420 . .
01/02/2023 . 98 . .
;
Run;

I want to loop through every row: for each record for every ID, add +1 to its associated counter column for each previous record (sorted by date) that has a different date. E.g.: if a record has 2 preceding records for the same ID with a different date, and 1 with the same date, the counter value for that record should be 2. Here's what it should look like:

 

Data want;
Input DATE2 :ddmmyy10. ID_1 ID_2 Counter_ID_1 Counter_ID_2;
Format DATE2 ddmmyy10.;
datalines;
01/03/2019 630 . . .
01/04/2019 90 . 1 .
01/05/2019 2 . 2 .
01/05/2019 112 . 2 .
01/01/2023 . 20 . .
01/01/2023 . 100 . .
01/01/2023 . 2 . .
01/01/2023 . 10 . .
01/01/2023 . 420 . .
01/02/2023 . 98 . 5
;
Run;

I have the below code, which creates 2 arrays of the same dimensions (one storing each ID, and one storing the associated counter columns). My code supposedly loops through each ID and compares each row to all the previous rows using the LAG function, adding +1 to the counter column if it's the same ID and a different date. However, the result is not entirely accurate: I get wrong results in cases where the preceding records include both same and different dates. Below is a sample of my code:

 

Data _NULL_;
Call symputx('nrows', 10);
Run;

data test;
set have;
array IDS{*} ID_:;
array IDS2{*} Counter_:;
length tracker $100;
%macro mymacro;
Do i=1 to dim(IDS);
	%do x=1 %to &nrows.;
	/*Only getting observations for the same ID*/
	If IDS(i) ne .
	AND lag&x.(IDS(i)) ne .
	AND DATE2 ne lag&x.(DATE2)
	Then do;
		IDS2(i)+1;
	End;
%end;
End;
%mend;

%mymacro;
Drop i;
Run;

Can anyone tell me where I'm going wrong, or suggest an alternative way to do this? Thanks!

7 REPLIES
antonbcristina
SAS Employee

I'm not entirely sure I understand the problem. I think your first task should be to fix the initial transpose of the data so that each observation contains one ID and date. Handling multiple IDs in the same observation is harder, and it doesn't seem to be part of the requirements anyway. Once the data is in this simpler form, I would suggest a FIRST./LAST. approach to get a simple counter for each ID/date combination. Some good examples here: https://documentation.sas.com/doc/en/lrcon/9.4/n01a08zkzy5igbn173zjz82zsi1s.htm
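To illustrate the idea, here is a minimal FIRST./LAST. sketch. The dataset and variable names (have_long, with one row per transaction and columns id, date, amount) are hypothetical, since the original long-format data isn't shown in the post:

```sas
/* sort so that the FIRST./LAST. flags are meaningful */
proc sort data=have_long;
  by id date;
run;

data counts;
  set have_long;
  by id date;
  if first.id then day_count = 0;  /* reset the counter for each ID */
  day_count + first.date;          /* +1 once per distinct date within the ID */
run;
```

Here day_count ends up as the number of distinct dates seen so far for the ID, including the current one.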

 

 

pulpfiction123
Calcite | Level 5

Hi Cristina, thank you for the response and the suggestion. The source dataset (before I transformed it) does have only one ID column holding the different ID values, plus transaction dates/amounts etc., as you would expect. I had indeed tried this approach before, using FIRST.ID and LAST.ID and looping through each transaction to compare them. However, my issue is that each ID has a different number of transactions, and I have to compare each transaction to all the preceding transactions.

 

E.g.: if ID1 has 10 transactions, I need to compare transaction 1 to transactions 2-10, then transaction 2 to transactions 3-10, and so forth. I could not find a dynamic way to reflect this in the LAG function on each loop iteration. My other attempt was to loop through all the transactions regardless of ID and skip the comparisons involving different IDs, but that took too long to run.

 

Hope this clarifies things, and happy to hear if there are better ways of going about this!

Tom
Super User Tom
Super User

Why does the source data already have the counter variables?  They appear to be all missing on every record?  Is that expected?  

 

Why did you transpose the data?   Is there some special meaning to the values of ID_1 and ID_2 being on the same observation?  If so you don't show any examples of that since on all of your records only one of ID_1 or ID_2 has a value.

 

Does it make any difference if the ID value appears in the first or the second variable?  For example you have ID=2 in the first variable on 01MAY2019 and then in the second variable on 01JAN2023.  So should the count be 2 on that record from 2023? Or 1 since it is the first time it appeared in the ID_2 column?

 

Your example results make no sense to me.

Why would the third record have a counter of 2?  That is the first time that ID_1=2 appeared?  Where did the 5 at the end come from?  Why are you counting from zero instead of one, but storing the zeros as missing instead of zero?

 

 

pulpfiction123
Calcite | Level 5

Hi Tom, thanks for your questions. Happy to clarify:

 

1) The data I provided is not the source data; the source data contains only one ID column with different ID values, as you might expect. Each ID can have multiple transaction records with different dates and transaction values. The dataset I provided in the post is what I got after transposing the data by Date, turning each ID into its own column (ID1, ID2), with the values of those columns being the transaction values. The 'Counter' columns are all missing initially as I have just created them; there is one per ID. The intention is to add +1 to each ID's 'Counter' column for each row, based on the number of preceding records that have a different date. E.g.: if person 1 has 10 transactions, I'll need to compare transaction 1 to transactions 2-10, transaction 2 to transactions 3-10, and so on, and add the result to that ID's counter column at the transaction level.

 

2) The data was transposed because it was too difficult to loop through each ID and count how many transactions matched the criteria, as each ID can have a different number of transactions. My initial solution did not finish running within an hour on the full dataset (which contains many more IDs and records), whereas after transposing it ran in little time (albeit with inaccurate results).

 

3) The values of the first and second ID variables are the transaction values for each ID; the amount matters less than whether it is missing or not. If the value is populated, that ID has a transaction of that amount on that date. If it is missing, that ID has no transaction on that date. In the two sample cases I provided, ID1 and ID2 never have transactions on the same day, hence only one of the two columns is populated in each observation.

 

ID1 has 4 transactions in total: one on 1st March, one on 1st April, and two on 1st May. ID2 has 6 transactions in total: five on the same day (1st January) and one on 1st February.

 

Hence, in the expected results, ID1's Counter column has a value of 2 for the third record, as there are 2 preceding transactions (the dataset is sorted by date) for ID1 with a different date. The fourth record should also have a value of 2, because the 3rd and 4th records share the same date and thus do not count toward each other.

Similarly for the ID2 counter column: the 5 at the end is there because ID2 has 5 preceding transactions with a different date, and that is also why the counter is missing on the earlier rows.

As to why the values are missing instead of 0, it's an oversight on my part - I should initialize the counter columns to be 0. 

 

Hope that clarifies it, and apologies if the initial post was not very clear!
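For what it's worth, the rule as described ("count preceding transactions for the same ID with a different date") can be computed in one pass over the long (un-transposed) form, assuming it is sorted by ID and date. The dataset and helper variable names here are hypothetical, not from the post:

```sas
data want_long;
  set have_long;
  by id date2;
  if first.id then seq = 0;         /* transactions seen so far for this ID */
  if first.date2 then start = seq;  /* rows before the current date group */
  seq + 1;
  counter = start;                  /* = preceding rows with a different date */
  drop seq start;
run;
```

Because all same-date rows sit together after sorting, the number of rows before the current date group is exactly the number of preceding transactions with a different date, which gives 0, 1, 2, 2 for ID1 and 0, 0, 0, 0, 0, 5 for ID2 in the sample.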

 

Tom
Super User Tom
Super User

So you just want to sort by ID and DATE and count.

data want;
  set have;
  by id date;
  if first.id then counter=0;
  counter+1;
run;

But if the data is not sorted (or indexed) by ID to begin with then sorting could take a lot of time.

 

If the data is already sorted by DATE (which it had to be for your TRANSPOSE step to work) then you can use a temporary array or a HASH object to do it without sorting.

 

If the set of values of ID is smallish then an ARRAY is probably simpler and faster.  For example if the ID values are integers with values from 1 to 999999 (so 6 digits) you just need an array with 999999 entries, which is not much memory nowadays.

data want ;
  set have ;
  by date;
  array counts[999999] _temporary_ (999999*0);
  counts[id]+1;
  counter=counts[id];
run;

Using a hash is the same, but takes more code as you need more code to define and work with the hash.

 

Let's do it with your example data:

data have ;
  input date :ddmmyy. id amount;
  format date yymmdd10. ;
datalines;
01/03/2019 1 630      
01/04/2019 1 90      
01/05/2019 1 2      
01/05/2019 1 112      
01/01/2023 2  20    
01/01/2023 2  100    
01/01/2023 2  2    
01/01/2023 2  10    
01/01/2023 2  420    
01/02/2023 2  98    
;

data want;
  set have;
  by date;
  array counts[1000] _temporary_ (1000*0);
  counts[id]+1;
  counter=counts[id];
run;

data want2;
  set have;
  by date;
  if _n_=1 then do;
    declare hash h();
    h.definekey('id');
    h.definedata('counter');
    h.definedone();
  end;
  /* load the running count for this id, or start at 0 for a new id */
  if h.find() ne 0 then counter=0;
  counter+1;
  h.replace();
run;

proc print data=want; run;
proc print data=want2; run;

proc compare data=want compare=want2; run;

Result

Obs          date    id    amount    counter

  1    2019-03-01     1      630        1
  2    2019-04-01     1       90        2
  3    2019-05-01     1        2        3
  4    2019-05-01     1      112        4
  5    2023-01-01     2       20        1
  6    2023-01-01     2      100        2
  7    2023-01-01     2        2        3
  8    2023-01-01     2       10        4
  9    2023-01-01     2      420        5
 10    2023-02-01     2       98        6

PS Displaying dates in either DMY or MDY order will confuse half of your audience.

Tom
Super User Tom
Super User

And if you want to count only the DAYS that have transactions instead of the transactions it is much easier if the data is sorted.

proc sort data=have;
  by id date; 
run;

data want;
  set have;
  by id date;
  if first.id then call missing(counter1,counter2);
  counter1+1;
  counter2+first.date;
run;

proc print;
run;
Obs          date    id    amount    counter1    counter2

  1    2019-03-01     1      630         1           1
  2    2019-04-01     1       90         2           2
  3    2019-05-01     1        2         3           3
  4    2019-05-01     1      112         4           3
  5    2023-01-01     2       20         1           1
  6    2023-01-01     2      100         2           1
  7    2023-01-01     2        2         3           1
  8    2023-01-01     2       10         4           1
  9    2023-01-01     2      420         5           1
 10    2023-02-01     2       98         6           2
Patrick
Opal | Level 21

Piggybacking on the code @Tom proposed, here is an approach with a cumulative count of transaction days per ID (i.e., count once per date and ID).

The source data must be sorted by date but not by ID within a date - which is how transactional data is often ordered.

data have ;
  input date :ddmmyy. id amount;
  format date yymmdd10. ;
datalines;
01/03/2019 1 630      
01/04/2019 1 90      
01/05/2019 1 2      
01/05/2019 1 112      
01/01/2023 2  20    
01/01/2023 2  100    
01/01/2023 2  2    
01/01/2023 2  10    
01/01/2023 2  420    
01/02/2023 2  98    
;

data want2;
  if _n_=1 then do;
    /* hash for cumulative count per id and date */
    declare hash h_cnt();
    h_cnt.definekey('id');
    h_cnt.defineData('counter');
    h_cnt.definedone();

    /* hash to keep track for which id already counted within a date */
    declare hash h_id();
    h_id.definekey('id','date');
    h_id.definedone();
  end;

  set have;
  by date;

  /* not a must but keeps memory consumption lower */
  if first.date then h_id.clear();

  /* if new id set counter=0 else read current count from hash */
  if h_cnt.find() ne 0 then counter=0;

  /* if counter not increased yet for id and specific date */
  if h_id.check() ne 0 then
    do;
      _rc=h_id.add();
      counter+1;
      _rc=h_cnt.replace();
    end;
  drop _rc;
run;

proc print data=want2; 
run;

[Image: PROC PRINT output of the want2 dataset]

 

