Hello everyone!
I'd like to increment a specific column that corresponds to a character value of another column.
The data is
data have;
input customer_ID product $1. count_A count_B count_C;
datalines;
1 A 0 0 0
1 A 0 0 0
1 B 0 0 0
1 C 0 0 0
2 B 0 0 0
2 C 0 0 0
2 C 0 0 0
;
run;
The data I want
data want;
input customer_ID product $1. count_A count_B count_C;
datalines;
1 A 1 0 0
1 A 2 0 0
1 B 0 1 0
1 C 0 0 1
2 B 0 1 0
2 C 0 0 1
2 C 0 0 2
;
run;
I tried using call symput
data didnt_want;
set have;
by customer_ID product;
serial + 1;
if first.cusomer_ID or first.product then serial = 1;
call symput('current_product', product);
count_¤t_product + serial;
run;
but it turns out that the call symput is excecuted after the data step, and it does not store the current value of the product while processing each row.
How could I solve this without a ridiculus about of if statements (true data has many many products)?
Then use a single retained value to count which row this is for the product and then add the row value to the appropriate column.
You could eliminate the extra array by just putting the product values directly into the WHICHC() function call (note that then you can only use commas as the delimiter between the values instead of commas or spaces).
data have;
input customer_ID product $1. count_a count_b count_c ;
datalines;
1 A 0 0 0
1 A 0 0 0
1 B 0 0 0
1 C 0 0 0
2 B 0 1 1
2 C 0 1 1
2 C 0 1 1
;
data want;
set have;
by customer_id product ;
array counts count_a count_b count_c ;
row+1;
if first.product then row=1;
index=whichc(product,'A','B','C');
counts[index]+row;
drop index row;
run;
Results:
customer_ Obs ID product count_a count_b count_c 1 1 A 1 0 0 2 1 A 2 0 0 3 1 B 0 1 0 4 1 C 0 0 1 5 2 B 0 2 1 6 2 C 0 1 2 7 2 C 0 1 3
You could also remove the INDEX variable by using the result from the WHICHC() function call directly in the array reference. But keeping it makes the error clearer when you get issues with unexpected values of PRODUCT.
Does the answer have to be such that it scales well to data sets of arbitrary size (let's say a million records)?
Can the answer assume that there are just 3 products, A B and C, or could there be potentially hundreds or thousands?
Preferably.
There are dozens of products and there are millions of rows.
That is doable, but it's an odd structure. It looks like you only want one of the variables Count_A, Count_B, Count_C to have a non-zero value. Wouldn't it be clearer (and easier) to have just a single Count variable? i.e. would you be happy with:
data want;
input customer_ID product $1. count;
datalines;
1 A 1
1 A 2
1 B 1
1 C 1
2 B 1
2 C 1
2 C 2
;
run;
?
Thank you! But unfortunatly in the real case the other values might be non-zero and just the column corresponding to the current value of the product needs to be incremented.
@Termpu wrote:
Thank you! But unfortunatly in the real case the other values might be non-zero and just the column corresponding to the current value of the product needs to be incremented.
You don't show any examples of non-zeros in the other columns, I think we would need to see such examples if we are going to code this. I also don't understand from what you have set how you can end up with non-zeros in more than one column.
Thank you for the feedback.
The case might for example be, that have is like
data have;
input customer_ID product $1. count_A count_B count_C;
datalines;
1 A 0 0 0
1 A 0 0 0
1 B 0 0 0
1 C 0 0 0
2 B 0 1 1
2 C 0 1 1
2 C 0 1 1
;
run;
and consequently want would be like
data want;
input customer_ID product $1. count_A count_B count_C;
datalines;
1 A 1 0 0
1 A 2 0 0
1 B 0 1 0
1 C 0 0 1
2 B 0 2 1
2 C 0 1 2
2 C 0 1 3
;
run;
I'm sorry for not including that in the original post.
@Termpu wrote:
Thank you for the feedback.
The case might for example be, that have is like
data have; input customer_ID product $1. count_A count_B count_C; datalines; 1 A 0 0 0 1 A 0 0 0 1 B 0 0 0 1 C 0 0 0 2 B 0 1 1 2 C 0 1 1 2 C 0 1 1 ; run;
and consequently want would be like
data want; input customer_ID product $1. count_A count_B count_C; datalines; 1 A 1 0 0 1 A 2 0 0 1 B 0 1 0 1 C 0 0 1 2 B 0 2 1 2 C 0 1 2 2 C 0 0 3 ; run;
I'm sorry for not including that in the original post.
Again, this is not making sense. Why does count_a have zeros after the first two lines, while count_c has no more zeros after the first time C appears?
The customer_ID changed, and customer 2 had initially count_B=1 and count_C=1.
But yes, there was a mistake in my original "want", the count_B on the final row should've been 1.
Then use a single retained value to count which row this is for the product and then add the row value to the appropriate column.
You could eliminate the extra array by just putting the product values directly into the WHICHC() function call (note that then you can only use commas as the delimiter between the values instead of commas or spaces).
data have;
input customer_ID product $1. count_a count_b count_c ;
datalines;
1 A 0 0 0
1 A 0 0 0
1 B 0 0 0
1 C 0 0 0
2 B 0 1 1
2 C 0 1 1
2 C 0 1 1
;
data want;
set have;
by customer_id product ;
array counts count_a count_b count_c ;
row+1;
if first.product then row=1;
index=whichc(product,'A','B','C');
counts[index]+row;
drop index row;
run;
Results:
customer_ Obs ID product count_a count_b count_c 1 1 A 1 0 0 2 1 A 2 0 0 3 1 B 0 1 0 4 1 C 0 0 1 5 2 B 0 2 1 6 2 C 0 1 2 7 2 C 0 1 3
You could also remove the INDEX variable by using the result from the WHICHC() function call directly in the array reference. But keeping it makes the error clearer when you get issues with unexpected values of PRODUCT.
@Termpu wrote:
Thank you! But unfortunatly in the real case the other values might be non-zero and just the column corresponding to the current value of the product needs to be incremented.
That makes no sense. What happens if the first observations with PRODUCT=A has a count of 100 and the second observation started with a count of 200?
Show some examples of these in-coming counts and how they get adjusted.
Also explain what the heck is the real world problem that this technique is supposed to help with?
I don't know why this site does not show some of my responses.. so for the second time:
Ok, my dummy example falls short of the complexity of the real situation. You're right, the situation you described (and my reply allows) is not possible. If the first COUNT_A = 100 in the "have" dataset, then the second row also has COUNT_A=100, given that the customer_ID does not change.
As for the possible in-coming data, this might be possible:
data have; input customer_ID product $1. count_A count_B count_C; datalines; 1 A 100 100 100 1 A 100 100 100 1 B 100 100 100 1 C 100 100 100 2 B 10 10 20 2 C 10 10 20 2 C 10 10 20 ; run;
requiring the want to be like:
data want; input customer_ID product $1. count_A count_B count_C; datalines; 1 A 101 100 100 1 A 102 100 100 1 B 100 101 100 1 C 100 100 101 2 B 10 11 20 2 C 10 10 21 2 C 10 10 22 ; run;
And finally, the real problem:
Each row in the have dataset corresponds to a hypothetical order made by a customer and the counts are calculated based on the order history of the customer. With other features, the counts describe the customers historical behavior and they are used to form a customer profile. Now, the row being a new hypothetical order, the history has to be updated to show that order also in the history, hence the increment. The product is subscription based and consists of terms (or periods). One term corresponds to one row. The first row of each product corresponds to the first term of the subscription, and the second row to the second. Thus, we need another increment in the case of two consecutive instances of the same product.
And finally, the row is given as an input for a predictive model, that outputs me the thing that I need.
One might ask what your are going to do next with this data.
One way that may be a bit cumbersome if there are many products.
data have; input customer_ID product $1. ; datalines; 1 A 1 A 1 B 1 C 2 B 2 C 2 C ; proc sort data=have; by customer_ID product; run; data want; set have; array count(*) count_a Count_b count_c; retain temp count_a Count_b count_c; array vals (3) $ _temporary_ ('A','B','C'); by customer_ID product; if first.product then do i=1 to dim(count); count[i]=0; end; If first.product then temp=whichc(product,of vals(*)); if temp ne 0 then count[temp]= sum(count[temp],1); drop temp i; run;
Note that the Want data set will not work as intended if the Count variables are present because of the way that Retain, which keeps values across iterations of the data step works. If the Retained variable is in the source set then the value read is the one used for the current execution.
The sort, while possibly implied by your data would be needed to make sure all of your products are together to increment correctly. Otherwise you need to double the number of variables and complexity of the code.
The array with values is temporary and does not write the variables to the data set. The values must match the order of the variables on the Count array definition.
WHICHC is a function which returns the position number of the first value in the function in the remaining values so can be used to look up things in an array or set items. I have included a check for 0 in case you have a product in the data that you missed in the definitions earlier.
With Retained values you need to reset them when no longer needed, which in this case means each time the Product changes (see why the sort was needed).
But I really don't see any advantage to this data structure off hand.
You cannot RETAIN an existing variable (actually technically it is already being retained, but that is moot because the value is overwritten when the SET statement reads the next observation).
data have;
input customer_ID product $1. ;
datalines;
1 A
1 A
1 B
1 C
2 B
2 C
2 C
;
So to make NEW variables you can use a step like this:
data want;
set have;
by customer_id product ;
array counts count_a count_b count_c ;
array values [3] $1 _temporary_ ('A','B','C');
if first.product then call missing(of counts[*]);
index=whichc(product,of values(*));
counts[index]+1;
drop index;
run;
Which produces this output:
customer_ Obs ID product count_a count_b count_c 1 1 A 1 . . 2 1 A 2 . . 3 1 B . 1 . 4 1 C . . 1 5 2 B . 1 . 6 2 C . . 1 7 2 C . . 2
If you want the zeros then work a little harder when starting a new product.
if first.product then do index=1 to dim(counts);
counts[index]=0;
end;
If you don't know the set of unique values of PRODUCT in advance then you could use an extra pass through the data to find them. If the set of values is small enough you could put them into macro variables you could use to generate the data step.
proc sql noprint; select distinct quote(trim(product)) , cats('count_',product) , max(length(product)) into :values separated by ' ' , :counts separated by ' ' , :maxlen trimmed from have ; %let n_products=&sqlobs; quit; data want; set have; by customer_id product ; array counts &counts; array values [&n_products] $&maxlen _temporary_ (&values); if first.product then do index=1 to &n_products; counts[index]=0; end; index=whichc(product,of values(*)); counts[index]+1; drop index; run;
Result
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.