Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- How to increment a column based on the character value of a row?

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

☑ This topic is **solved**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 04-24-2023 10:56 AM
(1907 views)

Hello everyone!

I'd like to increment a specific column that corresponds to a character value of another column.

The data is

```
data have;
input customer_ID product $1. count_A count_B count_C;
datalines;
1 A 0 0 0
1 A 0 0 0
1 B 0 0 0
1 C 0 0 0
2 B 0 0 0
2 C 0 0 0
2 C 0 0 0
;
run;
```

The data I want

```
data want;
input customer_ID product $1. count_A count_B count_C;
datalines;
1 A 1 0 0
1 A 2 0 0
1 B 0 1 0
1 C 0 0 1
2 B 0 1 0
2 C 0 0 1
2 C 0 0 2
;
run;
```

I tried using call symput

`data didnt_want;`

set have;

by customer_ID product;

serial + 1;

if first.cusomer_ID or first.product then serial = 1;

call symput('current_product', product);

count_¤t_product + serial;

run;

but it turns out that the call symput is excecuted after the data step, and it does not store the current value of the product while processing each row.

How could I solve this without a ridiculus about of if statements (true data has many many products)?

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Then use a single retained value to count which row this is for the product and then add the row value to the appropriate column.

You could eliminate the extra array by just putting the product values directly into the WHICHC() function call (note that then you can only use commas as the delimiter between the values instead of commas or spaces).

```
data have;
input customer_ID product $1. count_a count_b count_c ;
datalines;
1 A 0 0 0
1 A 0 0 0
1 B 0 0 0
1 C 0 0 0
2 B 0 1 1
2 C 0 1 1
2 C 0 1 1
;
data want;
set have;
by customer_id product ;
array counts count_a count_b count_c ;
row+1;
if first.product then row=1;
index=whichc(product,'A','B','C');
counts[index]+row;
drop index row;
run;
```

Results:

customer_ Obs ID product count_a count_b count_c 1 1 A 1 0 0 2 1 A 2 0 0 3 1 B 0 1 0 4 1 C 0 0 1 5 2 B 0 2 1 6 2 C 0 1 2 7 2 C 0 1 3

You could also remove the INDEX variable by using the result from the WHICHC() function call directly in the array reference. But keeping it makes the error clearer when you get issues with unexpected values of PRODUCT.

13 REPLIES 13

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Does the answer have to be such that it scales well to data sets of arbitrary size (let's say a million records)?

Can the answer assume that there are just 3 products, A B and C, or could there be potentially hundreds or thousands?

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Preferably.

There are dozens of products and there are millions of rows.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

That is doable, but it's an odd structure. It looks like you only want one of the variables Count_A, Count_B, Count_C to have a non-zero value. Wouldn't it be clearer (and easier) to have just a single Count variable? i.e. would you be happy with:

```
data want;
input customer_ID product $1. count;
datalines;
1 A 1
1 A 2
1 B 1
1 C 1
2 B 1
2 C 1
2 C 2
;
run;
```

?

The Boston Area SAS Users Group is hosting ** free webinars**!

Next up:**Bart Jablonski and I** present 53 (+3) ways to do a table lookup on Wednesday Sep 18.

Register now at https://www.basug.org/events.

Next up:

Register now at https://www.basug.org/events.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Termpu wrote:

You don't show any examples of non-zeros in the other columns, I think we would need to see such examples if we are going to code this. I also don't understand from what you have set how you can end up with non-zeros in more than one column.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you for the feedback.

The case might for example be, that have is like

```
data have;
input customer_ID product $1. count_A count_B count_C;
datalines;
1 A 0 0 0
1 A 0 0 0
1 B 0 0 0
1 C 0 0 0
2 B 0 1 1
2 C 0 1 1
2 C 0 1 1
;
run;
```

and consequently want would be like

```
data want;
input customer_ID product $1. count_A count_B count_C;
datalines;
1 A 1 0 0
1 A 2 0 0
1 B 0 1 0
1 C 0 0 1
2 B 0 2 1
2 C 0 1 2
2 C 0 1 3
;
run;
```

I'm sorry for not including that in the original post.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Termpu wrote:

Thank you for the feedback.

The case might for example be, that have is like

`data have; input customer_ID product $1. count_A count_B count_C; datalines; 1 A 0 0 0 1 A 0 0 0 1 B 0 0 0 1 C 0 0 0 2 B 0 1 1 2 C 0 1 1 2 C 0 1 1 ; run;`

and consequently want would be like

`data want; input customer_ID product $1. count_A count_B count_C; datalines; 1 A 1 0 0 1 A 2 0 0 1 B 0 1 0 1 C 0 0 1 2 B 0 2 1 2 C 0 1 2 2 C 0 0 3 ; run;`

I'm sorry for not including that in the original post.

Again, this is not making sense. Why does count_a have zeros after the first two lines, while count_c has no more zeros after the first time C appears?

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

The customer_ID changed, and customer 2 had initially count_B=1 and count_C=1.

But yes, there was a mistake in my original "want", the count_B on the final row should've been 1.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Then use a single retained value to count which row this is for the product and then add the row value to the appropriate column.

You could eliminate the extra array by just putting the product values directly into the WHICHC() function call (note that then you can only use commas as the delimiter between the values instead of commas or spaces).

```
data have;
input customer_ID product $1. count_a count_b count_c ;
datalines;
1 A 0 0 0
1 A 0 0 0
1 B 0 0 0
1 C 0 0 0
2 B 0 1 1
2 C 0 1 1
2 C 0 1 1
;
data want;
set have;
by customer_id product ;
array counts count_a count_b count_c ;
row+1;
if first.product then row=1;
index=whichc(product,'A','B','C');
counts[index]+row;
drop index row;
run;
```

Results:

customer_ Obs ID product count_a count_b count_c 1 1 A 1 0 0 2 1 A 2 0 0 3 1 B 0 1 0 4 1 C 0 0 1 5 2 B 0 2 1 6 2 C 0 1 2 7 2 C 0 1 3

You could also remove the INDEX variable by using the result from the WHICHC() function call directly in the array reference. But keeping it makes the error clearer when you get issues with unexpected values of PRODUCT.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Termpu wrote:

That makes no sense. What happens if the first observations with PRODUCT=A has a count of 100 and the second observation started with a count of 200?

Show some examples of these in-coming counts and how they get adjusted.

Also explain what the heck is the real world problem that this technique is supposed to help with?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

I don't know why this site does not show some of my responses.. so for the second time:

Ok, my dummy example falls short of the complexity of the real situation. You're right, the situation you described (and my reply allows) is not possible. If the first COUNT_A = 100 in the "have" dataset, then the second row also has COUNT_A=100, given that the customer_ID does not change.

As for the possible in-coming data, this might be possible:

data have; input customer_ID product $1. count_A count_B count_C; datalines; 1 A 100 100 100 1 A 100 100 100 1 B 100 100 100 1 C 100 100 100 2 B 10 10 20 2 C 10 10 20 2 C 10 10 20 ; run;

requiring the want to be like:

data want; input customer_ID product $1. count_A count_B count_C; datalines; 1 A 101 100 100 1 A 102 100 100 1 B 100 101 100 1 C 100 100 101 2 B 10 11 20 2 C 10 10 21 2 C 10 10 22 ; run;

And finally, the real problem:

Each row in the have dataset corresponds to a hypothetical order made by a customer and the counts are calculated based on the order history of the customer. With other features, the counts describe the customers historical behavior and they are used to form a customer profile. Now, the row being a new hypothetical order, the history has to be updated to show that order also in the history, hence the increment. The product is subscription based and consists of terms (or periods). One term corresponds to one row. The first row of each product corresponds to the first term of the subscription, and the second row to the second. Thus, we need another increment in the case of two consecutive instances of the same product.

And finally, the row is given as an input for a predictive model, that outputs me the thing that I need.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

One might ask what your are going to do next with this data.

One way that may be a bit cumbersome if there are many products.

data have; input customer_ID product $1. ; datalines; 1 A 1 A 1 B 1 C 2 B 2 C 2 C ; proc sort data=have; by customer_ID product; run; data want; set have; array count(*) count_a Count_b count_c; retain temp count_a Count_b count_c; array vals (3) $ _temporary_ ('A','B','C'); by customer_ID product; if first.product then do i=1 to dim(count); count[i]=0; end; If first.product then temp=whichc(product,of vals(*)); if temp ne 0 then count[temp]= sum(count[temp],1); drop temp i; run;

Note that the Want data set will not work as intended if the Count variables are present because of the way that Retain, which keeps values across iterations of the data step works. If the Retained variable is in the source set then the value read is the one used for the current execution.

The sort, while possibly implied by your data would be needed to make sure all of your products are together to increment correctly. Otherwise you need to double the number of variables and complexity of the code.

The array with values is temporary and does not write the variables to the data set. The values must match the order of the variables on the Count array definition.

WHICHC is a function which returns the position number of the first value in the function in the remaining values so can be used to look up things in an array or set items. I have included a check for 0 in case you have a product in the data that you missed in the definitions earlier.

With Retained values you need to reset them when no longer needed, which in this case means each time the Product changes (see why the sort was needed).

But I really don't see any advantage to this data structure off hand.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You cannot RETAIN an existing variable (actually technically it is already being retained, but that is moot because the value is overwritten when the SET statement reads the next observation).

```
data have;
input customer_ID product $1. ;
datalines;
1 A
1 A
1 B
1 C
2 B
2 C
2 C
;
```

So to make NEW variables you can use a step like this:

```
data want;
set have;
by customer_id product ;
array counts count_a count_b count_c ;
array values [3] $1 _temporary_ ('A','B','C');
if first.product then call missing(of counts[*]);
index=whichc(product,of values(*));
counts[index]+1;
drop index;
run;
```

Which produces this output:

customer_ Obs ID product count_a count_b count_c 1 1 A 1 . . 2 1 A 2 . . 3 1 B . 1 . 4 1 C . . 1 5 2 B . 1 . 6 2 C . . 1 7 2 C . . 2

If you want the zeros then work a little harder when starting a new product.

```
if first.product then do index=1 to dim(counts);
counts[index]=0;
end;
```

If you don't know the set of unique values of PRODUCT in advance then you could use an extra pass through the data to find them. If the set of values is small enough you could put them into macro variables you could use to generate the data step.

proc sql noprint; select distinct quote(trim(product)) , cats('count_',product) , max(length(product)) into :values separated by ' ' , :counts separated by ' ' , :maxlen trimmed from have ; %let n_products=&sqlobs; quit; data want; set have; by customer_id product ; array counts &counts; array values [&n_products] $&maxlen _temporary_ (&values); if first.product then do index=1 to &n_products; counts[index]=0; end; index=whichc(product,of values(*)); counts[index]+1; drop index; run;

Result

Are you ready for the spotlight? We're accepting content ideas for **SAS Innovate 2025** to be held May 6-9 in Orlando, FL. The call is **open **until September 16. Read more here about **why** you should contribute and **what is in it** for you!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.