Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- Categorize observations based on cumulative sums for two variables

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 09-24-2021 04:51 AM
(630 views)

Dear SAS experts

I would like to create a categorical variable that include all observations that sum up to certain values based on two numerical variables.

Given this simple dataset:

data example;

input value1 value2;

datalines;

2 5

3 5

6 10

2 4

3 5

2 5

;

run;

I want to categorize the observations (as they are ordered here) based on running sums. When the running sum reaches >=5 for value1 AND >=10 for value2 a category should be created. Then, the process is repeated again. The resulting dataset should look like:

value1 value2 cat_var

2 5 1

3 5 1

6 10 2

2 4 3

3 5 3

2 5 3

For category 1 it takes two observations before the conditions described above are met. For category 2 the condition is already met in the 3^{rd} observation and therefore only one observation is included in category 2. For category 3, while the condition is met for value1 already in the 5^{th} observation for value2 it is met on the 6^{th} observation. Therefore 3 observations are included in category 3.

Does anyone have a suggestion on which syntax I can use to create such a categorical variable?

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Try this

```
data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;
data want(drop = s:);
set example;
if not s1 then cat_var + 1;
s1 + value1;
s2 + value2;
if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;
run;
```

Result:

value1 value2 cat_var 2 5 1 3 5 1 6 10 2 2 4 3 3 5 3 2 5 3

9 REPLIES 9

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Try this

```
data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;
data want(drop = s:);
set example;
if not s1 then cat_var + 1;
s1 + value1;
s2 + value2;
if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;
run;
```

Result:

value1 value2 cat_var 2 5 1 3 5 1 6 10 2 2 4 3 3 5 3 2 5 3

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Dear PeterClemmensen

It appears to work in this small example and I will try it out in my large dataset which 'example' is meant to reflect.

Could you please briefly walk me through your code? I tried to understand (few comments/questions below) it but I do not understand all of it.

data want(drop = s:); /* Drop all variables which start with an s */

set example;

if not s1 then cat_var + 1; /* cat_var is created and increases by 1 when the condition to the left is met. What is meant by "if not s1"? */

s1 + value1; /* In (new) variables s1 and s2 cum sum of value1 and value2 are computed */

s2 + value2;

if s1 ge 5 and s2 ge 10 then do; /* It must have something to do with the categorization */

s1 = 0; s2 = 0;

end;

run;

Thank you

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Because of the SUM statements

`if not s1 then cat_var + 1;`

and

```
s1 + value1;
s2 + value2;
```

both s1 and s2 are automatically retained.

In the first iteration of the data step, s1 will still be missing, which is considered as a *boolean value of false*; at the end of a category, s1 is set to zero, which is also a boolean value of false, so in both these cases, the increment of the cat_var variable is done.

Since a SUM *statement* also works like the SUM *function*, missing values are considered as zero, so the retained variables effectively start with zeroes.

These are the underlying principles that let this code work.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hey Kurt

I appreciate the explanation. It is starting to make sense to me.

I did however not understand:

"Since a SUM *statement* also works like the SUM *function*, missing values are considered as zero, so the retained variables effectively start with zeroes."

- Where do missing values come into play in the syntax in question?

Thank you

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Are you here referencing the first iteration were s1 is missing, but will be considered as 0?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Unless stated explicitly otherwise (a RETAIN statement can also set an initial value), retained variables start as missing values when the data step starts executing. This is what I was referring to.

You can see the operation of a SUM statement with this simple code:

```
data test;
input s;
s1 + s;
datalines;
.
0
1
2
;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Another quick code to illustrate boolean values:

```
data another_test;
input s;
length result $5;
if s then result = "True"; else result = "False";
datalines;
.
0
.a
1
-1
0.0003
;
```

Missing (Including the "special missing" values) or zero is false, anything else is true.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

This is great. Makes a lot more sense now.

Thank you!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hey Kurt

Could you please help me modify the code such that it works in a scenario when the first values of 'value1' and 'value2' are both 0?

If I run the code below, the first observation will be its own category, which it should not be.

data example;

input value1 value2;

datalines;

0 0

2 5

3 5

6 10

2 4

3 5

2 5

;

data want (drop = s:);

set example;

if not s1 then cat_var + 1;

s1 + value1;

s2 + value2;

if s1 ge 5 and s2 ge 10 then do;

s1 = 0; s2 = 0;

end;

Thank you

**Available on demand!**

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.