🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Quartz | Level 8

## Categorize observations based on cumulative sums for two variables

Dear SAS experts

I would like to create a categorical variable that include all observations that sum up to certain values based on two numerical variables.

Given this simple dataset:

data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;
run;

I want to categorize the observations (as they are ordered here) based on running sums. When the running sum reaches >=5 for value1 AND >=10 for value2 a category should be created. Then, the process is repeated again. The resulting dataset should look like:

value1 value2 cat_var

2         5         1
3         5         1
6         10       2
2         4         3
3         5         3
2         5         3

For category 1 it takes two observations before the conditions described above are met. For category 2 the condition is already met in the 3rd observation and therefore only one observation is included in category 2. For category 3, while the condition is met for value1 already in the 5th observation for value2 it is met on the 6th observation. Therefore 3 observations are included in category 3.

Does anyone have a suggestion on which syntax I can use to create such a categorical variable?

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
Tourmaline | Level 20

## Re: Categorize observations based on cumulative sums for two variables

Try this

``````data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;

data want(drop = s:);
set example;

if not s1 then cat_var + 1;

s1 + value1;
s2 + value2;

if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;

run;``````

Result:

```value1  value2  cat_var
2       5       1
3       5       1
6       10      2
2       4       3
3       5       3
2       5       3```
9 REPLIES 9
Tourmaline | Level 20

## Re: Categorize observations based on cumulative sums for two variables

Try this

``````data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;

data want(drop = s:);
set example;

if not s1 then cat_var + 1;

s1 + value1;
s2 + value2;

if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;

run;``````

Result:

```value1  value2  cat_var
2       5       1
3       5       1
6       10      2
2       4       3
3       5       3
2       5       3```
Quartz | Level 8

## Re: Categorize observations based on cumulative sums for two variables

Dear PeterClemmensen

It appears to work in this small example and I will try it out in my large dataset which 'example' is meant to reflect.

Could you please briefly walk me through your code? I tried to understand (few comments/questions below) it but I do not understand all of it.

data want(drop = s:); /* Drop all variables which start with an s */
set example;

if not s1 then cat_var + 1; /* cat_var is created and increases by 1 when the condition to the left is met. What is meant by "if not s1"? */

s1 + value1; /* In (new) variables s1 and s2 cum sum of value1 and value2 are computed */
s2 + value2;

if s1 ge 5 and s2 ge 10 then do; /* It must have something to do with the categorization */
s1 = 0; s2 = 0;
end;

run;

Thank you

Super User

## Re: Categorize observations based on cumulative sums for two variables

Because of the SUM statements

``if not s1 then cat_var + 1;``

and

``````s1 + value1;
s2 + value2;``````

both s1 and s2 are automatically retained.

In the first iteration of the data step, s1 will still be missing, which is considered as a boolean value of false; at the end of a category, s1 is set to zero, which is also a boolean value of false, so in both these cases, the increment of the cat_var variable is done.

Since a SUM statement also works like the SUM function, missing values are considered as zero, so the retained variables effectively start with zeroes.

These are the underlying principles that let this code work.

Quartz | Level 8

## Re: Categorize observations based on cumulative sums for two variables

Hey Kurt

I appreciate the explanation. It is starting to make sense to me.

I did however not understand:

"Since a SUM statement also works like the SUM function, missing values are considered as zero, so the retained variables effectively start with zeroes."

- Where do missing values come into play in the syntax in question?

Thank you

Quartz | Level 8

## Re: Categorize observations based on cumulative sums for two variables

Are you here referencing the first iteration were s1 is missing, but will be considered as 0?

Super User

## Re: Categorize observations based on cumulative sums for two variables

Unless stated explicitly otherwise (a RETAIN statement can also set an initial value), retained variables start as missing values when the data step starts executing. This is what I was referring to.

You can see the operation of a SUM statement with this simple code:

``````data test;
input s;
s1 + s;
datalines;
.
0
1
2
;``````
Super User

## Re: Categorize observations based on cumulative sums for two variables

Another quick code to illustrate boolean values:

``````data another_test;
input s;
length result \$5;
if s then result = "True"; else result = "False";
datalines;
.
0
.a
1
-1
0.0003
;``````

Missing (Including the "special missing" values) or zero is false, anything else is true.

Quartz | Level 8

## Re: Categorize observations based on cumulative sums for two variables

This is great. Makes a lot more sense now.

Thank you!

Quartz | Level 8

## Re: Categorize observations based on cumulative sums for two variables

Hey Kurt

Could you please help me modify the code such that it works in a scenario when the first values of 'value1' and 'value2' are both 0?

If I run the code below, the first observation will be its own category, which it should not be.

data example;
input value1 value2;
datalines;
0 0
2 5
3 5
6 10
2 4
3 5
2 5
;

data want (drop = s:);
set example;

if not s1 then cat_var + 1;

s1 + value1;
s2 + value2;

if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;

Thank you

Discussion stats
• 9 replies
• 627 views
• 4 likes
• 3 in conversation