BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
mgrasmussen
Quartz | Level 8

Dear SAS experts

 

I would like to create a categorical variable that include all observations that sum up to certain values based on two numerical variables.

 

Given this simple dataset:

 

data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;
run;

 

I want to categorize the observations (as they are ordered here) based on running sums. When the running sum reaches >=5 for value1 AND >=10 for value2 a category should be created. Then, the process is repeated again. The resulting dataset should look like:

 

value1 value2 cat_var

2         5         1
3         5         1
6         10       2
2         4         3
3         5         3
2         5         3

 

For category 1 it takes two observations before the conditions described above are met. For category 2 the condition is already met in the 3rd observation and therefore only one observation is included in category 2. For category 3, while the condition is met for value1 already in the 5th observation for value2 it is met on the 6th observation. Therefore 3 observations are included in category 3.

 

Does anyone have a suggestion on which syntax I can use to create such a categorical variable?

 

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Try this

 

data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;

data want(drop = s:);
   set example;
   
   if not s1 then cat_var + 1;
   
   s1 + value1;
   s2 + value2;
   
   if s1 ge 5 and s2 ge 10 then do;
      s1 = 0; s2 = 0;
   end;
   
run;

 

Result:

 

value1  value2  cat_var
2       5       1
3       5       1
6       10      2
2       4       3
3       5       3
2       5       3

View solution in original post

9 REPLIES 9
PeterClemmensen
Tourmaline | Level 20

Try this

 

data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;

data want(drop = s:);
   set example;
   
   if not s1 then cat_var + 1;
   
   s1 + value1;
   s2 + value2;
   
   if s1 ge 5 and s2 ge 10 then do;
      s1 = 0; s2 = 0;
   end;
   
run;

 

Result:

 

value1  value2  cat_var
2       5       1
3       5       1
6       10      2
2       4       3
3       5       3
2       5       3
mgrasmussen
Quartz | Level 8

Dear PeterClemmensen

 

It appears to work in this small example and I will try it out in my large dataset which 'example' is meant to reflect.

 

Could you please briefly walk me through your code? I tried to understand (few comments/questions below) it but I do not understand all of it.

 

data want(drop = s:); /* Drop all variables which start with an s */
set example;

if not s1 then cat_var + 1; /* cat_var is created and increases by 1 when the condition to the left is met. What is meant by "if not s1"? */

s1 + value1; /* In (new) variables s1 and s2 cum sum of value1 and value2 are computed */
s2 + value2;

if s1 ge 5 and s2 ge 10 then do; /* It must have something to do with the categorization */
s1 = 0; s2 = 0;
end;

run;

 

Thank you

Kurt_Bremser
Super User

Because of the SUM statements

if not s1 then cat_var + 1;

and

s1 + value1;
s2 + value2;

both s1 and s2 are automatically retained.

In the first iteration of the data step, s1 will still be missing, which is considered as a boolean value of false; at the end of a category, s1 is set to zero, which is also a boolean value of false, so in both these cases, the increment of the cat_var variable is done.

Since a SUM statement also works like the SUM function, missing values are considered as zero, so the retained variables effectively start with zeroes.

These are the underlying principles that let this code work.

mgrasmussen
Quartz | Level 8

Hey Kurt

 

I appreciate the explanation. It is starting to make sense to me.

 

I did however not understand:

 

"Since a SUM statement also works like the SUM function, missing values are considered as zero, so the retained variables effectively start with zeroes."

- Where do missing values come into play in the syntax in question?

 

Thank you

mgrasmussen
Quartz | Level 8

Are you here referencing the first iteration were s1 is missing, but will be considered as 0?

Kurt_Bremser
Super User

Unless stated explicitly otherwise (a RETAIN statement can also set an initial value), retained variables start as missing values when the data step starts executing. This is what I was referring to.

You can see the operation of a SUM statement with this simple code:

data test;
input s;
s1 + s;
datalines;
.
0
1
2
;
Kurt_Bremser
Super User

Another quick code to illustrate boolean values:

data another_test;
input s;
length result $5;
if s then result = "True"; else result = "False";
datalines;
.
0
.a
1
-1
0.0003
;

Missing (Including the "special missing" values) or zero is false, anything else is true.

mgrasmussen
Quartz | Level 8

This is great. Makes a lot more sense now.

 

Thank you!

mgrasmussen
Quartz | Level 8

Hey Kurt

 

Could you please help me modify the code such that it works in a scenario when the first values of 'value1' and 'value2' are both 0? 

 

If I run the code below, the first observation will be its own category, which it should not be.

 

data example;
input value1 value2;
datalines;
0 0
2 5
3 5
6 10
2 4
3 5
2 5
;

data want (drop = s:);
set example;

if not s1 then cat_var + 1;

s1 + value1;
s2 + value2;

if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;

 

Thank you

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 631 views
  • 4 likes
  • 3 in conversation