Dear SAS experts
I would like to create a categorical variable that include all observations that sum up to certain values based on two numerical variables.
Given this simple dataset:
data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;
run;
I want to categorize the observations (as they are ordered here) based on running sums. When the running sum reaches >=5 for value1 AND >=10 for value2 a category should be created. Then, the process is repeated again. The resulting dataset should look like:
value1 value2 cat_var
2 5 1
3 5 1
6 10 2
2 4 3
3 5 3
2 5 3
For category 1 it takes two observations before the conditions described above are met. For category 2 the condition is already met in the 3rd observation and therefore only one observation is included in category 2. For category 3, while the condition is met for value1 already in the 5th observation for value2 it is met on the 6th observation. Therefore 3 observations are included in category 3.
Does anyone have a suggestion on which syntax I can use to create such a categorical variable?
Thank you
Try this
data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;
data want(drop = s:);
set example;
if not s1 then cat_var + 1;
s1 + value1;
s2 + value2;
if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;
run;
Result:
value1 value2 cat_var 2 5 1 3 5 1 6 10 2 2 4 3 3 5 3 2 5 3
Try this
data example;
input value1 value2;
datalines;
2 5
3 5
6 10
2 4
3 5
2 5
;
data want(drop = s:);
set example;
if not s1 then cat_var + 1;
s1 + value1;
s2 + value2;
if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;
run;
Result:
value1 value2 cat_var 2 5 1 3 5 1 6 10 2 2 4 3 3 5 3 2 5 3
Dear PeterClemmensen
It appears to work in this small example and I will try it out in my large dataset which 'example' is meant to reflect.
Could you please briefly walk me through your code? I tried to understand (few comments/questions below) it but I do not understand all of it.
data want(drop = s:); /* Drop all variables which start with an s */
set example;
if not s1 then cat_var + 1; /* cat_var is created and increases by 1 when the condition to the left is met. What is meant by "if not s1"? */
s1 + value1; /* In (new) variables s1 and s2 cum sum of value1 and value2 are computed */
s2 + value2;
if s1 ge 5 and s2 ge 10 then do; /* It must have something to do with the categorization */
s1 = 0; s2 = 0;
end;
run;
Thank you
Because of the SUM statements
if not s1 then cat_var + 1;
and
s1 + value1;
s2 + value2;
both s1 and s2 are automatically retained.
In the first iteration of the data step, s1 will still be missing, which is considered as a boolean value of false; at the end of a category, s1 is set to zero, which is also a boolean value of false, so in both these cases, the increment of the cat_var variable is done.
Since a SUM statement also works like the SUM function, missing values are considered as zero, so the retained variables effectively start with zeroes.
These are the underlying principles that let this code work.
Hey Kurt
I appreciate the explanation. It is starting to make sense to me.
I did however not understand:
"Since a SUM statement also works like the SUM function, missing values are considered as zero, so the retained variables effectively start with zeroes."
- Where do missing values come into play in the syntax in question?
Thank you
Are you here referencing the first iteration were s1 is missing, but will be considered as 0?
Unless stated explicitly otherwise (a RETAIN statement can also set an initial value), retained variables start as missing values when the data step starts executing. This is what I was referring to.
You can see the operation of a SUM statement with this simple code:
data test;
input s;
s1 + s;
datalines;
.
0
1
2
;
Another quick code to illustrate boolean values:
data another_test;
input s;
length result $5;
if s then result = "True"; else result = "False";
datalines;
.
0
.a
1
-1
0.0003
;
Missing (Including the "special missing" values) or zero is false, anything else is true.
This is great. Makes a lot more sense now.
Thank you!
Hey Kurt
Could you please help me modify the code such that it works in a scenario when the first values of 'value1' and 'value2' are both 0?
If I run the code below, the first observation will be its own category, which it should not be.
data example;
input value1 value2;
datalines;
0 0
2 5
3 5
6 10
2 4
3 5
2 5
;
data want (drop = s:);
set example;
if not s1 then cat_var + 1;
s1 + value1;
s2 + value2;
if s1 ge 5 and s2 ge 10 then do;
s1 = 0; s2 = 0;
end;
Thank you
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.