Hello! I am looking for help creating a 0/1 categorical variable, where 0 = negative for the specified condition and 1 = positive for the specified condition. The statement I am using (pasted below) pulls in a bunch of different variables that must meet certain criteria to be slated into the "1" category. I want to verify that 1) what I am doing is correct, and my statements aren't overwriting any previous statements, and 2) see if there is a better way to do this, perhaps using an array of some kind (?) Any advice is welcome, and I would be happy to provide further explanations if necessary. Thank you!
*/ Number of persons identified as having experienced FM3strong = 1 plus VALUE = 1 or LABOR = 1;
OR number of persons answering yes to 2 or more of the following: R1strong, EP1strong, PL1strong, DC3strong, FM1strong or FM2strong, V3strong or V4strong, plus VALUE = 1 or LABOR = 1;
OR number of persons answering yes to at least 1 of the following: R1strong, EP1strong, PL1strong, DC3strong, FM1strong, FM2strong, V3strong, V4strong PLUS answering yes to 3 or more of the following: EP5med, EP6med, EP7med, FM4med, V8med, plus VALUE = 1 or LABOR = 1.
*/
TIPlifenum =.;
if FM3strong = 1 and VALUE = 1 then TIPlifenum = 1;
else if FM3strong = 1 and LABOR = 1 then TIPlifenum = 1;
else if sum(R1strong, EP1strong, PL1strong, DC3strong, FM1strong, V3strong) ge 2 or
sum (R1strong, EP1strong, PL1strong, DC3strong, FM1strong, V4strong) ge 2 or
sum(R1strong, EP1strong, PL1strong, DC3strong, FM2strong, V3strong) ge 2 or
sum(R1strong, EP1strong, PL1strong, DC3strong, FM2strong, V4strong) ge 2
and VALUE = 1 then TIPlifenum = 1;
else if sum(R1strong, EP1strong, PL1strong, DC3strong, FM1strong, V3strong) ge 2 or
sum (R1strong, EP1strong, PL1strong, DC3strong, FM1strong, V4strong) ge 2 or
sum(R1strong, EP1strong, PL1strong, DC3strong, FM2strong, V3strong) ge 2 or
sum(R1strong, EP1strong, PL1strong, DC3strong, FM2strong, V4strong) ge 2
and LABOR = 1 then TIPlifenum = 1;
else if sum (R1strong, EP1strong, PL1strong, FM1strong, FM2strong, V3strong, V4strong) ge 1 and sum(EP5med, EP6med, EP7med, FM4med, V8med) ge 3
and VALUE = 1 then TIPlifenum = 1;
else if sum (R1strong, EP1strong, PL1strong, FM1strong, FM2strong, V3strong, V4strong) ge 1 and sum(EP5med, EP6med, EP7med, FM4med, V8med) ge 3
and LABOR = 1 then TIPlifenum = 1;
else TIPlifenum = 0;
Hello @ab81,
In addition to the missing parentheses around the Boolean expressions combined with OR (as pointed out by @PhilC) you missed variable DC3strong in two of the last four sums. It seems that @PhilC has copied this mistake into his code.
@ab81 wrote:
(...) see if there is a better way to do this, perhaps using an array of some kind (?)
Yes, you can use arrays to abbreviate repetitive lists of variables:
data want;
set have;
array _a[*] R1strong EP1strong PL1strong DC3strong;
array _b[*] FM1strong FM2strong;
array _c[*] V3strong V4strong;
TIPlifenum=
(FM3strong = 1
| sum(0, of _a[*], max(0, of _b[*]), max(0, of _c[*]))>=2
| sum(0, of _a[*], of _b[*], of _c[*])>=1 & sum(0, EP5med, EP6med, EP7med, FM4med, V8med)>=3)
& (VALUE = 1 | LABOR = 1);
run;
This assumes 0-1 coded variables, but also allows for missing values. If you are sure that no missing values occur, you can delete the zeros in the SUM and MAX function arguments. Their only purpose is to avoid notes "Missing value were generated ..." in the log. They do not change the results.
In general, missing values can be harmful. For example, the expression
sum_strong0 + DC3strong
(contained in @PhilC's code) would be missing if sum_strong0 was missing and thus lead to ignoring a valid value DC3strong=1.
I used this input dataset HAVE and then PROC COMPARE on the "WANT" datasets to compare the suggested solutions:
data have;
do FM3strong = .,0,1;
do VALUE = .,0,1;
do LABOR = .,0,1;
do R1strong = .,0,1;
do EP1strong = .,0,1;
do PL1strong = .,0,1;
do DC3strong = .,0,1;
do FM1strong = .,0,1;
do FM2strong = .,0,1;
do V3strong = .,0,1;
do V4strong = .,0,1;
do EP5med = .,0,1;
do EP6med = .,0,1;
do EP7med = .,0,1;
do FM4med = .,0,1;
do V8med = .,0,1;
output;
end; end; end; end; end; end; end; end; end; end; end; end; end; end; end; end;
run; /* 3**16=43046721 obs. */
I see a way to simplify this. (edit: I'm assuming that the "or'd sums" have parentheses.) This reminds me of my college Discrete Mathematics class, my Algebra classes, and both Boolean algebra and the Distributive property. I'll leave it to you or someone else to prove:
TIPlifenum = (FM3strong OR sum_strong OR sum_strong2 ) and (VALUE OR LABOR ) ;
where:
sum_strong0= sum(R1strong, EP1strong, PL1strong);
sum_strong=
sum (
sum_strong0 + DC3strong,
max ( sum(FM1strong, V3strong),
sum(FM1strong, V4strong),
sum(FM2strong, V3strong),
sum(FM2strong, V4strong) )) ge 2;
sum_strong2=
sum(sum_strong0, FM1strong, FM2strong, V3strong, V4strong) ge 1 and
sum(EP5med, EP6med, EP7med, FM4med, V8med) ge 3;
Question: you are not running this algorithm on super large datasets are you? Or IOW, you are not optimizing this code for purposes of increased speed?
One more thing, Order of Operations, you'll have to take care are using these correctly. AND is higher on the order of operations so the AND here happens first
else if sum(R1strong, EP1strong, PL1strong, DC3strong, FM1strong, V3strong) ge 2 or
sum (R1strong, EP1strong, PL1strong, DC3strong, FM1strong, V4strong) ge 2 or
sum(R1strong, EP1strong, PL1strong, DC3strong, FM2strong, V3strong) ge 2 or
sum(R1strong, EP1strong, PL1strong, DC3strong, FM2strong, V4strong) ge 2
AND VALUE = 1 then TIPlifenum = 1;
I think you might have intended for the ORs to happen before the AND. If so, you'll need parentheses.
else if sum(R1strong, EP1strong, PL1strong, DC3strong, FM1strong, V3strong) ge 2 OR
sum (R1strong, EP1strong, PL1strong, DC3strong, FM1strong, V4strong) ge 2 OR
sum(R1strong, EP1strong, PL1strong, DC3strong, FM2strong, V3strong) ge 2 OR
sum(R1strong, EP1strong, PL1strong, DC3strong, FM2strong, V4strong) ge 2
and VALUE = 1 then TIPlifenum = 1;
What type of values are in that plethora of variables? Are they all 0/1 coded ?
If so, you might consider evaluating some of the repeated things such as
sum (R1strong, EP1strong, PL1strong, DC3strong). If that is ge 2 then the values of the other variables don't matter.
With 1/0 coded variables then MAX(of list)=1 tells you that ate least one of the variables is 1.
Hello @ab81,
In addition to the missing parentheses around the Boolean expressions combined with OR (as pointed out by @PhilC) you missed variable DC3strong in two of the last four sums. It seems that @PhilC has copied this mistake into his code.
@ab81 wrote:
(...) see if there is a better way to do this, perhaps using an array of some kind (?)
Yes, you can use arrays to abbreviate repetitive lists of variables:
data want;
set have;
array _a[*] R1strong EP1strong PL1strong DC3strong;
array _b[*] FM1strong FM2strong;
array _c[*] V3strong V4strong;
TIPlifenum=
(FM3strong = 1
| sum(0, of _a[*], max(0, of _b[*]), max(0, of _c[*]))>=2
| sum(0, of _a[*], of _b[*], of _c[*])>=1 & sum(0, EP5med, EP6med, EP7med, FM4med, V8med)>=3)
& (VALUE = 1 | LABOR = 1);
run;
This assumes 0-1 coded variables, but also allows for missing values. If you are sure that no missing values occur, you can delete the zeros in the SUM and MAX function arguments. Their only purpose is to avoid notes "Missing value were generated ..." in the log. They do not change the results.
In general, missing values can be harmful. For example, the expression
sum_strong0 + DC3strong
(contained in @PhilC's code) would be missing if sum_strong0 was missing and thus lead to ignoring a valid value DC3strong=1.
I used this input dataset HAVE and then PROC COMPARE on the "WANT" datasets to compare the suggested solutions:
data have;
do FM3strong = .,0,1;
do VALUE = .,0,1;
do LABOR = .,0,1;
do R1strong = .,0,1;
do EP1strong = .,0,1;
do PL1strong = .,0,1;
do DC3strong = .,0,1;
do FM1strong = .,0,1;
do FM2strong = .,0,1;
do V3strong = .,0,1;
do V4strong = .,0,1;
do EP5med = .,0,1;
do EP6med = .,0,1;
do EP7med = .,0,1;
do FM4med = .,0,1;
do V8med = .,0,1;
output;
end; end; end; end; end; end; end; end; end; end; end; end; end; end; end; end;
run; /* 3**16=43046721 obs. */
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.