The fact that I even have to post this tells u what kind of day I'm having 😞 If the HLT_ALZDEM variable doesn't have a valid value of 1 or 2 (yes/no), then the variables that are constructed based on HLT_ALZDEM should be given a value of missing. However, they sometimes are given a value of 0 instead.
Below is a portion of the output file, sorted for rows where HLT_ALZDEM is missing. Since missing is clearly not 1 or 2, all ADRD_group and non_ADRD_group values should also be missing. But they're not. Obs 9 and 10 are identical except for these 2 variables., when Obs 9 is missing and Obs 10 is 0 for both.
The only possible explanation I can think of is that there are non-printable characters for some values of HLT_ALZDEM. But even if there are, those non-printable characters are not 1 or 2. Frequencies for HLT_ALZDEM only show values of 1, 2, and missing.
IF cohort_flag = 1 AND ADM_MA_FLAG_YR = 3
THEN MA_group = 1; ELSE MA_group = 0;
IF cohort_flag = 0 THEN MA_group = .;
IF cohort_flag = 1 AND ADM_FFS_FLAG_YR = 3
THEN TM_group = 1; ELSE TM_group = 0;
IF cohort_flag = 0 THEN TM_group = .;
IF cohort_flag = 1 AND HLT_ALZDEM = 1
THEN ADRD_group = 1; ELSE ADRD_group = 0;
IF cohort_flag = 0 THEN ADRD_group = .;
IF HLT_ALZDEM ^in (1,2) THEN ADRD_group = .;
IF cohort_flag = 1 AND HLT_ALZDEM = 2
THEN non_ADRD_group = 1; ELSE non_ADRD_group = 0;
IF cohort_flag = 0 THEN non_ADRD_group = .;
IF HLT_ALZDEM ^in (1,2) THEN non_ADRD_group = .;
Obs | HLT_ALZDEM | cohort_flag | MA_group | TM_group | ADRD_group | non_ADRD_group |
1 |
0 | |||||
2 | 1 | 0 | 1 | 0 | 0 | |
3 | 1 | 1 | 0 | |||
4 | 0 | |||||
5 | 0 | |||||
6 | 1 | 0 | 1 | |||
7 | 1 | 1 | 0 | |||
8 | 1 | 1 | 0 | |||
9 | 1 | 1 | 0 | |||
10 | 1 | 1 | 0 | 0 | 0 | |
11 | 1 | 1 | 0 | |||
12 | 1 | 0 | 1 | |||
13 | 1 | 0 | 1 | |||
14 | 1 | 0 | 1 | |||
15 | 1 | 0 | 1 | |||
16 | 1 | 0 | 1 | 0 | 0 |
Embarrassing to say, but the solution was in the log the entire time. There was another set of IF-THEN statements for other variables in the same data step. I forgot to add "then" to one of those statements. That caused an error, so the old version of the file (which was based on syntax that did have a logic error) was not overwritten. I fixed the logic error in the syntax, but I kept looking at frequencies based on the old version of the file. I assumed there must still be some logic error, so I kept going down that path instead of checking the basics (like looking for errors in the log!). Psychologists call that "function fixedness". Friday was not a good day.
Using the exact data and code you shared I'm getting a different result. Can you please verify and share sample data that matches your code.
You haven't shared SAS log. Are there any notes in it about character to numeric conversion?
SAS numerical variables can't store non-printable characters. They could store special missings but none of your tests checks for missings where this could hit you.
And just as a thought: IF there is a custom format applied to your variables that prints certain values as missing then you could be misled. To ensure this is not the case add the following line to your data step.
format HLT_ALZDEM cohort_flag best32.;
Here your exact data and code to demonstrate that the outcome is different than what you describe.
data have;
infile datalines dsd dlm=',' truncover;
input Obs HLT_ALZDEM cohort_flag src_MA_group src_TM_group src_ADRD_group src_non_ADRD_group;
datalines;
1,,0,,,,
2,,1,0,1,0,0
3,,1,1,0,,
4,,0,,,,
5,,0,,,,
6,,1,0,1,,
7,,1,1,0,,
8,,1,1,0,,
9,,1,1,0,,
10,,1,1,0,0,0
11,,1,1,0,,
12,,1,0,1,,
13,,1,0,1,,
14,,1,0,1,,
15,,1,0,1,,
16,,1,0,1,0,0
;
run;
data want;
length Obs HLT_ALZDEM cohort_flag
src_MA_group MA_group src_TM_group TM_group
src_ADRD_group ADRD_group src_non_ADRD_group non_ADRD_group
8;
set have;
IF cohort_flag = 1 AND ADM_MA_FLAG_YR = 3
THEN MA_group = 1; ELSE MA_group = 0;
IF cohort_flag = 0 THEN MA_group = .;
IF cohort_flag = 1 AND ADM_FFS_FLAG_YR = 3
THEN TM_group = 1; ELSE TM_group = 0;
IF cohort_flag = 0 THEN TM_group = .;
IF cohort_flag = 1 AND HLT_ALZDEM = 1
THEN ADRD_group = 1; ELSE ADRD_group = 0;
IF cohort_flag = 0 THEN ADRD_group = .;
IF HLT_ALZDEM ^in (1,2) THEN ADRD_group = .;
IF cohort_flag = 1 AND HLT_ALZDEM = 2
THEN non_ADRD_group = 1; ELSE non_ADRD_group = 0;
IF cohort_flag = 0 THEN non_ADRD_group = .;
IF HLT_ALZDEM ^in (1,2) THEN non_ADRD_group = .;
run;
proc print data=want;
run;
And below not changing your logic but ordering the conditions a bit differently so they become easier to read and maintain:
data want;
length Obs HLT_ALZDEM cohort_flag
src_MA_group MA_group src_TM_group TM_group
src_ADRD_group ADRD_group src_non_ADRD_group non_ADRD_group
8;
set have;
IF cohort_flag = 0 THEN MA_group = .;
else IF cohort_flag = 1 AND ADM_MA_FLAG_YR = 3 THEN MA_group = 1;
ELSE MA_group = 0;
IF cohort_flag = 0 THEN TM_group = .;
else IF cohort_flag = 1 AND ADM_FFS_FLAG_YR = 3 THEN TM_group = 1;
ELSE TM_group = 0;
IF HLT_ALZDEM ^in (1,2) THEN ADRD_group = .;
else IF cohort_flag = 0 THEN ADRD_group = .;
else IF cohort_flag = 1 AND HLT_ALZDEM = 1 THEN ADRD_group = 1;
ELSE ADRD_group = 0;
IF HLT_ALZDEM ^in (1,2) THEN non_ADRD_group = .;
else IF cohort_flag = 0 THEN non_ADRD_group = .;
else IF cohort_flag = 1 AND HLT_ALZDEM = 2 THEN non_ADRD_group = 1;
ELSE non_ADRD_group = 0;
run;
Embarrassing to say, but the solution was in the log the entire time. There was another set of IF-THEN statements for other variables in the same data step. I forgot to add "then" to one of those statements. That caused an error, so the old version of the file (which was based on syntax that did have a logic error) was not overwritten. I fixed the logic error in the syntax, but I kept looking at frequencies based on the old version of the file. I assumed there must still be some logic error, so I kept going down that path instead of checking the basics (like looking for errors in the log!). Psychologists call that "function fixedness". Friday was not a good day.
Don't be shy. Mark your explanation as the solution, so we all know the problem has been solved.
The usual place to begin would be to examine the data. In particular take a look at COHORT_FLAG and see if it contains unusual values. They don't have to be unprintable characters. They could be values of 2 or 3 instead of what you expect. So just run a PROC FREQ to be sure.
Consider how you set up your logic:
IF cohort_flag = 1 AND HLT_ALZDEM = 1
THEN ADRD_group = 1; ELSE ADRD_group = 0;
IF cohort_flag = 0 THEN ADRD_group = .;
IF HLT_ALZDEM ^in (1,2) THEN ADRD_group = .;
This logic could easily give you a value of 0 for ADRD_group when cohort_flag is 3. Remember, the ELSE statement applies to either of these conditions:
That could be a logic problem resolved by checking the data. In fact, a general approach to solving this would be to take some observations that you believe are coming out incorrectly, and examine the incoming data values for that observation.
What I like to do in such cases is to create a list or cross-tabulation of all possible combinations, where the "possibilities" only go as far as the IF/ELSE conditions distinguish them. According to your code, the four new variables are derived independently of each other, each depending on just two existing variables, so cross-tabulations are feasible.
Cross-tabulations for MA_group (analogous for TM_group) and ADRD_group (similar for non_ADRD_group):
MA_group | ADM_MA_FLAG_YR cohort_flag | 3 | ne 3 ------------+-------+-------- 0 | . | . ------------+-------+-------- 1 | 1 | 0 ------------+-------+-------- ^in (0,1) | 0 | 0 -----------------------------
ADRD_group | HLT_ALZDEM cohort_flag | 1 | 2 | ^in (1,2) ------------+-------+-------+----------- 0 | . | . | . ------------+-------+-------+----------- 1 | 1 | 0 | . ------------+-------+-------+----------- ^in (0,1) | 0 | 0 | . ----------------------------------------
I derived these cross-tabulations from your code. You would better use the specifications as a basis for them and then start writing the code according to the cross-tabulations.
If the above cross-tabulations and those for TM_group (with ADM_FFS_FLAG_YR replacing ADM_MA_FLAG_YR) and non_ADRD_group (with "1 | 0" switched to "0 | 1") are in accordance with the specifications, the code could be simplified to:
data want(drop=_c);
set have;
if cohort_flag ne 0 then do;
_c=(cohort_flag=1);
MA_group=(_c & ADM_MA_FLAG_YR =3);
TM_group=(_c & ADM_FFS_FLAG_YR=3);
if HLT_ALZDEM in (1,2) then do;
ADRD_group=(_c & HLT_ALZDEM=1);
non_ADRD_group=(_c & HLT_ALZDEM=2);
end;
end;
run;
Important: This assumes that the new variables are not contained in dataset HAVE. Otherwise, they must be dropped from HAVE with a DROP= dataset option (or initialized to missing after the SET statement). Also, all variables involved are assumed to be numeric.
The derived variables are either set to 1 (for TRUE) or 0 (for FALSE) resulting from a Boolean expression such as _c & HLT_ALZDEM=1 (i.e., cohort_flag=1 & HLT_ALZDEM=1) or they are left missing if an IF condition prevents the execution of the assignment statement.
If you are in doubt about the content of a numeric variable, use the HEX16. format.
Example log:
378 data _null_; 379 have=9.9/3.3; 380 want=3; 381 put (have want)(=hex16./); 382 run; have=4008000000000001 want=4008000000000000
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.