Hi everyone 🙂
I need your help to solve the following issue:
I need to classify the _ascvd_S variable into 4 categories (described below). But the following code does not work for some values and I can't figure why:
1 - Creation of the dataset:
%let NUM=0;
data _ASCVD_IMP_SDS_&NUM.;
format SUBJID $8. VISIT $16.;
input SUBJID $ VISIT $ vis Current_smoker $ _ascvd_S $;
datalines;
0805-004 Baseline 1 Yes 0.073
0805-004 Week_4 2 Yes 0.068
0101-003 Baseline 1 Yes 0.076
0101-003 Week_14 3 Yes 0.071
0003-007 Baseline 1 No 0.023
0003-007 Week_4 3 No 0.018
0003-007 End_of_treatment 4 No 0.018
0003-007 Week_4 3 No 0.018
0207-006 Baseline 1 No 0.322
0207-006 Week_4 2 No 0.199
1503-003 Baseline 1 No 0.214
1503-003 Week_14 3 No 0.199
0003-003 Baseline 1 Yes 0.257
0003-003 Week_28 5 Yes 0.199
;
run;
2 - derivation of the ascvd_100 variable:
%let NUM=%eval(&NUM.+1);
data _ASCVD_IMP_SDS_&NUM.;
set _ASCVD_IMP_SDS_%eval(&NUM.-1);
attrib ascvd_S label="Source ASCVD score (SDS)";
ascvd_S=_ascvd_S*1;
attrib ascvd_100 label="ASCVD score (pct) (SDS)";
ascvd_100=_ascvd_S*100;
drop _ascvd_S;
run;
3 - derivation of the classification variable (ascvd_100_cl):
%let NUM=%eval(&NUM.+1);
data _ASCVD_IMP_SDS_&NUM.;
set _ASCVD_IMP_SDS_%eval(&NUM.-1);
* ASCVD class;
attrib ascvd_S_cl format=$40. label="10-year risk for ASCVD (SDS)";
if missing(ascvd_S) eq 0 then do;
if ascvd_S<0.05 then ascvd_S_cl=" Low-risk (<0.05)";
else if 0.05<=ascvd_S<=0.074 then ascvd_S_cl=" Borderline risk (0.05 to 0.074)";
else if 0.074<=ascvd_S<=0.199 then ascvd_S_cl=" Intermediate risk (0.075 to 0.199)";
else if 0.2<=ascvd_S then ascvd_S_cl="High risk (>=0.2)";
end;
attrib ascvd_100_cl format=$40. label="10-year risk for ASCVD (SDS)";
if missing(ascvd_100) eq 0 then do;
if ascvd_100<5 then ascvd_100_cl=" Low-risk (<5pct)";
else if 5<=ascvd_100<=7.4 then ascvd_100_cl=" Borderline risk (5 to 7.4pct)";
else if 7.5<=ascvd_100<=19.9 then ascvd_100_cl=" Intermediate risk (7.5 to 19.9pct)";
else if 20<=ascvd_100 then ascvd_100_cl="High risk (>=20pct)";
end;
run;
As you can see, after running those 3 steps, the ascvd_100_cl parameter is not filled for ascvd_100 values = 19.9 whereas parameter ascvd_S_cl (based on ascvd_S) is.
Certainely I'm missing something here, but if anyone can help it would be very much appreciated!
Hi @LucileD and welcome to the SAS Support Communities!
You have encountered a numeric representation issue. Read the log below (created with Windows SAS 9.4M5) to see what happened.
136 data _null_; 137 x=0.199; 138 y=100*x; 139 z=19.9; 140 if y > z then put 'Surprised?'; 141 put (y z)(=hex16./); 142 run; Surprised? y=4033E66666666667 z=4033E66666666666 NOTE: DATA statement used (Total process time):
Whenever non-integer numbers (such as 0.199 or 19.9 in this example) are involved in calculations like addition, multiplication, etc. there is a high (!) risk that rounding errors lead to tiny, yet potentially momentous deviations from the exact results obtained by hand calculation. These rounding errors occur during the computer's calculations in the binary system, which is used internally to represent the decimal numeric values. The main problem is that most of these binary numbers have infinitely many digits and therefore must be rounded.
Your numbers, denoted x and z above, are no exceptions:
x= 0.00110010111100011010100111111011111001110110110010001011010000111001010110000001... z=10011.11100110011001100110011001100110011001100110011001100110011001100110011001100110...
To avoid this type of issue, apply the ROUND function with a suitable (small) rounding unit to the result of the calculation:
ascvd_S=input(_ascvd_S, 32.); ascvd_100=round(100*ascvd_S, 1e-9);
(Note that the character-to-numeric conversion using the INPUT function as shown above is good practice because it avoids the unwanted "NOTE: Character values have been converted to numeric values ..." in the log.)
Without appropriate rounding the result of 100*0.199 is slightly larger than 19.9 (see the last hex digit 7 in y above) and therefore falls into the gap between 19.9 and 20 that you left in your IF-THEN/ELSE conditions, as others have pointed out already. I would close those gaps (and also avoid overlaps) by omitting the duplicate borders in the inequality conditions:
if ascvd_S < 0.05 then ascvd_S_cl=...
else if ascvd_S <= 0.074 then ascvd_S_cl=...
else if ascvd_S <= 0.199 then ascvd_S_cl=...
else ascvd_S_cl=...
Or maybe create a user-defined format that assigns the labels ("Low-risk ...", etc.) to the numbers.
Edit: Depending on the actual specifications, "< 0.075" and "< 0.2" might be more appropriate than "<= 0.074" and "<= 0.199," respectively.
Hi.
In step 3 your classes aren't disjoined. There are gaps and overlaping values. This may be the cause of your problem.
3 - derivation of the classification variable (ascvd_100_cl):%let NUM=%eval(&NUM.+1); data _ASCVD_IMP_SDS_&NUM.; set _ASCVD_IMP_SDS_%eval(&NUM.-1); * ASCVD class; attrib ascvd_S_cl format=$40. label="10-year risk for ASCVD (SDS)"; if missing(ascvd_S) eq 0 then do; if ascvd_S<0.05 then ascvd_S_cl=" Low-risk (<0.05)"; else if 0.05<=ascvd_S<0.075 then ascvd_S_cl=" Borderline risk (0.05 to 0.074)"; else if 0.075<=ascvd_S<0.2 then ascvd_S_cl=" Intermediate risk (0.075 to 0.199)"; else if 0.2<=ascvd_S then ascvd_S_cl="High risk (>=0.2)"; end; attrib ascvd_100_cl format=$40. label="10-year risk for ASCVD (SDS)"; if missing(ascvd_100) eq 0 then do; if ascvd_100<5 then ascvd_100_cl=" Low-risk (<5pct)"; else if 5<=ascvd_100<7.5 then ascvd_100_cl=" Borderline risk (5 to 7.4pct)"; else if 7.5<=ascvd_100<20 then ascvd_100_cl=" Intermediate risk (7.5 to 19.9pct)"; else if 20<=ascvd_100 then ascvd_100_cl="High risk (>=20pct)"; end; run;
Regards.
Axel Renoux
Apart from overlapping ranges, I too couldn't find any problem with the code. Adding below code in the 3rd step fixes (temporary
%let NUM=%eval(&NUM.+1);
data _ASCVD_IMP_SDS_&NUM.;
set _ASCVD_IMP_SDS_%eval(&NUM.-1);
* ASCVD class;
attrib ascvd_S_cl format=$40. label="10-year risk for ASCVD (SDS)";
if missing(ascvd_S) eq 0 then do;
if ascvd_S<0.05 then ascvd_S_cl=" Low-risk (<0.05)";
else if 0.05<=ascvd_S<=0.074 then ascvd_S_cl=" Borderline risk (0.05 to 0.074)";
else if 0.075<=ascvd_S<=0.199 then ascvd_S_cl=" Intermediate risk (0.075 to 0.199)";
else if 0.2<=ascvd_S then ascvd_S_cl="High risk (>=0.2)";
end;
attrib ascvd_100_cl format=$40. label="10-year risk for ASCVD (SDS)";
if missing(ascvd_100) eq 0 then do;
if ascvd_100<5 then ascvd_100_cl=" Low-risk (<5pct)";
else if 5<=ascvd_100<=7.4 then ascvd_100_cl=" Borderline risk (5 to 7.4pct)";
else if 7.5 le ascvd_100 le 19 then ascvd_100_cl ="Intermediate risk (7.5 to 19.9pct)";
else if 19 le ascvd_100 lt 20 then ascvd_100_cl ="Intermediate risk (7.5 to 19.9pct)";
else if 20<=ascvd_100 then ascvd_100_cl="High risk (>=20pct)";
end;
run;
).
Hi @LucileD and welcome to the SAS Support Communities!
You have encountered a numeric representation issue. Read the log below (created with Windows SAS 9.4M5) to see what happened.
136 data _null_; 137 x=0.199; 138 y=100*x; 139 z=19.9; 140 if y > z then put 'Surprised?'; 141 put (y z)(=hex16./); 142 run; Surprised? y=4033E66666666667 z=4033E66666666666 NOTE: DATA statement used (Total process time):
Whenever non-integer numbers (such as 0.199 or 19.9 in this example) are involved in calculations like addition, multiplication, etc. there is a high (!) risk that rounding errors lead to tiny, yet potentially momentous deviations from the exact results obtained by hand calculation. These rounding errors occur during the computer's calculations in the binary system, which is used internally to represent the decimal numeric values. The main problem is that most of these binary numbers have infinitely many digits and therefore must be rounded.
Your numbers, denoted x and z above, are no exceptions:
x= 0.00110010111100011010100111111011111001110110110010001011010000111001010110000001... z=10011.11100110011001100110011001100110011001100110011001100110011001100110011001100110...
To avoid this type of issue, apply the ROUND function with a suitable (small) rounding unit to the result of the calculation:
ascvd_S=input(_ascvd_S, 32.); ascvd_100=round(100*ascvd_S, 1e-9);
(Note that the character-to-numeric conversion using the INPUT function as shown above is good practice because it avoids the unwanted "NOTE: Character values have been converted to numeric values ..." in the log.)
Without appropriate rounding the result of 100*0.199 is slightly larger than 19.9 (see the last hex digit 7 in y above) and therefore falls into the gap between 19.9 and 20 that you left in your IF-THEN/ELSE conditions, as others have pointed out already. I would close those gaps (and also avoid overlaps) by omitting the duplicate borders in the inequality conditions:
if ascvd_S < 0.05 then ascvd_S_cl=...
else if ascvd_S <= 0.074 then ascvd_S_cl=...
else if ascvd_S <= 0.199 then ascvd_S_cl=...
else ascvd_S_cl=...
Or maybe create a user-defined format that assigns the labels ("Low-risk ...", etc.) to the numbers.
Edit: Depending on the actual specifications, "< 0.075" and "< 0.2" might be more appropriate than "<= 0.074" and "<= 0.199," respectively.
When an assignment is using only ONE variable then a common way in SAS to handle this type of use is a CUSTOM Format. Formats can be more flexible than creating variables. Consider the following code example:
data example; format SUBJID $8. VISIT $16.; input SUBJID $ VISIT $ vis Current_smoker $ ascvd_S ; datalines; 0805-004 Baseline 1 Yes 0.073 0805-004 Week_4 2 Yes 0.068 0101-003 Baseline 1 Yes 0.076 0101-003 Week_14 3 Yes 0.071 0003-007 Baseline 1 No 0.023 0003-007 Week_4 3 No 0.018 0003-007 End_of_treatment 4 No 0.018 0003-007 Week_4 3 No 0.018 0207-006 Baseline 1 No 0.322 0207-006 Week_4 2 No 0.199 1503-003 Baseline 1 No 0.214 1503-003 Week_14 3 No 0.199 0003-003 Baseline 1 Yes 0.257 0003-003 Week_28 5 Yes 0.199 ; run; proc format; value ascvd_s . =" " low -< 0.05 =" Low-risk" 0.05 -<0.074 =" Borderline risk" 0.074-<0.199 =" Intermediate risk" 0.199-High ="High risk" ; run; /*display existing values with the format*/ Proc print data=example noobs; format ascvd_S ascvd_S.; run; /* use the formatted values in analysis*/ proc freq data=example; tables subjid*ascvd_s; format ascvd_S ascvd_S.; run; /* graph using formatted value*/ proc sgplot data=example; vbar ascvd_s / group=subjid; format ascvd_S ascvd_S.; run;
If someone decides that the boundary for "low risk" should actually be at 0.055 instead of 0.05 the ONLY change needed to make the display, frequency table or graph comply would be to change the 0.05 in the Format code to 0.055 in two places. Or could change multiple boundaries. But I would only have to change the format code. Or create a different format and use it as desired to create the groups used in the Proc Freq or Sgplot code. Most formats are acceptable for report, analysis or graphing procedures and the exceptions usually have to do with custom date formats.
I have a library of formats for age that assign 5 or 10 year intervals, specific age breakouts like <18, 18-44,45+ or <65 and 65+. I use the needed format with a single Age variable for reports and graphs.
BTW your example data has a character variable _ascvd_s not the numeric variable ascvd_s as used in your assignment logic. My example data uses the numeric values as range comparisons work poorly with character variables.
Dear all,
Thanks for all your answers!!!
To FreelanceReinhard: thanks for the explanation, it is exactly what I was looking for!
To Renoux and kleelasiva9: As you can see, all values are written with only 2 decimals so the classes are ok (and thanks to this I was able to spot the issue with 19.9!! 🙂 )
To Renoux: Yes, I could have put all the steps in one dataset, but I really don't like to program "data toto; set toto;" 🙂
To ballardw: Yes, I could have created a format, but I prefer to have both variables (source + derived (formatted or not)), it helps me check my programming :). Moreover, as there is an issue with 19.9 not beeing included in the "<=19.9" condition, the format you proposed would have classified those values as "High Risk" 🙂
Again, thank you all and have a nice day!
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.