I am practicing for the base certification and having trouble understanding the ask in Step 3. The answer to this has a if group='A' then kilograms=79 else kilograms=89; I don't understand where or why these values are used. Any help would be appreciated.
The Question Asked:
Step 1:
data work.cleandata36; set cert.input36; group=upcase(group); if upcase(group) in ('A','B'); run; Step 2:
proc means data=work.cleandata36 median; class group; var kilograms; run; Step 3:
data results.output36; set cleandata36; if Kilograms < 40 OR Kilograms > 200 then do; if group='A' then kilograms=79; else kilograms=89; end; run;
Elaborating on my earlier answer, here's a program that does I believe everything that they're asking for. It uses the calculated means in the final step where values out of range are replaced with the median values. My data has a median of 79 for Group A and 89 for Group B.
As a suggestion: You might try running my program and playing with it a bit in order to understand it better.
See also the comments in the code, such as they are.
Jim
LIBNAME Cert 'X:\sastemp';
LIBNAME Results 'X:\sastemp';
** Step 0 - Load data **;
data Cert.INPUT36;
LENGTH Group $1;
infile datalines4 dsd dlm='09'x;
input Group $
Kilograms
;
DATALINES4;
A 24
A 54
A 60
A 44
A 84
A 42
A 52
A 72
A 74
A 36
A 38
a 104
a 190
a 180
a 120
a 100
a 128
a 130
a 132
a 134
b 24
b 44
b 60
b 54
b 94
b 52
b 62
b 92
b 84
b 86
b 88
b 84
b 170
b 180
b 120
b 110
b 138
b 140
b 142
b 144
b 52
b 62
b 92
b 90
c 999
d 888
e 777
f 666
;;;;
run;
** Step 1 - Clean the data **;
data work.cleandata36;
set cert.input36;
group=upcase(group);
if upcase(group) in ('A','B');
run;
** Step 2 - Calculate the median save it in a data set named Summary **;
*ODS TRACE ON;
ODS OUTPUT Summary=Summary;
proc means data=work.cleandata36 median;
class group;
var kilograms;
run;
*ODS TRACE OFF;
** Step 3 - Populate macro variables containing the calculated median for each group **;
DATA _NULL_;
set WORK.Summary;
CALL SYMPUTX(CATS('Median_', Group), ROUND(Kilograms_Median), 'G');
RUN;
** Step 4 - Substitute the median values when the actual value is out of range **;
data results.output36;
set cleandata36;
if Kilograms < 40 OR Kilograms > 200 then
do;
if group='A' then
kilograms=&Median_A;
else
kilograms=&Median_B;
end;
run;
I don't have access to the certification data (at least not that I know of). What is the median value for Group A? What is the median value for Group B? I assume that those values are 79 and 89, respectively. If one sets a missing value to the median value, then one does not shift a curve one way or another, at least I think that's the idea behind this question. Using the median value is an attempt to avoid skewing the data.
Jim
Elaborating on my earlier answer, here's a program that does I believe everything that they're asking for. It uses the calculated means in the final step where values out of range are replaced with the median values. My data has a median of 79 for Group A and 89 for Group B.
As a suggestion: You might try running my program and playing with it a bit in order to understand it better.
See also the comments in the code, such as they are.
Jim
LIBNAME Cert 'X:\sastemp';
LIBNAME Results 'X:\sastemp';
** Step 0 - Load data **;
data Cert.INPUT36;
LENGTH Group $1;
infile datalines4 dsd dlm='09'x;
input Group $
Kilograms
;
DATALINES4;
A 24
A 54
A 60
A 44
A 84
A 42
A 52
A 72
A 74
A 36
A 38
a 104
a 190
a 180
a 120
a 100
a 128
a 130
a 132
a 134
b 24
b 44
b 60
b 54
b 94
b 52
b 62
b 92
b 84
b 86
b 88
b 84
b 170
b 180
b 120
b 110
b 138
b 140
b 142
b 144
b 52
b 62
b 92
b 90
c 999
d 888
e 777
f 666
;;;;
run;
** Step 1 - Clean the data **;
data work.cleandata36;
set cert.input36;
group=upcase(group);
if upcase(group) in ('A','B');
run;
** Step 2 - Calculate the median save it in a data set named Summary **;
*ODS TRACE ON;
ODS OUTPUT Summary=Summary;
proc means data=work.cleandata36 median;
class group;
var kilograms;
run;
*ODS TRACE OFF;
** Step 3 - Populate macro variables containing the calculated median for each group **;
DATA _NULL_;
set WORK.Summary;
CALL SYMPUTX(CATS('Median_', Group), ROUND(Kilograms_Median), 'G');
RUN;
** Step 4 - Substitute the median values when the actual value is out of range **;
data results.output36;
set cleandata36;
if Kilograms < 40 OR Kilograms > 200 then
do;
if group='A' then
kilograms=&Median_A;
else
kilograms=&Median_B;
end;
run;
Thank you very much for your help this definitely helps me understand it better!
I have no idea where 79 or 89 came from, but I assume they are the expected answer to step2.
Step 3 talks about replacing the value when the current value is either missing or not "valid".
The wording of the last two parts of step 3 is confusing. The middle line make it look like you were requested to eliminate the observations with such invalid values. If you did that then there would be any need to replace any values. So in that case you should first writes some words describing how you have translated their poorly worded question into something that makes sense and then provide the answer to that.
The test : (Kilograms < 40 OR Kilograms > 200 ) will includes values that are too large and values that are too small. Since SAS treats missing values as smaller than any actual value it will also include the missing values.
So you get to the nested IF/THEN/ELSE statements only when you want to replace the value of KILOGRAMS with the group means. That nested IF/THEN/ELSE group is only testing for GROUP='A' but because in STEP 1 you eliminated any values of GROUP that are not A or B then the ELSE must be those where GROUP='B'.
Yes the wording definitely threw me for a loop. After looking further you are correct the 79 and 89 are the median but it didnt click very easily at first when I read what was being asked. Thank you!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.