Solved: Maximum value for chacracter variables in ID statement in PROC MEANS

Phil_NZ · Posted 03-09-2021 03:55 PM

Hi SAS Users,

Yesterday, I learned how to tag along a variable to proc means by using ID statement as suggested by @FreelanceReinh . However, the description for this one is quite ambiguous (without example).

Therefore, I did another search and see there is a discussion having the example and quite nice.

I use the sashelp.class dataset for this example

The dataset is:

Name	Sex	Age	Height	Weight
Joyce	F	11	51.3	50.5
Thomas	M	11	57.5	85
Jane	F	12	59.8	84.5
Louise	F	12	56.3	77
James	M	12	57.3	83
John	M	12	59	    99.5
Robert	M	12	64.8	128
Alice	F	13	56.5	84
Barbara	F	13	65.3	98
Jeffrey	M	13	62.5	84
Carol	F	14	62.8	102.5
Judy	F	14	64.3	90
Alfred	M	14	69	    112.5
Henry	M	14	63.5	102.5
Janet	F	15	62.5	112.5
Mary	F	15	66.5	112
Ronald	M	15	67	    133
William	M	15	66.5	112
Philip	M	16	72	    150

My code is as below

data have;
  set sashelp.class;
run;

proc sort data= have;
by age;
run;

proc means data=have;
by age;
var Height Weight;
id Sex;
output out=means(drop =_type_ _freq_);
run;

As what I understand, The value of variable Sex will follow the value of variable Age (if there are two vars in IDs, the second var will follow the maximum value of the first var (same row))

identifies one or more variables from the input data set whose maximum values for groups of observations PROC MEANS includes in the output data set

note that "maximum" refers to alphabetical order in case of character variables such as LOC

But, the result from running the code above is

I have two curiousities here:

1. I confused because the value shown in Sex should be "F" rather than "M:" (because 'F' standing in front of 'M' in alphabet. Even it does not relate to the calculation in general, but because I want to apply in my case with the variable LOC, so I need things precisely.

2. The ID statement seems to be efficiently used in a case that the character variables are constant over time like in my previous topic?

Warmest regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

FreelanceReinh · Posted 03-09-2021 06:43 PM

Hi @Phil_NZ,

@Phil_NZ wrote:

1. I confused because the value shown in Sex should be "F" rather than "M:" (because 'F' standing in front of 'M' in alphabet.

No, this is not how the order of character variables is defined in SAS. For SAS 'A' < 'B' < 'C' < ... like 1 < 2 < 3 < ... More precisely, SAS (under Windows) uses the underlying numeric ASCII codes to define the order of character values. As a consequence, lowercase letters are "greater than" uppercase letters (e.g., 'Z' < 'a' is true), which in turn are "greater than" digits. See an ASCII table or play with the RANK or BYTE function to find out more details.

So, the "maximum" in the context of the ID statement in PROC MEANS refers to the value that comes last according to the ASCII collating sequence. In your example each of the age groups contains at least one male person and since 'M' > 'F' all age groups are "identified" by Sex='M'. Be creative and modify the example by assigning random letters (combine the BYTE and RAND('integer',...) functions) to variable Sex and see how the results change.

View solution in original post

Reeza · Posted 03-09-2021 04:12 PM

RTM.
ID -> identifies one or more variables from the input data set whose maximum values for groups of observations PROC MEANS includes in the output data set.

You likely need to remerge your data, ID identifies the SEX with the maximum value. If the value is not consistent within the BY groups and you're expecting the value from the original data then you must remerge instead of use ID.
Or look at the IDMIN or PRINTIDVARS option on the PROC statement as the documentation suggests.

How to remerge:
https://github.com/statgeek/SAS-Tutorials/blob/master/add_average_value_to_dataset.sas

Documentation
https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=proc&docsetTarget=n1niocu...

PROC SUMMARY offers some other ID options but in general, I prefer to just remerge as it's less testing or worrying.

Phil_NZ · Posted 03-09-2021 05:38 PM

Hi @Reeza

Many thanks for your suggestion

Let's say about the github solution, I deal with the Solution 1

*Add average value to a dataset;
*Solution 1 - PROC MEANS + Data step;
******************************************************;

proc means data=sashelp.class noprint;
    output out=avg_values mean(height)=avg_height;
run;

data class_data;
    set sashelp.class;

    if _n_=1 then
        set avg_values;
run;

proc print data=class_data;/* I adjusted this one from class to class_data because dataset class does not exist*/
run;

The result is as below:

It is a desired result, but I do not know how SAS put avg_height to every row like that. What I focus on is

data class_data;
    set sashelp.class;

    if _n_=1 then
        set avg_values;
run;

From such a code, we only have one time that _n_1 at the first iteration. So, avg_height should only display at the first row, and "." afterwards.

I know that my imagined running maybe wrong but could you please explain it to me?

Apart from that, could you please tell me what RTM stand for?

Warmest regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.

Reeza · Posted 03-09-2021 05:55 PM

RTM means Read The Manual or documentation as it's called these days.

IF _N_=1 trick means you need to understand how the SAS PDV works with multiple SET statements.
There are many papers out there on how the PDV works including in the documentation (RTM).

FreelanceReinh · Posted 03-09-2021 06:50 PM

@Phil_NZ wrote:
data class_data;
    set sashelp.class;

    if _n_=1 then
        set avg_values;
run;
From such a code, we only have one time that _n_1 at the first iteration. So, avg_height should only display at the first row, and "." afterwards.

Remember that variables read with a SET statement (here: avg_height from the second SET statement) are automatically retained, i.e., they are not reset to missing when the DATA step iterates. See section Redundancy in the documentation of the RETAIN statement.

FreelanceReinh · Posted 03-09-2021 06:43 PM

Hi @Phil_NZ,

@Phil_NZ wrote:

1. I confused because the value shown in Sex should be "F" rather than "M:" (because 'F' standing in front of 'M' in alphabet.

No, this is not how the order of character variables is defined in SAS. For SAS 'A' < 'B' < 'C' < ... like 1 < 2 < 3 < ... More precisely, SAS (under Windows) uses the underlying numeric ASCII codes to define the order of character values. As a consequence, lowercase letters are "greater than" uppercase letters (e.g., 'Z' < 'a' is true), which in turn are "greater than" digits. See an ASCII table or play with the RANK or BYTE function to find out more details.

So, the "maximum" in the context of the ID statement in PROC MEANS refers to the value that comes last according to the ASCII collating sequence. In your example each of the age groups contains at least one male person and since 'M' > 'F' all age groups are "identified" by Sex='M'. Be creative and modify the example by assigning random letters (combine the BYTE and RAND('integer',...) functions) to variable Sex and see how the results change.

Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Registration is open

Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Re: Maximum value for chacracter variables in ID statement in PROC MEANS

Registration is open

SAS Training: Just a Click Away