Solved: Re: Mode based on group

Jbraun · Posted 07-23-2022 10:37 AM

Dear awesome community

I am hoping someone can help me with this one. I have data in which I would like to insert the lowest frequency occupation for each unit. I can find mean and other measures in proc sql, but not mode. The idea is to see if being in a profession with lowest frequency might predict resignation since you may feel that you don't belong as much as the others.

Below I have tried to explain with a dummy that signifies the occurence of a statistician in unit 1 and a doctors and therapists in unit 2 are the ones with lowest frequency occupation.

have:

data data.data;
input ID Occupation Unit;
datalines;

001 Nurse 1
002 Therapist 1

003 Nurse 1

004 Therapist 1

005 Statistician 1

006 Nurse 2

007 Nurse 2

008 Nurse 2
009 Doctor 2

010 Doctor 2

011 Therapist 2

012 Therapist 2

;
run;

want:

data data.data2;
input ID Occupation Unit OccuLeast;
datalines;

001 Nurse 1 0
002 Therapist 1 0

003 Nurse 1 0

004 Therapist 1 0

005 Statistician 1 1

006 Nurse 2 0

007 Nurse 2 0

008 Nurse 2 0
009 Doctor 2 1

010 Doctor 2 1

011 Therapist 2 1

012 Therapist 2 1

Any help would be very appreciated!

Kind regards

Jacob

PaigeMiller · Posted 07-23-2022 11:40 AM

Hello ... from now on, we request (actually we insist) that you provide your data as WORKING data step code. The code you provided does not work. Please make sure it works by actually testing it yourself before you include it in your message.

Here is my solution. It seems as if you are counting ties to all be occuleast, so to me this indicates you want to use PROC RANK where you can control the way ties are handled.

data have;
input ID Occupation :$12. Unit; 
datalines;
001 Nurse 1  
002 Therapist 1 
003 Nurse 1 
004 Therapist 1
005 Statistician 1
006 Nurse 2 
007 Nurse 2
008 Nurse 2
009 Doctor 2
010 Doctor 2
011 Therapist 2
012 Therapist 2
;
proc freq data=have;
	by unit;
	table occupation/out=_freqs_;
run;
proc rank data=_freqs_ out=min ties=low;
	var count;
	by unit;
	ranks count_ranked;
run;
proc sql;
	create table want as select a.*,(b.count_ranked=1) as occuleast
	from have as a left join min as b	
	on a.unit=b.unit and a.occupation=b.occupation
	order by a.id;
quit;

Another solution would be to use PROC SUMMARY in place of PROC RANK.

I can find mean and other measures in proc sql, but not mode.

This is one of many problems and situations, in my opinion, where trying to coerce a solution from PROC SQL is not a good idea. I urge you to stop trying to use SQL for these problems. See Maxim 7 (There is a procedure for it) and Maxim 10 (SQL may eat your work and time) and also take a look at all the other maxims. There is a procedure (PROC RANK) which actually ranks the occupation frequencies for you, no need to create your own SQL code to do this. https://communities.sas.com/t5/SAS-Communities-Library/Maxims-of-Maximally-Efficient-SAS-Programmers...

--
Paige Miller

View solution in original post

Jbraun · Posted 07-23-2022 10:38 AM

Sorry for misleading headline, can't edit - it is not mode I am looking for

PaigeMiller · Posted 07-23-2022 11:40 AM

Hello ... from now on, we request (actually we insist) that you provide your data as WORKING data step code. The code you provided does not work. Please make sure it works by actually testing it yourself before you include it in your message.

Here is my solution. It seems as if you are counting ties to all be occuleast, so to me this indicates you want to use PROC RANK where you can control the way ties are handled.

data have;
input ID Occupation :$12. Unit; 
datalines;
001 Nurse 1  
002 Therapist 1 
003 Nurse 1 
004 Therapist 1
005 Statistician 1
006 Nurse 2 
007 Nurse 2
008 Nurse 2
009 Doctor 2
010 Doctor 2
011 Therapist 2
012 Therapist 2
;
proc freq data=have;
	by unit;
	table occupation/out=_freqs_;
run;
proc rank data=_freqs_ out=min ties=low;
	var count;
	by unit;
	ranks count_ranked;
run;
proc sql;
	create table want as select a.*,(b.count_ranked=1) as occuleast
	from have as a left join min as b	
	on a.unit=b.unit and a.occupation=b.occupation
	order by a.id;
quit;

Another solution would be to use PROC SUMMARY in place of PROC RANK.

I can find mean and other measures in proc sql, but not mode.

This is one of many problems and situations, in my opinion, where trying to coerce a solution from PROC SQL is not a good idea. I urge you to stop trying to use SQL for these problems. See Maxim 7 (There is a procedure for it) and Maxim 10 (SQL may eat your work and time) and also take a look at all the other maxims. There is a procedure (PROC RANK) which actually ranks the occupation frequencies for you, no need to create your own SQL code to do this. https://communities.sas.com/t5/SAS-Communities-Library/Maxims-of-Maximally-Efficient-SAS-Programmers...

--
Paige Miller

Jbraun · Posted 07-23-2022 01:53 PM

Thank you, I apologize

Jbraun · Posted 07-23-2022 02:53 PM

As a token of appreciation I can add that "as" can be left out

ballardw · Posted 07-23-2022 05:18 PM

@Jbraun wrote:

As a token of appreciation I can add that "as" can be left out

As a personal style point however I don't. The saving of 2 characters typing occasionally doesn't recover the time spent parsing out syntax, especially from people with no, poor or very unusual indenting styles of code.

ballardw · Posted 07-23-2022 05:52 PM

As an addendum to @PaigeMiller about choosing the correct procedure.

Consider that I want a data where I have 3 categorical variables, for reference call them Cat1 cat2 and Cat3.

I also have 25 analysis variables, v1 through v25, where I want the min, max, median, mode, std, interquartile range, p5 and p95.

For combinations of the categorical variables: 1) none, the data set overall3 2) each level of Cat1 alone; 3) each level of Cat2 alone; 4) each level of Cat3 alone; 5) combinations of cat1 and cat2; 6) combinations of Cat1 and Cat3; 7)combinations of Cat2 and Cat3 and 8)combinations of cat1 and cat2 and cat3.

With proc sql (or sql in general) that mean 8 group by so likely 8 separate create table possibly with a union all of some sort that will likely take some time to get working. With 25 variables and 8 statistics (if sql did mode) that would be 25*8 statements like <stat>(varn) as result name.

With proc summary the code would look like:

Proc summary data=have;
   class cat1-cat3;
   var v1-v25;
   output out=want   
   min= max= median= mode= std= qrange= p5= p95= / autoname;
run;

The procedure will use short hand lists of variable names, same some typing if your variable names are "nice" but other wise you just list the names in the correct location: Class for categorical variables and Var for analysis variables. The output data set will include statistics with the name of the variable and the statistic as a suffix with the use of autoname so I don't have to write out 200 variable names with the almost inevitable of misspelling one as v25_mdeian instead of v25_median. As an added bonus there are two automatic variables added with the above code that contain the number of values used to calculate the statistic for each combination and another variable that indicates which combination is represented on that row of the data set.

There are additional options that are oneword that will have all of the combinations of the categorical variables even if they are not present in the data.

For an extra added bonus Summary is one of the few Procedures that can use Multilabel formats and create summaries of those.

An example is suppose I had a variable that indicated a small geographic area such as Zip code. If I have created a format correctly I can combine those Zip code values to represent an aggregate area such a part of a large town, an additional part of the format to cover the entire town, another level to get the county (or parish or what have), yet another level for combinations of counties, such as reporting region. Yes, such a multilabel format takes a little bit of work. However once you have it adding a format to for that variable in the above proc summary code would those levels to the analysis. If you wanted an SQL solution that would another 4 levels to each of the summaries needed.

You may think this is all hypothetical. I have one project I worked on for several years where I did summaries of the data by: All my records, county, combinations of counties to represent management regions, school district, individual school, 3 different school grades, 3 different school types and around 40 analysis variables representing counts of children by characteristic. I used Proc Summary to create the basic data I needed for reports (with a pass through a data step using multiple arrays to get the rates as well as counts needed for the report). Then the topic reports were written to select the required groups of records and displayed the results.

Since we could subset the data by the different groups we could provide customized reports for each school, each school district, each county, each school type (etc).

The moral of the story is to reiterate the "right procedure for the task".

PaigeMiller · Posted 07-24-2022 06:24 AM

@ballardw wrote:

As an addendum to @PaigeMiller about choosing the correct procedure.

Consider that I want a data where I have 3 categorical variables, for reference call them Cat1 cat2 and Cat3.

I also have 25 analysis variables, v1 through v25, where I want the min, max, median, mode, std, interquartile range, p5 and p95.

For combinations of the categorical variables: 1) none, the data set overall3 2) each level of Cat1 alone; 3) each level of Cat2 alone; 4) each level of Cat3 alone; 5) combinations of cat1 and cat2; 6) combinations of Cat1 and Cat3; 7)combinations of Cat2 and Cat3 and 8)combinations of cat1 and cat2 and cat3.

With proc sql (or sql in general) that mean 8 group by so likely 8 separate create table possibly with a union all of some sort that will likely take some time to get working. With 25 variables and 8 statistics (if sql did mode) that would be 25*8 statements like <stat>(varn) as result name.

I have seen people try to do this in SQL, and then they decide that macros are the right approach, but they don't know enough to use macros here, and then they get stuck. For some reason, many of them seem resistant to PROC SUMMARY, which is the perfect tool.

--
Paige Miller

Classroom Training Available!