BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Vivy1
Fluorite | Level 6

Hi all,

I have data where the same individual could have had more than one race/ethnicity ('re' for short) assigned to them (which should not be the case cos a person's race/ethnicity pretty much does not change). I'm trying to write a code so that if there is more than one race/ethnicity value for the same person, then the value with highest frequency of occurrence will be kept. But if the frequencies are the same for two different races for a person, then they will be assigned to an unknown category (8).

Here is a sample of the data:

Vivy1_0-1703121921475.png

I would like patient 11 assigned to a final value of 8 since they have 're' values of 5 and 1 each having a frequency of 1. My sas code is below and it doesnt seem to be working to assign patient 11 to an unknown category of 8. Any help is much appreciated.

proc sql;
create table re_temp1 as 
select unique 	UID4S1 , raceth as re, 
				count(re) as re_frequency

from all
/*where dob_inc_missing ge mdy(01,01,1900)*/
group by UID4S1, re 
order by UID4S1, re_frequency desc,  re desc;
quit;

proc sort data= re_temp1 out=link;
by UID4S1 descending RE_frequency re;
run;

data check2; 
set link;
by UID4S1 descending RE_frequency re ; 
if first.UID4S1 then do;
final_re = re;
retain final_re; 
if (first.re ne last.re) and 
(first.re_frequency eq last.re_frequency) then final_re= 8; 
end; 
if final_re=. then final_re=-999; 
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Vivy1 

 

This works on the same principles as your code. The only real difference is that I dropped the descending sort and only outputs the final result with one observation per Patient. This is merged back on the frequency data set to reproduce your example data.

data have;
  infile datalines;
  informat UID4S1 $10.;
  input UID4S1 raceth;
  datalines;
Patient1 2
Patient1 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient3 2
Patient3 2
Patient3 2
Patient3 2
Patient3 2
Patient3 9
Patient3 9
Patient5 5
Patient6 5
Patient7 9
Patient8 1
Patient8 1
Patient8 1
Patient8 1
Patient9 2
Patient9 2
Patient10 2
Patient11 5
Patient11 1
Patient12 1
;
run;

* Count frequency and add ID variable;
* The ID variable is not necessary, UID4S1 would work also, it is introduced to get data in
  "readable" sort order, so Patient10, 11 and 12 aren't placed before Patient2, which is the case
  when sorting is done on the character variable UID4S1;
proc sql;
  create table re_temp1 as 
    select unique
      input(compress(UID4S1,,'kd'),8.) as ID,
      UID4S1,
      raceth as re, 
			count(re) as re_frequency
    from have
    group by 
      UID4S1,
      raceth 
    order by 
     ID, 
     re_frequency,
     re 
  ;
quit;

* The real output - one obs per ID;
data want (drop=re re_frequency lastfreq);
  set re_temp1;
  by ID re_frequency;
  retain lastfreq;
  if first.ID then lastfreq = 0;
  if not last.ID then lastfreq = re_frequency;
  if last.ID then do;
    if re_frequency = lastfreq then final_re = 8;
    else final_re = re;
    output;
  end;
run;

* Merge back for control;
data check2 (drop=ID);
  merge re_temp1 want;
  by ID;
run;

View solution in original post

5 REPLIES 5
s_lassen
Meteorite | Level 14

You should post your demo-data as data step code, not as a picture.

 

Here is a possible solution (not tested):

 

data want;
  do until(last.UID451);
    set have;
    by UID451;
    if re_frequency>max_frequency then do;
      max_frequency=re_frequency;
      final_re=re;
      end;
    else if re_frequency=max_frequency then
      final_re=8;
    end;
  do until(last.UID451);
    set have;
    by UID451;
    output;
    end;
  drop max_frequency;
run; 

 

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Vivy1 

 

This works on the same principles as your code. The only real difference is that I dropped the descending sort and only outputs the final result with one observation per Patient. This is merged back on the frequency data set to reproduce your example data.

data have;
  infile datalines;
  informat UID4S1 $10.;
  input UID4S1 raceth;
  datalines;
Patient1 2
Patient1 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient3 2
Patient3 2
Patient3 2
Patient3 2
Patient3 2
Patient3 9
Patient3 9
Patient5 5
Patient6 5
Patient7 9
Patient8 1
Patient8 1
Patient8 1
Patient8 1
Patient9 2
Patient9 2
Patient10 2
Patient11 5
Patient11 1
Patient12 1
;
run;

* Count frequency and add ID variable;
* The ID variable is not necessary, UID4S1 would work also, it is introduced to get data in
  "readable" sort order, so Patient10, 11 and 12 aren't placed before Patient2, which is the case
  when sorting is done on the character variable UID4S1;
proc sql;
  create table re_temp1 as 
    select unique
      input(compress(UID4S1,,'kd'),8.) as ID,
      UID4S1,
      raceth as re, 
			count(re) as re_frequency
    from have
    group by 
      UID4S1,
      raceth 
    order by 
     ID, 
     re_frequency,
     re 
  ;
quit;

* The real output - one obs per ID;
data want (drop=re re_frequency lastfreq);
  set re_temp1;
  by ID re_frequency;
  retain lastfreq;
  if first.ID then lastfreq = 0;
  if not last.ID then lastfreq = re_frequency;
  if last.ID then do;
    if re_frequency = lastfreq then final_re = 8;
    else final_re = re;
    output;
  end;
run;

* Merge back for control;
data check2 (drop=ID);
  merge re_temp1 want;
  by ID;
run;
ballardw
Super User

@Vivy1 wrote:

Hi all,

I have data where the same individual could have had more than one race/ethnicity ('re' for short) assigned to them (which should not be the case cos a person's race/ethnicity pretty much does not change).


In snark mode I would say that you may not be paying attention to some serious trends in "self identification" that will affect such.

The basic approach is going to be to get something with all the RE values per person identifier and then apply your rules to that and merge a "final" result back onto the data.

 

However that would require the complete set of rules involved in deciding where each combination of RE values goes. I dealt with a government Race/ Ethnicity coding scheme that regardless of any value of race if a certain ethnicity was indicated that became the result, similar that if a certain race was mentioned in combination with any other that was the race/ethnicity to be assigned. In other words, a hierarchy of assignment was involved with some combinations not being coded as might be expected without ALL of the rules involved. Which here would mean pretty much all of the combinations you have and the rules.

 

Another data source I work with is so poor on data quality of entry of some such things we just use the value provided in the most recent record.

 

 

 

Tom
Super User Tom
Super User

So get the counts and order count within ID.

Then use BY group processing to get the information you need to find the most common value.

So to make a dataset that has your ID variable and your new variable, let's call it WANT you could use a step like this:

data want;
  set counts;
  by UID4S1 re_count;
  if last.UID4S1 then do;
    if first.re_count then final_re=re;
    else final_re=8;
    output;
  end;
  keep uid4s1 final_re;
run;
Vivy1
Fluorite | Level 6

Thank you!! This worked great too. It seems so easy not sure why i didnt think of it this way

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1147 views
  • 3 likes
  • 5 in conversation