Hi all,
I have data where the same individual could have had more than one race/ethnicity ('re' for short) assigned to them (which should not be the case cos a person's race/ethnicity pretty much does not change). I'm trying to write a code so that if there is more than one race/ethnicity value for the same person, then the value with highest frequency of occurrence will be kept. But if the frequencies are the same for two different races for a person, then they will be assigned to an unknown category (8).
Here is a sample of the data:
I would like patient 11 assigned to a final value of 8 since they have 're' values of 5 and 1 each having a frequency of 1. My sas code is below and it doesnt seem to be working to assign patient 11 to an unknown category of 8. Any help is much appreciated.
proc sql;
create table re_temp1 as
select unique UID4S1 , raceth as re,
count(re) as re_frequency
from all
/*where dob_inc_missing ge mdy(01,01,1900)*/
group by UID4S1, re
order by UID4S1, re_frequency desc, re desc;
quit;
proc sort data= re_temp1 out=link;
by UID4S1 descending RE_frequency re;
run;
data check2;
set link;
by UID4S1 descending RE_frequency re ;
if first.UID4S1 then do;
final_re = re;
retain final_re;
if (first.re ne last.re) and
(first.re_frequency eq last.re_frequency) then final_re= 8;
end;
if final_re=. then final_re=-999;
run;
Hi @Vivy1
This works on the same principles as your code. The only real difference is that I dropped the descending sort and only outputs the final result with one observation per Patient. This is merged back on the frequency data set to reproduce your example data.
data have;
infile datalines;
informat UID4S1 $10.;
input UID4S1 raceth;
datalines;
Patient1 2
Patient1 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient3 2
Patient3 2
Patient3 2
Patient3 2
Patient3 2
Patient3 9
Patient3 9
Patient5 5
Patient6 5
Patient7 9
Patient8 1
Patient8 1
Patient8 1
Patient8 1
Patient9 2
Patient9 2
Patient10 2
Patient11 5
Patient11 1
Patient12 1
;
run;
* Count frequency and add ID variable;
* The ID variable is not necessary, UID4S1 would work also, it is introduced to get data in
"readable" sort order, so Patient10, 11 and 12 aren't placed before Patient2, which is the case
when sorting is done on the character variable UID4S1;
proc sql;
create table re_temp1 as
select unique
input(compress(UID4S1,,'kd'),8.) as ID,
UID4S1,
raceth as re,
count(re) as re_frequency
from have
group by
UID4S1,
raceth
order by
ID,
re_frequency,
re
;
quit;
* The real output - one obs per ID;
data want (drop=re re_frequency lastfreq);
set re_temp1;
by ID re_frequency;
retain lastfreq;
if first.ID then lastfreq = 0;
if not last.ID then lastfreq = re_frequency;
if last.ID then do;
if re_frequency = lastfreq then final_re = 8;
else final_re = re;
output;
end;
run;
* Merge back for control;
data check2 (drop=ID);
merge re_temp1 want;
by ID;
run;
You should post your demo-data as data step code, not as a picture.
Here is a possible solution (not tested):
data want;
do until(last.UID451);
set have;
by UID451;
if re_frequency>max_frequency then do;
max_frequency=re_frequency;
final_re=re;
end;
else if re_frequency=max_frequency then
final_re=8;
end;
do until(last.UID451);
set have;
by UID451;
output;
end;
drop max_frequency;
run;
Hi @Vivy1
This works on the same principles as your code. The only real difference is that I dropped the descending sort and only outputs the final result with one observation per Patient. This is merged back on the frequency data set to reproduce your example data.
data have;
infile datalines;
informat UID4S1 $10.;
input UID4S1 raceth;
datalines;
Patient1 2
Patient1 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient2 2
Patient3 2
Patient3 2
Patient3 2
Patient3 2
Patient3 2
Patient3 9
Patient3 9
Patient5 5
Patient6 5
Patient7 9
Patient8 1
Patient8 1
Patient8 1
Patient8 1
Patient9 2
Patient9 2
Patient10 2
Patient11 5
Patient11 1
Patient12 1
;
run;
* Count frequency and add ID variable;
* The ID variable is not necessary, UID4S1 would work also, it is introduced to get data in
"readable" sort order, so Patient10, 11 and 12 aren't placed before Patient2, which is the case
when sorting is done on the character variable UID4S1;
proc sql;
create table re_temp1 as
select unique
input(compress(UID4S1,,'kd'),8.) as ID,
UID4S1,
raceth as re,
count(re) as re_frequency
from have
group by
UID4S1,
raceth
order by
ID,
re_frequency,
re
;
quit;
* The real output - one obs per ID;
data want (drop=re re_frequency lastfreq);
set re_temp1;
by ID re_frequency;
retain lastfreq;
if first.ID then lastfreq = 0;
if not last.ID then lastfreq = re_frequency;
if last.ID then do;
if re_frequency = lastfreq then final_re = 8;
else final_re = re;
output;
end;
run;
* Merge back for control;
data check2 (drop=ID);
merge re_temp1 want;
by ID;
run;
@Vivy1 wrote:
Hi all,
I have data where the same individual could have had more than one race/ethnicity ('re' for short) assigned to them (which should not be the case cos a person's race/ethnicity pretty much does not change).
In snark mode I would say that you may not be paying attention to some serious trends in "self identification" that will affect such.
The basic approach is going to be to get something with all the RE values per person identifier and then apply your rules to that and merge a "final" result back onto the data.
However that would require the complete set of rules involved in deciding where each combination of RE values goes. I dealt with a government Race/ Ethnicity coding scheme that regardless of any value of race if a certain ethnicity was indicated that became the result, similar that if a certain race was mentioned in combination with any other that was the race/ethnicity to be assigned. In other words, a hierarchy of assignment was involved with some combinations not being coded as might be expected without ALL of the rules involved. Which here would mean pretty much all of the combinations you have and the rules.
Another data source I work with is so poor on data quality of entry of some such things we just use the value provided in the most recent record.
So get the counts and order count within ID.
Then use BY group processing to get the information you need to find the most common value.
So to make a dataset that has your ID variable and your new variable, let's call it WANT you could use a step like this:
data want;
set counts;
by UID4S1 re_count;
if last.UID4S1 then do;
if first.re_count then final_re=re;
else final_re=8;
output;
end;
keep uid4s1 final_re;
run;
Thank you!! This worked great too. It seems so easy not sure why i didnt think of it this way
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.