Re: Imputing Gender and Race to the maximal value

Schtroumpfette · Posted 03-11-2021 10:39 AM

Hello,

I would like to impute "sex" and "race" in a dataset to the highest value. Meaning if a patient is male > 50% of the time, he is classified as male; if a patient is Hispanic > 50% of the time, then she is of Hispanic race.

What if they are male 50 % of the time and female 50% of the time? How can I impute that?

So basically, my set contains the following information:

-encounter_ids : visit of the patient that is the record is associated with

-Patient_id : unique identifier. Each patient can have multiple encounter ids (since they had many hospital visits)

-Gender (male, female, null, not mapped, null, other, unknown/invalid)

-Race (African American, Asian, Asian/pacific islander, biracial, Caucasian, Hispanic, mid-Eastern Indian, Null, Native American, Not Mapped, Other, Pacific Islander, Unknow).

Note: I don't want to delete any observations at this point.

Thanks,

Schtroumpfette

ballardw · Posted 03-11-2021 11:14 AM

Proper phrasing would be "to the most frequent value".

If you imply a specific Race or Gender is a "maximum" you may well have to deal with a lecture on "bias".

Impute would normally be done only for missing values. Is that what you are looking for?

Before you start imputing the results since you say you have multiple visits for individuals have checked to see if any of the values for race or gender change between visits? Shouldn't happen but I have medical testing data where I have such occur way too often, such as having 5 visits in a year with changes of race and ethnicity at each.

Or are you looking to make the data consistent with the most common values per patient?

An example of some starting data, dummy is fine as long as it represents what you have, and then the expected result.

As far as the 50/50 split goes it may depend on exactly what you are expecting for a result. It might mean use either the earliest or latest record, depending on why you need to impute the value.

Schtroumpfette · Posted 03-11-2021 11:20 AM

Thank you for the quick response.

Yes, I mean I want to make the data consistent with the most common values per patient. I checked the data and it does not happen often but I am looking for consistency.

For the 50/50, I would use the earliest record as a standard.

My apologies, I forgot to specify that I need a code for this. Is there a proc sql step for this? or a data step?

We can call the dataset "imputation'

Thanks,

Nisrine

Schtroumpfette · Posted 03-11-2021 11:27 AM

Thanks Reeza,
I will try this and get back to you.

Nisrine

Reeza · Posted 03-11-2021 11:22 AM

Use PROC FREQ to get the percentages of each variable and then pick the top one for each category. This handles ties by taking the random values. In my experience, it's better to take the latest value instead of ties, especially for somethings like Gender where it has the potential to change.

The table ID_GENDER_LOOKUP will contain the lookup value.

proc freq dta=have noprint;
table Patient_ID*Gender / list out=gender_list;
run;

*add a random number for ties;
data gender_list_random;
set gender_list;
call streaminit(24);
randVar=rand('normal', 1000, 50);
run;

proc sort data=gender_list_random;
by patient_ID descending percent randVar ;
run;

proc sort data=gender_list_random out=ID_Gender_LookUp nodupkey;
by patient_ID ;
run;

@Schtroumpfette wrote:

Hello,

I would like to impute "sex" and "race" in a dataset to the highest value. Meaning if a patient is male > 50% of the time, he is classified as male; if a patient is Hispanic > 50% of the time, then she is of Hispanic race.

What if they are male 50 % of the time and female 50% of the time? How can I impute that?

So basically, my set contains the following information:

-encounter_ids : visit of the patient that is the record is associated with

-Patient_id : unique identifier. Each patient can have multiple encounter ids (since they had many hospital visits)

-Gender (male, female, null, not mapped, null, other, unknown/invalid)

-Race (African American, Asian, Asian/pacific islander, biracial, Caucasian, Hispanic, mid-Eastern Indian, Null, Native American, Not Mapped, Other, Pacific Islander, Unknow).

Note: I don't want to delete any observations at this point.

Thanks,

Schtroumpfette

Schtroumpfette · Posted 03-11-2021 12:34 PM

Hello Reeza,

I get this error at the first step: SAS system stopped processing this step because of insufficient memory.

proc freq dta=have noprint;
table Patient_ID*Gender / list out=gender_list;
run;

Is there a reason for this,

Thanks,
Nisrine

ballardw · Posted 03-11-2021 12:40 PM

@Reeza wrote:

Use PROC FREQ to get the percentages of each variable and then pick the top one for each category. This handles ties by taking the random values. In my experience, it's better to take the latest value instead of ties, especially for somethings like Gender where it has the potential to change.

Yes gender may change. In my case the data sometimes has this happen on the same day and in an environment where it is clearly not a "before"/"after" case. Two or more swabs taken from different specimen sites on the same day.

There I times I start suspecting either some sort of fraud or just plain $%^ on someone's part entering or selecting a "unique identifier". Same date of test(s), same unique patient ID, same clinic: different gender, date of birth, race and/or ethnicity.

When asked for any report I do tend to standardize to the latest recorded values for the id when the individual is a matter of interest such as with repeated infections.

SAS Innovate 2025: Call for Content

Classroom Training Available!