BookmarkSubscribeRSS Feed
Schtroumpfette
Obsidian | Level 7

Hello, 


I would like to impute "sex" and "race" in a dataset to the highest value.  Meaning if a patient is male > 50% of the time, he is classified as male; if a patient is Hispanic > 50% of the time, then she is of Hispanic race. 

 

What if they are male 50 % of the time and female 50% of the time?  How can I impute that?

 

So basically, my set contains the following information: 

-encounter_ids : visit of the patient that is the record is associated with

-Patient_id : unique identifier.  Each patient can have multiple encounter ids (since they had many hospital visits)

-Gender (male, female, null, not mapped, null, other, unknown/invalid)

-Race (African American, Asian, Asian/pacific islander, biracial, Caucasian, Hispanic, mid-Eastern Indian, Null, Native American, Not Mapped, Other, Pacific Islander, Unknow).  

 

Note: I don't want to delete any observations at this point. 

 

Thanks,

Schtroumpfette 

 

6 REPLIES 6
ballardw
Super User

Proper phrasing would be "to the most frequent value".

 

If you imply a specific Race or Gender is a "maximum" you may well have to deal with a lecture on "bias".

 

Impute would normally be done only for missing values. Is that what you are looking for?

 

Before you start imputing the results since you say you have multiple visits for individuals have checked to see if any of the values for race or gender change between visits? Shouldn't happen but I have medical testing data where I have such occur way too often, such as having 5 visits in a year with changes of race and ethnicity at each.

 

Or are you looking to make the data consistent with the most common values per patient?

 

An example of some starting data, dummy is fine as long as it represents what you have, and then the expected result.

 

As far as the 50/50 split goes it may depend on exactly what you are expecting for a result. It might mean use either the earliest or latest record, depending on why you need to impute the value.

 

 

Schtroumpfette
Obsidian | Level 7

Thank you for the quick response.

 

Yes, I mean I want to make the data consistent with the most common values per patient.  I checked the data and it does not happen often but I am looking for consistency. 

 

For the 50/50, I would use the earliest record as a standard.

 

My apologies, I forgot to specify that I need a code for this.  Is there a proc sql step for this?  or a data step? 

 

We can call the dataset "imputation'

Thanks,

Nisrine

 

 

Schtroumpfette
Obsidian | Level 7

Thanks Reeza, 
I will try this and get back to you. 

Nisrine 

 

Reeza
Super User

Use PROC FREQ to get the percentages of each variable and then pick the top one for each category. This handles ties by taking the random values. In my experience, it's better to take the latest value instead of ties, especially for somethings like Gender where it has the potential to change.  

 

The table ID_GENDER_LOOKUP will contain the lookup value. 

 

proc freq dta=have noprint;
table Patient_ID*Gender / list out=gender_list;
run;

*add a random number for ties;
data gender_list_random;
set gender_list;
call streaminit(24);
randVar=rand('normal', 1000, 50);
run;

proc sort data=gender_list_random;
by patient_ID descending percent randVar ;
run;

proc sort data=gender_list_random out=ID_Gender_LookUp nodupkey;
by patient_ID ;
run;


@Schtroumpfette wrote:

Hello, 


I would like to impute "sex" and "race" in a dataset to the highest value.  Meaning if a patient is male > 50% of the time, he is classified as male; if a patient is Hispanic > 50% of the time, then she is of Hispanic race. 

 

What if they are male 50 % of the time and female 50% of the time?  How can I impute that?

 

So basically, my set contains the following information: 

-encounter_ids : visit of the patient that is the record is associated with

-Patient_id : unique identifier.  Each patient can have multiple encounter ids (since they had many hospital visits)

-Gender (male, female, null, not mapped, null, other, unknown/invalid)

-Race (African American, Asian, Asian/pacific islander, biracial, Caucasian, Hispanic, mid-Eastern Indian, Null, Native American, Not Mapped, Other, Pacific Islander, Unknow).  

 

Note: I don't want to delete any observations at this point. 

 

Thanks,

Schtroumpfette 

 


 

Schtroumpfette
Obsidian | Level 7

Hello Reeza,

 

I get this error at the first step: SAS system stopped processing this step because of insufficient memory.

 

proc freq dta=have noprint;
table Patient_ID*Gender / list out=gender_list;
run;

 

Is there a reason for this,

 

Thanks,
Nisrine 

ballardw
Super User

@Reeza wrote:

Use PROC FREQ to get the percentages of each variable and then pick the top one for each category. This handles ties by taking the random values. In my experience, it's better to take the latest value instead of ties, especially for somethings like Gender where it has the potential to change.  

 


Yes gender may change. In my case the data sometimes has this happen on the same day and in an environment where it is clearly not a "before"/"after" case. Two or more swabs taken from different specimen sites on the same day.

 

There I times I start suspecting either some sort of fraud or just plain $%^ on someone's part entering or selecting a "unique identifier". Same date of test(s), same unique patient ID, same clinic: different gender, date of birth, race and/or ethnicity.

 

When asked for any report I do tend to standardize to the latest recorded values for the id when the individual is a matter of interest such as with repeated infections.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1131 views
  • 0 likes
  • 3 in conversation