Re: Multiple records per individual - data fill and random selection

LucyLou · Posted 04-01-2015 07:46 PM

Hello,

I have a huge dataset (>3 million records) with approximately 600,000 individuals (many have more than one visit).

First, I would like to be able to generate code to fill in gender and race/ethnicity from one record of an individual to their other records if some records are missing this information.

For example, say I have the following records:

ID Gender Race/Ethnicity

1 1 3

1 . 3

1 1 .

2 . 1

2 0 1

2 0 .

I would like the missing cells to be replaced with information from the other records that are available for that same person.

Then, I would like to randomly select one visit that contains complete information (age, gender, race/ethnicity) for each person if they have more than one visit; however if complete information is unavailable, I would still like to select at least one record for each person.

I hope this makes sense. Thank you in advance for your thoughts and help on this issue.

Kindly,

ARR

ballardw · Posted 04-01-2015 08:02 PM

Do the second part first. If the values are numeric for age, gender and race/ethnicity a simple code would be:

proc summary data=have nway;

class id;

var age gender race; /* your variable names here of course*/

output out=want (drop=_type_) max=;

run;

This would select the largest value for each of the variables so if any are missing then then that means it was missing in all of the records.

You could then use the result of the above to merge with the original data set.

Proc sort data=have; by id;run;

(the want set would be sorted by default by id as created above)

data new;

merge have

want;

by id;

run;

This creates a new data set, which isn't efficient but lets you keep your starting data in case of error or changed requirement. This will by default replace any value in the base data that is missing with the value in the want data.

If you feel confident then you can use MODIFY to update in place. I would experiment with subsets of the data first to verify everything is working as needed.

KachiM · Posted 04-02-2015 12:11 AM

In one data step program you can achieve it.

Do you need two output data sets - one with (> 3 M records imputed) and 600,000 randomly selected records ?

Both can be achieved if your input data set is in sorted order of ID. Use DoW-loop.

LucyLou · Posted 04-02-2015 06:09 PM

Thank you both for your suggestions. I will play around with both approaches.

Datasp, I only need one data set with my 600,000 records.

Patrick · Posted 04-02-2015 08:05 PM

3M records are not that much and this is only a adhoc process - so I wouldn't be too concerned about performance but much more about keeping the code simple.

You will need to be careful how you cleanse your data so that it doesn't become biased. Below code just takes the maximum value per patient to create a "golden record". If it's only about replacing a missing with a value from another record then such an approach is o.k. - but what if you have different non-missing values for a patient?

If you have such cases then it's may be better to consider a bit a more sophisticated approach. So for example: How is the data collected? I would assume once patient details have been entered they will show up/be re-used in consecutive visits. In such a case I would assume the latest record has the best data quality and you probably should pick this one. If this is not the case then you could pick the non-missing value with the highest frequency in case of multiple non-missing values - and you would need to come up with some sort of random selection in case of ties. A starting point would be "Proc Rank".

data have;

input (ID Gender Race_Ethnicity treatment) ($) visit_dt :date9.;

format visit_dt date9.;

datalines;

1 1 3 a 1jan2015

1 . 3 b 10jan2015

1 1 . a 15feb2015

2 . 1 y 23jan2015

2 0 1 x 2feb2015

2 0 . z 7mar2015

;

run;

proc sort data=have;

by id visit_dt;

run;

proc surveyselect data=have(keep=id visit_dt) sampsize=1 noprint out=sample(keep=id visit_dt);

strata id ;

run;

proc sql;

create table golden_record as

select id, max(gender) as gender, max(Race_Ethnicity) as Race_Ethnicity

from have

group by id

;

quit;

data want;

merge have sample(in=in_s);

by id visit_dt;

if in_s;

merge sample golden_record;

set golden_record;

by id;

run;

MarkWik · Posted 04-03-2015 06:00 AM

Sorry Patrick for bothering, I'd like to understand the functionality of using multiple merge statements and by statements in a single datastep as you have used:

data want;

merge have sample(in=in_s);

by id visit_dt;

if in_s;

merge sample golden_record;

by id;

run;

So, at compile time, i understand the obs from have and sample will be constructed in PDV as one, and how does the 2nd merge and by fit in the PDV? Or does this work in such a way the merged obs from the first merge merges with 2nd? I am just confused

Patrick · Posted 04-03-2015 09:03 AM

Actually giving the code I've posted a bit more thought the second merge is not required but a simple set for "golden_record" does the job as well. I've amended the code accordingly.

As for the PDV: During compilation time SAS will define all variables used anywhere in the data step the first time it encounters them (it's like an iteration zero through the data step code where everything gets defined). For variables existing in more than one input data set the first occurrence will be used.

As for merging: There it's the other way round and for same named variables which exist in more than one input data set, the value of the last one will make it to the output data set. That's why in the merge statement "sample" must come 2nd. You could also drop all variables from "have" which are not part of the by group and which should be taken from "sample". Then the order of the data sets in the merge statement wouldn't matter anymore.

The whole process works under the assumption that {ID, Visit_dt} can be used as primary key (so defining exactly one record). If this is not the case in your real data then you must use something else together with "ID" to create such a primary key.

"Sample" has only 1 record per ID and so does "golden_record". Both data sets contain exactly the same ID's in the same sort order (sorted by ID). That's why we can simply read the next record from "golden_record" whenever the data step gets there (which is exactly once per ID).

MarkWik · Posted 04-06-2015 02:38 AM

Thank you so much Patrick, Please accept my apologies for acknowledging late as I was enjoying my weekend. Many thanks once again for your time.

KachiM · Posted 04-03-2015 10:44 AM

Here is a data step approach. The input data set is sorted by ID. To help to fill the missing values, an array for Gender and Race each is made. The dimension has to be the maximum number of observations per ID. In this case it is three(3). Further Gender and Race are taken to be Number for using Arrays.

data have;

input ID Gender Race;

datalines;

1 1 3

1 . 3

1 1 .

2 . 1

2 0 1

2 0 .

;

run;

data want(keep = id Gender Race);

array G[3] _temporary_;

array R[3] _temporary_;

do i = 1 by 1 until(last.id);

set have;

by id;

G = Gender;

R = Race;

end;

* Find the non-missing value;

maxG = max(of G

);

maxR = max(of R

);

* Fill with non-missing value;

do j = 1 to i;

G = maxG;

R = maxR;

end;

* select one Random record;

rnd = ceil(ranuni(123) * i);

Gender = G[rnd];

Race = R[rnd];

output;

run;

Registration is open