SAS Programming

combine (concatenate) data sets and select unique obs

fengyuwuzu · Posted 03-02-2016 10:45 AM

I have 7 data sets that share the same structure (ID, age, gender, etc.).

I want to combine them and then remove the duplicates. About 40% of the IDs are duplicated. Some variables (like age or gender) have a few missing values.

One way is to concatenate them into one file with a SET statement, and then use PROC SORT with NODUPKEY to remove the duplicate IDs.

Done this way, there is a chance that I remove obs that have age and gender info but keep those with missing values in age or gender.

Is there a way to combine them so that, when the ID is the same, missing values in age or gender are replaced by the available values?

Thanks

LinusH · Posted 03-02-2016 10:53 AM

Well, you could do a full join on all tables, and then do coalesce() on each column, assuming that you don't have duplicates within each table.
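A minimal sketch of the full-join idea, shown for only two of the seven tables to keep it short. The table names source1 and source2 and the variable names age and gender are assumptions for illustration, and the sketch assumes each table has at most one row per ID. In PROC SQL, COALESCE accepts numeric or character arguments as long as all arguments in one call share a type.

```sas
/* Sketch only: source1/source2 and the variable names are assumed. */
/* Assumes at most one row per ID within each table.                */
proc sql;
   create table joined as
   select coalesce(a.ID, b.ID)         as ID,
          coalesce(a.age, b.age)       as age,     /* first nonmissing age    */
          coalesce(a.gender, b.gender) as gender   /* first nonmissing gender */
   from source1 as a
        full join source2 as b
        on a.ID = b.ID;
quit;
```

Extending this to all seven tables means chaining six full joins and naming every variable in its own COALESCE call, and the join key itself has to be coalesced in each later ON clause.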

Data never sleeps

fengyuwuzu · Posted 03-02-2016 11:08 AM

There are some duplicate IDs in 6 of the 7 sets. But within a table, if age is missing for an ID, it is also missing in the duplicate records for that ID.

Reeza · Posted 03-02-2016 10:54 AM

Missing is considered the lowest value, so when you sort from low to high, records with missing values end up first and are then selected over records with values.

If you sort DESCENDING instead of the default ASCENDING, you'll choose the records with the information present. However, if you have multiple records where age is missing in one record and sex is missing in another, they won't be combined, so you'll need to modify them.
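A rough sketch of the descending-sort idea, assuming the seven data sets are named source1 through source7 and that the variables of interest are age and gender (all of these names are assumptions). Within each ID, records with nonmissing age sort first, and the first record per ID is kept.

```sas
/* Sketch only: data set and variable names are assumed. */
data combined;
   set source1 source2 source3 source4 source5 source6 source7;
run;

/* Missing sorts lowest, so DESCENDING puts nonmissing values first within each ID. */
proc sort data=combined out=sorted;
   by ID descending age descending gender;
run;

/* Keep only the first record per ID. */
data first_per_id;
   set sorted;
   by ID;
   if first.ID;
run;
```

This still keeps one whole record per ID, so an ID whose age is present only on one record and whose gender is present only on another ends up with one of the two missing.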

You can do each as a full join and use COALESCE to overwrite missing values.

There's also the UPDATE statement, but I'm not sure it will handle multiple records properly.

fengyuwuzu · Posted 03-02-2016 11:28 AM

Thank you Reeza. I am not familiar with the COALESCE function that you and LinusH pointed out. I will do some research on it. Thanks.

Astounding · Posted 03-02-2016 11:07 AM

A couple of important questions ...

Can any individual data set contain more than one observation for the same ID?

If two data sets contain conflicting values, does it matter which value gets used? (For example, age=25 in one data set and age=26 for the same ID in a different data set.)

fengyuwuzu · Posted 03-02-2016 11:12 AM

Within the same table, there are some duplicated IDs.

It is a good question. Is there a way to check whether the age differs for the same ID? I assume the same age was reported in all 7 data sets, but I need to make sure.
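One possible check, sketched under the assumption that the seven data sets have already been concatenated into a single data set (called all7 here, a name chosen for illustration). It lists every ID that carries more than one distinct nonmissing age. COUNT(DISTINCT ...) ignores missing values, so an ID with one real age plus some missing ages is not flagged.

```sas
/* Sketch only: all7 is assumed to hold all seven data sets stacked together. */
proc sql;
   create table age_conflicts as
   select ID,
          count(distinct age) as n_ages   /* distinct nonmissing ages for this ID */
   from all7
   group by ID
   having count(distinct age) > 1;
quit;
```

An empty age_conflicts table would suggest that age is consistent across sources. The same query can be repeated for gender or any other variable.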

Astounding · Posted 03-02-2016 11:22 AM

Well, cleaning the data will have to remain a separate step. Once you have clean data, here is a method that works regardless of the number of observations per ID in each source.

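/* Concatenate all seven sources into one data set */
data all7;
   set source1 source2 source3 source4 source5 source6 source7;
run;

/* UPDATE requires the data to be sorted by the BY variable */
proc sort data=all7;
   by ID;
run;

/* Empty master (obs=0); every observation is applied as a transaction */
data want;
   update all7 (obs=0) all7;
   by ID;
run;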

The major advantage of this approach (and avoiding coalesce) is that you don't need to know the names of all the variables, and you don't need any sort of code to address each variable by name. But this won't resolve conflicts such as different ages in different sources. It will merely take the last nonmissing value that it locates.
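To illustrate that behavior, here is a small sketch on made-up data. The data set names toy and toy_want and all values are invented for illustration. With the default UPDATE behavior, a missing value in a transaction does not overwrite a value already collected for that ID.

```sas
/* Made-up data, already in ID order. A single period reads as missing */
/* for both numeric and character variables in list input.             */
data toy;
   input ID age gender $;
   datalines;
1 . F
1 34 .
2 51 M
2 . M
3 25 M
3 26 M
;
run;

/* Same pattern as above: empty master, all rows applied as transactions. */
data toy_want;
   update toy (obs=0) toy;
   by ID;
run;

proc print data=toy_want;
run;
```

The result has one observation per ID. ID 1 ends up with age 34 and gender F, pieced together from its two records. ID 3 ends up with age 26, the last nonmissing value, which is the unresolved-conflict case described above.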

fengyuwuzu · Posted 03-02-2016 11:28 AM

Great. I will give it a try.

LinusH · Posted 03-02-2016 11:13 AM

If the data comes from the same source, I suggest that you order new, cleansed data.
