DATA Step, Macro, Functions and more

combine (concatenate) data sets and select unique obs

Accepted Solution Solved
Reply
Super Contributor
Posts: 318
Accepted Solution

combine (concatenate) data sets and select unique obs

I have 7 data sets which have the same data structures (ID, age, Gender, etc).

I want to combine them and then remove the duplicates. There is about 40% duplicated IDs. Some variables (like age or gender have a few missing values)

 

one way is to use set operation, and concatenate them into a file, and then use proc sort and nodupkey to remove the duplicate IDs.

In this way, there is a chance that I removed obs with age and gender info but kept those with missing values in age or gender.

 

is there a way that I can combine them, during which the missing values in age or gender can be replaced by available values when ID is the same?

 

Thanks


Accepted Solutions
Solution
‎03-02-2016 12:21 PM
Super User
Posts: 5,500

Re: combine (concatenate) data sets and select unique obs

Posted in reply to fengyuwuzu

Well, cleaning the data will have to remain a separate step.  Once you have clean data, here is a method that works regardless of the number of observations per ID in each source. 

 

data all7;

set source1 source2 source3 source4 source5 source6 source7;

run;

 

proc sort data=all7;

by ID;

run;

 

data want;

update all7 (obs=0) all7;

by ID;

run;

 

The major advantage of this approach (and avoiding coalesce) is that you don't need to know the names of all the variables, and you don't need any sort of code to address each variable by name.  But this won't resolve conflicts such as different ages in different sources.  It will merely take the last nonmissing value that it locates.

View solution in original post


All Replies
Super User
Posts: 5,426

Re: combine (concatenate) data sets and select unique obs

Posted in reply to fengyuwuzu

Well, you could do a full join on all tables, and then do coalesce() on each column. Assuming that you don't have duplicates within each table.

Data never sleeps
Super Contributor
Posts: 318

Re: combine (concatenate) data sets and select unique obs

There are some duplicates in ID variable in 6 of the 7 sets. But within the table, if age is missing for an ID, it is also missing in the same duplidate ID.
Super User
Posts: 19,781

Re: combine (concatenate) data sets and select unique obs

Posted in reply to fengyuwuzu

Missing is considered the lowest value and when sorting from low to high they end up first - and then selected over records with values.

 

If you sort DESCENDING instead of the default ASCENDING you'll choose records with the information present. However, if you have multiple records where you're missing age in one record and sex in another record they wouldn't be combined so you'll need to modify them.

 

You can do each as a full join and use COALESCE to overwrite missing values.  

 

There's also the UPDATE statement, but I'm not sure it will handle multiple records properly.

 

 

Super Contributor
Posts: 318

Re: combine (concatenate) data sets and select unique obs

Thank you Reeze. I am not familiar with COALESCE that you and LinusH pointed out. I will do some research on it. Thanks.
Super User
Posts: 5,500

Re: combine (concatenate) data sets and select unique obs

Posted in reply to fengyuwuzu

A couple of important questions ...

 

Can any individual data set contain more than one observation for the same ID?

 

If two data sets contain conflicting values, does it matter which value gets used?  (For example, age=25 in one data set and age=26 for the same ID in a different data set.)

Super Contributor
Posts: 318

Re: combine (concatenate) data sets and select unique obs

[ Edited ]
Posted in reply to Astounding

Within the same table, there are some duplicated in ID.

It is a good question. Is there a way to confirm for the same ID the age might be different? I assume they reported the same in the 7 data sets but need to make sure.

Solution
‎03-02-2016 12:21 PM
Super User
Posts: 5,500

Re: combine (concatenate) data sets and select unique obs

Posted in reply to fengyuwuzu

Well, cleaning the data will have to remain a separate step.  Once you have clean data, here is a method that works regardless of the number of observations per ID in each source. 

 

data all7;

set source1 source2 source3 source4 source5 source6 source7;

run;

 

proc sort data=all7;

by ID;

run;

 

data want;

update all7 (obs=0) all7;

by ID;

run;

 

The major advantage of this approach (and avoiding coalesce) is that you don't need to know the names of all the variables, and you don't need any sort of code to address each variable by name.  But this won't resolve conflicts such as different ages in different sources.  It will merely take the last nonmissing value that it locates.

Super Contributor
Posts: 318

Re: combine (concatenate) data sets and select unique obs

Posted in reply to Astounding
Great. I will give it a try.
Super User
Posts: 5,426

Re: combine (concatenate) data sets and select unique obs

Posted in reply to fengyuwuzu
If the data comes from the same source, I suggest that order new cleansed data.
Data never sleeps
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 325 views
  • 1 like
  • 4 in conversation