BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
fengyuwuzu
Pyrite | Level 9

I have 7 data sets which have the same data structures (ID, age, Gender, etc).

I want to combine them and then remove the duplicates. There is about 40% duplicated IDs. Some variables (like age or gender have a few missing values)

 

one way is to use set operation, and concatenate them into a file, and then use proc sort and nodupkey to remove the duplicate IDs.

In this way, there is a chance that I removed obs with age and gender info but kept those with missing values in age or gender.

 

is there a way that I can combine them, during which the missing values in age or gender can be replaced by available values when ID is the same?

 

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Well, cleaning the data will have to remain a separate step.  Once you have clean data, here is a method that works regardless of the number of observations per ID in each source. 

 

data all7;

set source1 source2 source3 source4 source5 source6 source7;

run;

 

proc sort data=all7;

by ID;

run;

 

data want;

update all7 (obs=0) all7;

by ID;

run;

 

The major advantage of this approach (and avoiding coalesce) is that you don't need to know the names of all the variables, and you don't need any sort of code to address each variable by name.  But this won't resolve conflicts such as different ages in different sources.  It will merely take the last nonmissing value that it locates.

View solution in original post

9 REPLIES 9
LinusH
Tourmaline | Level 20

Well, you could do a full join on all tables, and then do coalesce() on each column. Assuming that you don't have duplicates within each table.

Data never sleeps
fengyuwuzu
Pyrite | Level 9
There are some duplicates in ID variable in 6 of the 7 sets. But within the table, if age is missing for an ID, it is also missing in the same duplidate ID.
Reeza
Super User

Missing is considered the lowest value and when sorting from low to high they end up first - and then selected over records with values.

 

If you sort DESCENDING instead of the default ASCENDING you'll choose records with the information present. However, if you have multiple records where you're missing age in one record and sex in another record they wouldn't be combined so you'll need to modify them.

 

You can do each as a full join and use COALESCE to overwrite missing values.  

 

There's also the UPDATE statement, but I'm not sure it will handle multiple records properly.

 

 

fengyuwuzu
Pyrite | Level 9
Thank you Reeze. I am not familiar with COALESCE that you and LinusH pointed out. I will do some research on it. Thanks.
Astounding
PROC Star

A couple of important questions ...

 

Can any individual data set contain more than one observation for the same ID?

 

If two data sets contain conflicting values, does it matter which value gets used?  (For example, age=25 in one data set and age=26 for the same ID in a different data set.)

fengyuwuzu
Pyrite | Level 9

Within the same table, there are some duplicated in ID.

It is a good question. Is there a way to confirm for the same ID the age might be different? I assume they reported the same in the 7 data sets but need to make sure.

Astounding
PROC Star

Well, cleaning the data will have to remain a separate step.  Once you have clean data, here is a method that works regardless of the number of observations per ID in each source. 

 

data all7;

set source1 source2 source3 source4 source5 source6 source7;

run;

 

proc sort data=all7;

by ID;

run;

 

data want;

update all7 (obs=0) all7;

by ID;

run;

 

The major advantage of this approach (and avoiding coalesce) is that you don't need to know the names of all the variables, and you don't need any sort of code to address each variable by name.  But this won't resolve conflicts such as different ages in different sources.  It will merely take the last nonmissing value that it locates.

fengyuwuzu
Pyrite | Level 9
Great. I will give it a try.
LinusH
Tourmaline | Level 20
If the data comes from the same source, I suggest that order new cleansed data.
Data never sleeps

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1095 views
  • 1 like
  • 4 in conversation