- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have 7 data sets which have the same data structures (ID, age, Gender, etc).
I want to combine them and then remove the duplicates. There is about 40% duplicated IDs. Some variables (like age or gender have a few missing values)
one way is to use set operation, and concatenate them into a file, and then use proc sort and nodupkey to remove the duplicate IDs.
In this way, there is a chance that I removed obs with age and gender info but kept those with missing values in age or gender.
is there a way that I can combine them, during which the missing values in age or gender can be replaced by available values when ID is the same?
Thanks
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Well, cleaning the data will have to remain a separate step. Once you have clean data, here is a method that works regardless of the number of observations per ID in each source.
data all7;
set source1 source2 source3 source4 source5 source6 source7;
run;
proc sort data=all7;
by ID;
run;
data want;
update all7 (obs=0) all7;
by ID;
run;
The major advantage of this approach (and avoiding coalesce) is that you don't need to know the names of all the variables, and you don't need any sort of code to address each variable by name. But this won't resolve conflicts such as different ages in different sources. It will merely take the last nonmissing value that it locates.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Well, you could do a full join on all tables, and then do coalesce() on each column. Assuming that you don't have duplicates within each table.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Missing is considered the lowest value and when sorting from low to high they end up first - and then selected over records with values.
If you sort DESCENDING instead of the default ASCENDING you'll choose records with the information present. However, if you have multiple records where you're missing age in one record and sex in another record they wouldn't be combined so you'll need to modify them.
You can do each as a full join and use COALESCE to overwrite missing values.
There's also the UPDATE statement, but I'm not sure it will handle multiple records properly.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
A couple of important questions ...
Can any individual data set contain more than one observation for the same ID?
If two data sets contain conflicting values, does it matter which value gets used? (For example, age=25 in one data set and age=26 for the same ID in a different data set.)
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Within the same table, there are some duplicated in ID.
It is a good question. Is there a way to confirm for the same ID the age might be different? I assume they reported the same in the 7 data sets but need to make sure.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Well, cleaning the data will have to remain a separate step. Once you have clean data, here is a method that works regardless of the number of observations per ID in each source.
data all7;
set source1 source2 source3 source4 source5 source6 source7;
run;
proc sort data=all7;
by ID;
run;
data want;
update all7 (obs=0) all7;
by ID;
run;
The major advantage of this approach (and avoiding coalesce) is that you don't need to know the names of all the variables, and you don't need any sort of code to address each variable by name. But this won't resolve conflicts such as different ages in different sources. It will merely take the last nonmissing value that it locates.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content