I have been working on a merge of two data sets: the number of obs are
Orpheus_one_for_matching = 1,454,515 originally 1,817,358 obs
Orpheus_for_matching = 299,088 originally 326,329 obs
I created a concatenate common variable as shown
Data SASONE.ORPHEUS_ONE_For_Matching; Set SASONE.ORPHEUS_ONE_For_Matching; Common_ID_4 = CATS(LastNm,FirstNm); run; Data SASONE1.ORPHEUS_For_Matching; Set SASONE1.ORPHEUS_For_Matching; Common_ID_4 = CATS(LastNm,FirstNm); run;
I merged the two datasets on the common_id_4
Data SASONE.ORPOne_AllVarMerge_5A (drop = FirstMerge); Merge SASONE.ORPHEUS_ONE_For_Matching /*(In=In2)*/ SASONE1.ORPHEUS_For_Matching; *(In=In1); by Common_ID_4; *If In1 then output; *In1 = In2; If FirstNm = 'ANONYMOUS' then delete; Else If FirstNm = 'HIV' then delete; run;
There were still duplicates so I did a proc sort and used the last.common_id_4
proc sort data = SASONE.ORPOne_AllVarMerge_5A NODUPKEYS; by DOB; run; Data SASONE.ORPOne_AllVarMerge_5_Keep; Set SASONE.ORPOne_AllVarMerge_5; by Common_ID_4; If Last.Common_ID_4 Then output; run;
Which resulted in 1,694,385 in the combined data set.
I suspect there are duplicates so I ran the following
Proc sql noprint; CREATE TABLE SASONE.ORPOne_AllVarDups_5_Check AS SELECT 'SASONE.ORPHEUS_ONE_For_Matching' As Dataset, Count(Distinct Common_ID_4) as Ndistinct, Count(*) as N From SASONE.ORPHEUS_ONE_For_Matching Outer Union Corresponding SELECT 'SASONE1.ORPHEUS_For_Matching' As Dataset, Count(Distinct Common_ID_4) as Ndistinct, Count(*) as N From SASONE1.ORPHEUS_For_Matching Outer Union Corresponding SELECT 'SASONE.ORPOne_AllVarMerge_5A' As Dataset, Count(Distinct Common_ID_4) as Ndistinct, Count(*) as N From (SELECT Common_ID_4 From SASONE.ORPHEUS_ONE_For_Matching Outer Union Corresponding SELECT Common_ID_4 From SASONE1.ORPHEUS_For_Matching); quit;
The next step was to print the duplicate obs
Proc sql noprint; CREATE TABLE SASONE.DUPS_EXAMINE AS SELECT Common_ID_4 FROM (SELECT Common_ID_4 FROM SASONE.ORPHEUS_ONE_For_Matching OUTER UNION CORRESPONDING SELECT Common_ID_4 FROM SASONE1.ORPHEUS_For_Matching) GROUP BY Common_ID_4 HAVING Count(Common_ID_4) > 1; quit; /* 229,206 obs 6.14.2021 */
The total dups still left in the dataset total 229,206 with a portion coming from each individual data set.
My question is how can I delete the dups from the data set with 1,694,385 obs.
Nothing seems to work the last.var method; using proc sort and NODUPKEY.
What would you propose as a solution, so I can provide a dataset without those remaining duplicates?
Thank you.
wklierman
First, get rid of COMMON_ID. SAS knows how to sort and merge by multiple variables. You might start with:
proc sort data=sasone.orpheus_one_for_matching nodupkey out=sorted_one;
by LastNm FirstNm;
where FirstNm not in ('HIV', 'ANONYMOUS');
run;
Then similarly for the second data set:
proc sort data=sasone.orpheus_for_matching out=sorted_for nodupkey;
by LastNm FirstNm;
where FirstNm not in ('HIV', 'ANONYMOUS');
run;
Now the individual data sets will have no duplicates remaining. But you have to decide what to do if the same LastNm FirstNm combination appears in both of the data sets. Do you want to take the values from one particular data set or the other? Do you want a blend of the values from the two data sets? The programming won't be that difficult, but you have to make the decision first.
First, get rid of COMMON_ID. SAS knows how to sort and merge by multiple variables. You might start with:
proc sort data=sasone.orpheus_one_for_matching nodupkey out=sorted_one;
by LastNm FirstNm;
where FirstNm not in ('HIV', 'ANONYMOUS');
run;
Then similarly for the second data set:
proc sort data=sasone.orpheus_for_matching out=sorted_for nodupkey;
by LastNm FirstNm;
where FirstNm not in ('HIV', 'ANONYMOUS');
run;
Now the individual data sets will have no duplicates remaining. But you have to decide what to do if the same LastNm FirstNm combination appears in both of the data sets. Do you want to take the values from one particular data set or the other? Do you want a blend of the values from the two data sets? The programming won't be that difficult, but you have to make the decision first.
Thank you for the help. I have a dataset that merged without dups so I'm sending that forward. I am going to work on this smaller but dup containing dataset.
Thanks again.
wklierman
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.