I have been working on a merge of two data sets: the number of obs are
Orpheus_one_for_matching = 1,454,515 originally 1,817,358 obs
Orpheus_for_matching = 299,088 originally 326,329 obs
I created a concatenate common variable as shown
Data SASONE.ORPHEUS_ONE_For_Matching;
Set SASONE.ORPHEUS_ONE_For_Matching;
Common_ID_4 = CATS(LastNm,FirstNm);
run;
Data SASONE1.ORPHEUS_For_Matching;
Set SASONE1.ORPHEUS_For_Matching;
Common_ID_4 = CATS(LastNm,FirstNm);
run;
I merged the two datasets on the common_id_4
Data SASONE.ORPOne_AllVarMerge_5A (drop = FirstMerge);
Merge SASONE.ORPHEUS_ONE_For_Matching /*(In=In2)*/
SASONE1.ORPHEUS_For_Matching; *(In=In1);
by Common_ID_4;
*If In1 then output;
*In1 = In2;
If FirstNm = 'ANONYMOUS' then delete;
Else If FirstNm = 'HIV' then delete;
run;
There were still duplicates so I did a proc sort and used the last.common_id_4
proc sort data = SASONE.ORPOne_AllVarMerge_5A NODUPKEYS;
by DOB;
run;
Data SASONE.ORPOne_AllVarMerge_5_Keep;
Set SASONE.ORPOne_AllVarMerge_5;
by Common_ID_4;
If Last.Common_ID_4
Then output;
run;
Which resulted in 1,694,385 in the combined data set.
I suspect there are duplicates so I ran the following
Proc sql noprint;
CREATE TABLE SASONE.ORPOne_AllVarDups_5_Check AS
SELECT 'SASONE.ORPHEUS_ONE_For_Matching' As Dataset,
Count(Distinct Common_ID_4) as Ndistinct,
Count(*) as N
From SASONE.ORPHEUS_ONE_For_Matching
Outer Union Corresponding
SELECT 'SASONE1.ORPHEUS_For_Matching' As Dataset,
Count(Distinct Common_ID_4) as Ndistinct,
Count(*) as N
From SASONE1.ORPHEUS_For_Matching
Outer Union Corresponding
SELECT 'SASONE.ORPOne_AllVarMerge_5A' As Dataset,
Count(Distinct Common_ID_4) as Ndistinct,
Count(*) as N
From (SELECT Common_ID_4
From SASONE.ORPHEUS_ONE_For_Matching
Outer Union Corresponding
SELECT Common_ID_4
From SASONE1.ORPHEUS_For_Matching);
quit;
The next step was to print the duplicate obs
Proc sql noprint;
CREATE TABLE SASONE.DUPS_EXAMINE AS
SELECT Common_ID_4
FROM (SELECT Common_ID_4
FROM SASONE.ORPHEUS_ONE_For_Matching
OUTER UNION CORRESPONDING
SELECT Common_ID_4
FROM SASONE1.ORPHEUS_For_Matching)
GROUP BY Common_ID_4
HAVING Count(Common_ID_4) > 1;
quit;
/* 229,206 obs 6.14.2021 */
The total dups still left in the dataset total 229,206 with a portion coming from each individual data set.
My question is how can I delete the dups from the data set with 1,694,385 obs.
Nothing seems to work the last.var method; using proc sort and NODUPKEY.
What would you propose as a solution, so I can provide a dataset without those remaining duplicates?
Thank you.
wklierman
... View more