Solved: Eliminating all duplicates

wlierman · Posted 06-15-2021 01:40 AM

I have been working on a merge of two data sets: the number of obs are

Orpheus_one_for_matching = 1,454,515 originally 1,817,358 obs

Orpheus_for_matching = 299,088 originally 326,329 obs

I created a concatenate common variable as shown

Data SASONE.ORPHEUS_ONE_For_Matching;
  Set SASONE.ORPHEUS_ONE_For_Matching;
  Common_ID_4 = CATS(LastNm,FirstNm);
run;

Data SASONE1.ORPHEUS_For_Matching;
  Set SASONE1.ORPHEUS_For_Matching;
  Common_ID_4 = CATS(LastNm,FirstNm);
run;

I merged the two datasets on the common_id_4

Data SASONE.ORPOne_AllVarMerge_5A (drop = FirstMerge);
  Merge SASONE.ORPHEUS_ONE_For_Matching /*(In=In2)*/ 
        SASONE1.ORPHEUS_For_Matching;   *(In=In1);  
  by Common_ID_4;
                                        *If In1 then output;
                                        *In1 = In2;
 If FirstNm = 'ANONYMOUS' then delete;
 Else If FirstNm = 'HIV' then delete;
run;

There were still duplicates so I did a proc sort and used the last.common_id_4

proc sort data = SASONE.ORPOne_AllVarMerge_5A NODUPKEYS;
  by DOB;
run;

Data SASONE.ORPOne_AllVarMerge_5_Keep;
  Set SASONE.ORPOne_AllVarMerge_5;
  by Common_ID_4;
  If Last.Common_ID_4
  Then output;
run;

Which resulted in 1,694,385 in the combined data set.

I suspect there are duplicates so I ran the following

Proc sql noprint;
CREATE TABLE SASONE.ORPOne_AllVarDups_5_Check AS
  SELECT 'SASONE.ORPHEUS_ONE_For_Matching' As Dataset,
         Count(Distinct Common_ID_4) as Ndistinct,
         Count(*) as N
  From SASONE.ORPHEUS_ONE_For_Matching
  Outer Union Corresponding
  SELECT 'SASONE1.ORPHEUS_For_Matching' As Dataset,
         Count(Distinct Common_ID_4) as Ndistinct,
         Count(*) as N
  From SASONE1.ORPHEUS_For_Matching
  Outer Union Corresponding
  SELECT 'SASONE.ORPOne_AllVarMerge_5A' As Dataset,
         Count(Distinct Common_ID_4) as Ndistinct,
         Count(*) as N
  From (SELECT Common_ID_4 
        From SASONE.ORPHEUS_ONE_For_Matching
        Outer Union Corresponding
		SELECT Common_ID_4
        From SASONE1.ORPHEUS_For_Matching);
quit;

The next step was to print the duplicate obs

Proc sql noprint;
CREATE TABLE SASONE.DUPS_EXAMINE AS
   SELECT Common_ID_4
   FROM (SELECT Common_ID_4
         FROM SASONE.ORPHEUS_ONE_For_Matching
		 OUTER UNION CORRESPONDING
		 SELECT Common_ID_4
		 FROM SASONE1.ORPHEUS_For_Matching)
   GROUP BY Common_ID_4
   HAVING Count(Common_ID_4) > 1;
quit;
/* 229,206  obs  6.14.2021 */

The total dups still left in the dataset total 229,206 with a portion coming from each individual data set.

My question is how can I delete the dups from the data set with 1,694,385 obs.

Nothing seems to work the last.var method; using proc sort and NODUPKEY.

What would you propose as a solution, so I can provide a dataset without those remaining duplicates?

Thank you.

wklierman

Astounding · Posted 06-15-2021 07:48 AM

First, get rid of COMMON_ID. SAS knows how to sort and merge by multiple variables. You might start with:

proc sort data=sasone.orpheus_one_for_matching nodupkey out=sorted_one;
   by LastNm FirstNm;
   where FirstNm not in ('HIV', 'ANONYMOUS');
run;

Then similarly for the second data set:

proc sort data=sasone.orpheus_for_matching out=sorted_for nodupkey;
   by LastNm FirstNm;
   where FirstNm not in ('HIV', 'ANONYMOUS');
run;

Now the individual data sets will have no duplicates remaining. But you have to decide what to do if the same LastNm FirstNm combination appears in both of the data sets. Do you want to take the values from one particular data set or the other? Do you want a blend of the values from the two data sets? The programming won't be that difficult, but you have to make the decision first.

View solution in original post

jimbarbour · Posted 06-15-2021 01:52 AM

Your sort appears to be by DOB. Try changing it to Common_ID_4.

Also, pay close attention to your dataset names. Make sure that the output from one step is applied with the correct name in any subsequent steps.

Jim

Astounding · Posted 06-15-2021 07:48 AM

First, get rid of COMMON_ID. SAS knows how to sort and merge by multiple variables. You might start with:

proc sort data=sasone.orpheus_one_for_matching nodupkey out=sorted_one;
   by LastNm FirstNm;
   where FirstNm not in ('HIV', 'ANONYMOUS');
run;

Then similarly for the second data set:

proc sort data=sasone.orpheus_for_matching out=sorted_for nodupkey;
   by LastNm FirstNm;
   where FirstNm not in ('HIV', 'ANONYMOUS');
run;

Now the individual data sets will have no duplicates remaining. But you have to decide what to do if the same LastNm FirstNm combination appears in both of the data sets. Do you want to take the values from one particular data set or the other? Do you want a blend of the values from the two data sets? The programming won't be that difficult, but you have to make the decision first.

wlierman · Posted 06-15-2021 09:42 AM

Thank you for the help. I have a dataset that merged without dups so I'm sending that forward. I am going to work on this smaller but dup containing dataset.

Thanks again.

wklierman

Eliminating all duplicates

Re: Eliminating all duplicates

Re: Eliminating all duplicates

Re: Eliminating all duplicates

Re: Eliminating all duplicates

Eliminating all duplicates

Re: Eliminating all duplicates

Re: Eliminating all duplicates

Re: Eliminating all duplicates

Re: Eliminating all duplicates

Click image to register for webinar

Classroom Training Available!