BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
wlierman
Lapis Lazuli | Level 10

I have been working on a merge of two data sets: the number of obs are

Orpheus_one_for_matching = 1,454,515  originally  1,817,358 obs

Orpheus_for_matching  = 299,088 originally  326,329 obs

 

I created a concatenate common variable as shown

Data SASONE.ORPHEUS_ONE_For_Matching;
  Set SASONE.ORPHEUS_ONE_For_Matching;
  Common_ID_4 = CATS(LastNm,FirstNm);
run;

Data SASONE1.ORPHEUS_For_Matching;
  Set SASONE1.ORPHEUS_For_Matching;
  Common_ID_4 = CATS(LastNm,FirstNm);
run;

I merged the two datasets on the common_id_4

Data SASONE.ORPOne_AllVarMerge_5A (drop = FirstMerge);
  Merge SASONE.ORPHEUS_ONE_For_Matching /*(In=In2)*/ 
        SASONE1.ORPHEUS_For_Matching;   *(In=In1);  
  by Common_ID_4;
                                        *If In1 then output;
                                        *In1 = In2;
 If FirstNm = 'ANONYMOUS' then delete;
 Else If FirstNm = 'HIV' then delete;
run;

There were still duplicates so I did a proc sort and used the last.common_id_4

proc sort data = SASONE.ORPOne_AllVarMerge_5A NODUPKEYS;
  by DOB;
run;

Data SASONE.ORPOne_AllVarMerge_5_Keep;
  Set SASONE.ORPOne_AllVarMerge_5;
  by Common_ID_4;
  If Last.Common_ID_4
  Then output;
run;

Which resulted in 1,694,385 in the combined data set.

 

I suspect there are duplicates so I ran the following

Proc sql noprint;
CREATE TABLE SASONE.ORPOne_AllVarDups_5_Check AS
  SELECT 'SASONE.ORPHEUS_ONE_For_Matching' As Dataset,
         Count(Distinct Common_ID_4) as Ndistinct,
         Count(*) as N
  From SASONE.ORPHEUS_ONE_For_Matching
  Outer Union Corresponding
  SELECT 'SASONE1.ORPHEUS_For_Matching' As Dataset,
         Count(Distinct Common_ID_4) as Ndistinct,
         Count(*) as N
  From SASONE1.ORPHEUS_For_Matching
  Outer Union Corresponding
  SELECT 'SASONE.ORPOne_AllVarMerge_5A' As Dataset,
         Count(Distinct Common_ID_4) as Ndistinct,
         Count(*) as N
  From (SELECT Common_ID_4 
        From SASONE.ORPHEUS_ONE_For_Matching
        Outer Union Corresponding
		SELECT Common_ID_4
        From SASONE1.ORPHEUS_For_Matching);
quit;

The next step was to print the duplicate obs

Proc sql noprint;
CREATE TABLE SASONE.DUPS_EXAMINE AS
   SELECT Common_ID_4
   FROM (SELECT Common_ID_4
         FROM SASONE.ORPHEUS_ONE_For_Matching
		 OUTER UNION CORRESPONDING
		 SELECT Common_ID_4
		 FROM SASONE1.ORPHEUS_For_Matching)
   GROUP BY Common_ID_4
   HAVING Count(Common_ID_4) > 1;
quit;
/* 229,206  obs  6.14.2021 */

The total dups still left in the dataset total 229,206 with a portion coming from each individual data set.

 

My question is how can I delete the dups from the data set with 1,694,385 obs.

 

Nothing seems to work the last.var method; using proc sort and NODUPKEY.

 

What would you propose as a solution, so I can provide a dataset without those remaining duplicates?

 

Thank you.

 

wklierman

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

First, get rid of COMMON_ID.  SAS knows how to sort and merge by multiple variables.  You might start with:

proc sort data=sasone.orpheus_one_for_matching nodupkey out=sorted_one;
   by LastNm FirstNm;
   where FirstNm not in ('HIV', 'ANONYMOUS');
run;
   

Then similarly for the second data set:

proc sort data=sasone.orpheus_for_matching out=sorted_for nodupkey;
   by LastNm FirstNm;
   where FirstNm not in ('HIV', 'ANONYMOUS');
run;

Now the individual data sets will have no duplicates remaining.  But you have to decide what to do if the same LastNm FirstNm combination appears in both of the data sets.  Do you want to take the values from one particular data set or the other?  Do you want a blend of the values from the two data sets?  The programming won't be that difficult, but you have to make the decision first.

View solution in original post

3 REPLIES 3
jimbarbour
Meteorite | Level 14
Your sort appears to be by DOB. Try changing it to Common_ID_4.

Also, pay close attention to your dataset names. Make sure that the output from one step is applied with the correct name in any subsequent steps.

Jim
Astounding
PROC Star

First, get rid of COMMON_ID.  SAS knows how to sort and merge by multiple variables.  You might start with:

proc sort data=sasone.orpheus_one_for_matching nodupkey out=sorted_one;
   by LastNm FirstNm;
   where FirstNm not in ('HIV', 'ANONYMOUS');
run;
   

Then similarly for the second data set:

proc sort data=sasone.orpheus_for_matching out=sorted_for nodupkey;
   by LastNm FirstNm;
   where FirstNm not in ('HIV', 'ANONYMOUS');
run;

Now the individual data sets will have no duplicates remaining.  But you have to decide what to do if the same LastNm FirstNm combination appears in both of the data sets.  Do you want to take the values from one particular data set or the other?  Do you want a blend of the values from the two data sets?  The programming won't be that difficult, but you have to make the decision first.

wlierman
Lapis Lazuli | Level 10

Thank you for the help.  I have a dataset that merged without dups so I'm sending that forward.  I am going to work on this smaller but dup containing dataset.

 

Thanks again.

 

wklierman

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 466 views
  • 1 like
  • 3 in conversation