SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Tricky issue trying to get rid of "duplicates"...

Accepted Solution Solved
Reply
Contributor
Posts: 38
Accepted Solution

Tricky issue trying to get rid of "duplicates"...

Hi all, I'm working on a de-duplication of a dataset and have run into a quandary. Here's how the data looks, enough to get the idea anyway:

Comp # ID#     name           ID#2       name2

1              24      Dave             26          Dave

2              26      Dave             24          Dave

etc.

The point of this dataset is to investigate whether 2 distinct IDs are actually the same person.

There are thousands of rows. Essentially, what I'm looking for is a way to remove that 2nd line (comp# 2). It should be removed because the comparison between ID 24 and 26 has already been made -- in comp# 1. The tricky part (for me anyway!) is that in that first case, the id# 24 was in the "id#" field, and 26 was in the id#2 field. So somehow I need to make a comparison across id1 and id2, so that who ever is checking out these potential duplicates does not investigate the same pair twice, even though the ID#s are in different places (it's the same comparison).

I can think of how to do this in Excel where I can check values against a range, but in SAS, no clue.. any help would be greatly appreciated!

Thank you.


Accepted Solutions
Solution
‎11-13-2013 02:41 PM
Respected Advisor
Posts: 3,799

Re: Tricky issue trying to get rid of "duplicates"...

Use operator.

data have;
   input comp   (ID1 ID2) ($);
   *concvar = compress(min(id1,id2) ||"|"|| max(id1,id2));
  
length convar $11;
   convar = catx(
'|',id1 min id2,id1 max id2);
   cards;
1               a1234    b1235
2               b1235    a1234
;;;;
   run;
proc print;
  
run;
proc sort data=have nodupkey;
  
by convar;
   run;
proc print;
  
run;

View solution in original post


All Replies
Contributor
Posts: 38

Re: Tricky issue trying to get rid of "duplicates"...

Nevermind, just figured it out. I'll simply create a concat variable assembled according to the min and max of the two ID values. That will put two pairs in the same order each time, then I'll sort and keep the first only....

PROC Star
Posts: 7,468

Re: Tricky issue trying to get rid of "duplicates"...

Yes, that sounds like it will work!  If it does, mark your question as "answered" so that others don't spend time trying to come up with the same solution.  Better yet, post your code and simply mark your own response as being the correct answer.

Contributor
Posts: 38

Re: Tricky issue trying to get rid of "duplicates"...

Actually, I spoke too soon. Many of my ID variables have letters in them, so SAS will not process the MIN or MAX parts.

Is there a function to alpha sort two strings (or #s), then select the first or last of that sorting?

Thanks.

Respected Advisor
Posts: 3,799

Re: Tricky issue trying to get rid of "duplicates"...

It will make it easier if you give more complete example of your data.  Include have and want.

Contributor
Posts: 38

Re: Tricky issue trying to get rid of "duplicates"...

Posted in reply to data_null__

Hope this makes it clearer:

Have:

comp#       ID1        ID2

1               a1234    b1235

2               b1235    a1234

Want:

comp#       ID1        ID2

1               a1234    b1235

The 2nd record has been removed, because the pair (a1234 and b1234) already exist, despite those values existing in different variables (id1 and id2).

if the ID1 and ID2 were always numeric (e.g. 1234 and 1235), I would make a concatenated variable of

concvar = compress(min(id1,id2) ||"|"|| max(id1,id2));

Which would result in both of the original rows having the concvar of "1234|1235". I would then sort by this, and keep only the first concvar.

But since my ID1 and ID2 vars have alphabetical characters in them, I cannot use MIN and MAX. Ideally I'd like a function that would compute "min" as the first element in an alphabetically sorted set.

Solution
‎11-13-2013 02:41 PM
Respected Advisor
Posts: 3,799

Re: Tricky issue trying to get rid of "duplicates"...

Use operator.

data have;
   input comp   (ID1 ID2) ($);
   *concvar = compress(min(id1,id2) ||"|"|| max(id1,id2));
  
length convar $11;
   convar = catx(
'|',id1 min id2,id1 max id2);
   cards;
1               a1234    b1235
2               b1235    a1234
;;;;
   run;
proc print;
  
run;
proc sort data=have nodupkey;
  
by convar;
   run;
proc print;
  
run;
Contributor
Posts: 38

Re: Tricky issue trying to get rid of "duplicates"...

Posted in reply to data_null__

Thank you, data_null_;!! It works perfectly.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 524 views
  • 0 likes
  • 3 in conversation