BookmarkSubscribeRSS Feed
rsva
Fluorite | Level 6

Greetings,

 

I have a data similar to below

 

id1             id2

--------------------

ds246       jk900

kl78          gh456

jk900        ds246

ko000       rt234

gh456      kl789

nm879     th45

 

If you look closely, 1st and 3rd records are the same pair, but the ids are flipped. I need to identify such pairs to deduplicate the data. I mean instead of counting the pair of ids twice, I need to count them once. Any guidance would help.

 

Thanks.

rsva  

6 REPLIES 6
Astounding
PROC Star

Assuming that you are free to play with the data and the program completes in a reasonable amount of time, here's one way.

 

data want;

set have;

if id1 < id2 then do;

   id1_sorted = id1;

   id2_sorted = id2;

end;

else do;

  id1_sorted = id2;

  id2_sorted = id1;

end;

run;

 

proc sort data=want NODUPKEY;

  by id1_sorted id2_sorted;

run;

 

Optionally, you could then drop id1_sorted and id2_sorted;

Patrick
Opal | Level 21

Just another way of doing it using a hash table. This code assumes that you want to keep only a single row per unique combination of ID1 & ID2.

 

data have;
input id1 $ id2 $;
datalines;
ds246 jk900
kl78 gh456
jk900 ds246
ko000 rt234
gh456 kl789
nm879 th45
;
run;

data want;
set have;
if _n_=1 then
do;
dcl hash h1();
h1.defineKey('id1','id2');
h1.defineDone();
end;
if h1.check() ne 0 then
do;
output;
h1.add();
h1.add(key:id2,key:id1, data&colon;id2, data&colon;id1);
end;
run;

Reeza
Super User

How big is your data?
A simple way is to create two new variables and use a proc sql with distinct or proc sort with no dupkey.

 

/*Set up fake data*/
data have;
input id1  $ id2 $;
cards;
ds246       jk900
kl78          gh456
jk900        ds246
ko000       rt234
gh456      kl789
nm879     th45
;

/*Assign to new variables in fixed order*/
data want;
set have;
if id1<id2 then do;
	id3=id1;
	id4=id2;
end;
else do;
id3=id2;
id4=id1;
end;
run;

*Unique via SQL;
proc sql; 
create table want_sql as 
select distinct id3, id4
from want
order by id3, id4;
quit;

*unique records via Sort;
Proc sort data=want (keep=id3 id4) out=want_sort nodupkey;
by id3 id4;
run;

 

MikeZdeb
Rhodochrosite | Level 12

Hi, another idea ...

 

data x;
input id1 $ id2 $;
cards;
ds246 jk900
kl78  gh456
jk900 ds246
ko000 rt234
gh456 kl789
nm879 th45
;

 

data y;
set x;
y1 = id1;
y2 = id2;
array y(2) ;
call sortc(of y(*));
run;

 

proc sort data=y out=y (drop=y: ) nodupkey;
by y: ;

run;

 

data set y ...

 

Obs     id1      id2

 1     ds246    jk900
 2     kl78     gh456
 3     gh456    kl789
 4     ko000    rt234
 5     nm879    th45

 

If changing the order of ID1 and ID2 doesn't matter, simpler code ...

 

data y;
set x;
array id(2) ;
call sortc(of id(*));
run;

 

proc sort data=y nodupkey;
by id: ;
run;

hbi
Quartz | Level 8 hbi
Quartz | Level 8

[edited to change MAX() statement to equality test within CASE WHEN ... ]

 

Hi,

 

This PROC SQL solution works well with pairs, since a pair of values will always have a minimum and maximum.

 

If one were dealing with n-tuples (id1, id2, id3, etc.), then a different solution would be needed. 

 

-- hbi Robot Happy

 

PROC SQL;
    CREATE TABLE want AS 
    SELECT DISTINCT CASE WHEN id1 <= id2 THEN id1 ELSE id2 END AS id1_fixed
                  , CASE WHEN id1 <  id2 THEN id2 ELSE id1 END AS id2_fixed
    FROM have;
QUIT;

PROC SQL;
    SELECT COUNT(*) AS num_distinct_obs
    FROM want;
QUIT;

 

hbi
Quartz | Level 8 hbi
Quartz | Level 8

This version is just for fun. It creates a hash for each pair of IDs. This version is only useful if you want a count of the number of distinct ID pairs. 

 

PROC SQL;
  /* create unique MD5 hash based on ID pair */
  CREATE TABLE id_pairs_hashed AS 
  SELECT CASE WHEN id1 < id2 THEN PUT(MD5(CATS(id1, "|", id2)), $hex32.)
              ELSE                PUT(MD5(CATS(id2, "|", id1)), $hex32.)
         END AS id_pair_md5 length=32
  FROM have;
QUIT;


PROC SQL;
  SELECT COUNT(DISTINCT id_pair_md5) AS num_distinct_obs
  FROM id_pairs_hashed;
QUIT;

 

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 941 views
  • 0 likes
  • 6 in conversation