BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
catnipper
Calcite | Level 5

I have several duplicates that don't look like duplicates to SAS but I know they are. (They were determined by fuzzy matching.) The only way to know they are associated is by looking at their clustering. A is related to B, and B to E and E to A. 

I need to ID each cluster, so I can run code by ClusterID. The next step will be to remove some records from each cluster based on additional requirements.

 

This is what I have:

 

data have;
input left $ right $;
cards;
A	B
B	E
C	D
D	C
E	A
F	G 
;
run;
proc print data=have;run; 

 

I want to create any unique ID per row, a count is fine but it doesnt need to be consecutive.

I need the data to look like this:

 

Want:
      left     right      ClusterID
	A	  B		 1
	B	  E		 1
	C	  D		 2
	D	  C		 2
	E	  A		 1
	F	  G 	         3

Any ideas?

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
data have(rename=(left=from right=to));
infile cards expandtabs truncover;
input left $ right $;
cards;
A	B
B	E
C	D
D	C
E	A
F	G 
;
run;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

View solution in original post

3 REPLIES 3
Ksharp
Super User
data have(rename=(left=from right=to));
infile cards expandtabs truncover;
input left $ right $;
cards;
A	B
B	E
C	D
D	C
E	A
F	G 
;
run;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

catnipper
Calcite | Level 5

Thank you so much!!

That's a great solution. I have been trying to understand it but I am struggling with some aspects. 

 

I know what it does but don't understand why. 

node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');

 

Then the making of the want is a mystery. What is going on there? It definitely worked, even with more my more complex and much larger dataset.

Ksharp
Super User

This code create full path which is used to search a TREE. For example:

 

From To

A       B

 

 

 

would exchange the position and get :

From To

A       B

B       A

 

 

And the code generated WANT table is a long story to explain.

I have no time to explain the details to you.

If you are familiar with Hash Table, I think you could understand the mystery as long as you read it with a lot of time .

 

Good Luck.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 830 views
  • 0 likes
  • 2 in conversation