DATA Step, Macro, Functions and more

Recursive lookup for ID's

Accepted Solution Solved
Reply
Regular Contributor
Posts: 190
Accepted Solution

Recursive lookup for ID's

[ Edited ]

I have a dataset like this:

ID1 ID2

A1 A2
A2 A3
A4 A5
A6 A7
A7 A8
A1 A9

 

I want an output dataset like:

ID Clus

A1 1

A2 1

A3 1

A9 1

A4 2

A5 2

A6 3

A7 3

A8 3

 

Basically I want to cluster all the mapped IDs into one cluster. I tried unsuccessfully with self join.

 

Any ideas?


Accepted Solutions
Solution
‎04-04-2016 03:57 AM
Regular Contributor
Posts: 190

Re: How to perform this in sas?

Posted in reply to FreelanceReinhard
Can you please explain? How does this work?

data t1;
set &dsin.2;
by &v1;
g+first.&v1;
run;

And the purpose of it?

View solution in original post


All Replies
Super User
Super User
Posts: 7,942

Re: How to perform this in sas?

Posted in reply to munitech4u

Sorry, I am not clear on your logic here at all.  Your Have dataset doesn't seem to reflect your out dataset in any way.  Why is there only one A1 in your output dataset?  Why is A9 after A3.  What is Clus, and how is it derived?  To get a unique list:

 

data want;  
  set have (keep=id1 rename=(id1=id))
       have (keep=id2 rename=(id2=id));
run;
proc sort data=want nodupkey;
  by id;
run;

Note, I haven't tested the above - post test data in the forma of a datastep so I don't have to type it in to get tested code.  The above will give you a distinct list of id's.

Regular Contributor
Posts: 190

Re: How to perform this in sas?

clus is nothing but a category we need to assing for related IDs
Super User
Posts: 19,772

Re: How to perform this in sas?

[ Edited ]
Posted in reply to munitech4u

This is a chained or recursive lookup. Do you have SAS OR licensed? If so, PROC BOM can be useful.

Trusted Advisor
Posts: 1,117

Re: How to perform this in sas?

Posted in reply to munitech4u

Hello @munitech4u,

 

That's interesting! Last week the same sort of question was asked by someone else. Ksharp and I have provided two different solutions on 24th and 25th March, respectively. Please see this thread and don't hesitate if you have questions (regarding my solution, I haven't analyzed Ksharp's yet).

Regular Contributor
Posts: 190

Re: How to perform this in sas?

Posted in reply to FreelanceReinhard
Mine, is even more complex, as the relationship is not just one way, it can be two way. So like in the example give in other solution, the player id can occur in assignment id and vice versa.
Super User
Posts: 19,772

Re: How to perform this in sas?

Posted in reply to munitech4u
Regular Contributor
Posts: 190

Re: How to perform this in sas?

[ Edited ]

Yea, Seem like I need to go through it. The solution is based on the lag value which, is not the case with me. It does not necessarily have a relationship with lag.

Trusted Advisor
Posts: 1,117

Re: How to perform this in sas?

Posted in reply to munitech4u

@munitech4u: I'm already in the process of adapting my macro ITER to your needs. (It was really interesting to develop it.)

Regular Contributor
Posts: 190

Re: How to perform this in sas?

Posted in reply to FreelanceReinhard
Cool, I am looking forward to it too. I have posted a logic solution, but not sure, as of now, how to implement it!
Trusted Advisor
Posts: 1,117

Re: How to perform this in sas?

Posted in reply to munitech4u

Here's an adapted version ITER4U of my previous macro ITER. I haven't tested it thoroughly yet, but at least it produces the correct result on your test data. Maybe it's unnecessarily complicated and it definitely could be polished here and there because it's an adaptation of code which was developed for a slightly different task.

 

data have;
input ID1 $ ID2 $;
cards;
A1 A2
A2 A3
A3 A1
A4 A5
A6 A7
A7 A8
A1 A9
;

/* Macro to perform the iterative algorithm */

%macro iter4u(dsin=have  /* input dataset                 */
           , dsout=want  /* output dataset                */
           , v1=id1      /* variable name for first IDs   */
           , v2=id2      /* variable name for second IDs  */
           );
%local i n1 n2;

data &dsin.2;
set &dsin
    &dsin(rename=(&v1=&v2 &v2=&v1));
run;

proc sort data=&dsin.2;
by &v1;
run;

%let i=1;

/* Assign initial group numbers g */

data t1;
set &dsin.2;
by &v1;
g+first.&v1;
run;

proc sql noprint;        /* The sum of all group numbers */
select sum(g) into :n1   /* is used to detect changes    */
from t1;                 /* after reassigning groups.    */
quit;

%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;

%do %until(&n1=&n2);
  %let i=%eval(3-&i); /* toggle i between 1 and 2 */

  proc sql noprint;         /* If one ID2 has been assigned      */
  create table t&i as       /* multiple group numbers, they are  */
  select &v1, &v2, min(g) as g  /* replaced by their minimum.    */
  from t%eval(3-&i)         /* In the next iteration the same    */
  group by &&v&i;           /* procedure is applied to ID1       */
                            /* rather than ID2.                  */
  select sum(g) into :n&i   /* And so on, alternating, until     */
  from t&i;                 /* no reassignments occur, i.e., t1  */
  quit;                     /* and t2 are equal up to sort order */
%end;                       /* and hence &n1=&n2.                */

proc sql;                   /* Bring ID1 and ID2 together        */
create table uni as       
select g, &v1 as ID from t1
union
select g, &v2 as ID from t1
quit;

proc sql noprint;           /* Assign preliminary cluster numbers */
create table &dsout.0 as 
select ID, min(g) as g
from uni
group by ID;  
quit;

/* Create result dataset */

proc sort data=&dsout.0;
by g ID;
run;

data &dsout;
set &dsout.0;
by g;
Clus+first.g;              /* Renumber the clusters as 1, 2, 3, ... */
drop g;
run;
%mend iter4u;

%iter4u;

proc print data=want noobs;
run;
Solution
‎04-04-2016 03:57 AM
Regular Contributor
Posts: 190

Re: How to perform this in sas?

Posted in reply to FreelanceReinhard
Can you please explain? How does this work?

data t1;
set &dsin.2;
by &v1;
g+first.&v1;
run;

And the purpose of it?
Trusted Advisor
Posts: 1,117

Re: How to perform this in sas?

Posted in reply to munitech4u

This step is based on the intermediate dataset HAVE2 (&dsin.2) which contains all ID1-ID2 pairs and ID2-ID1 pairs (but with the variable names swapped) of the original input dataset HAVE (&dsin). I newly introduced this dataset into the macro to reflect the fact that in your application ID1 and ID2 are actually taken from a single set of IDs. This dataset was sorted by ID1 (&v1) in the previous step.

 

Now the first ID1 BY group is assigned g=1, the second is assigned g=2 and so on. (The sum statement g+... increments g by 1 whenever a new ID1 BY group starts, i.e., if FIRST.ID1=1.) These group numbers g can be regarded as preliminary cluster numbers. In the subsequent iterative steps these numbers are changed (replaced by a minimum group number) if it turns out that one ID has been assigned different group numbers (due to links to different other IDs). This process is iterated until no further renumberings occur.

Regular Contributor
Posts: 190

Re: How to perform this in sas?

Posted in reply to FreelanceReinhard
A1 A2 1
A1 A9 1
A2 A3 2
A2 A1 2
A3 A2 1
A4 A5 4
A5 A4 5
A6 A7 6
A7 A8 7
A7 A6 7
A8 A7 6
A9 A1 2

But in the dataset g is like above. Why the 4th and 5th obs are 1 and 4?
Super User
Posts: 19,772

Re: How to perform this in sas?

[ Edited ]
Posted in reply to munitech4u

Here's the subgraph macro solution courtesy of @PGStats

 

It uses hash tables, which I'm not even going to pretend to understand. The cluster ID's are not the same, but the clusters are.

 

The forum or Chrome is messing with the code though, you need to replace the ampersandcolonsemicolon with a single colon to make it work. 

 

/*
The SubGraphs macro

This SAS macro finds the disjoint subgraphs in a graph described by a set of arcs.
The input dataset requires only two variables of the same length and type, character
or numeric. Each observation describes an arc in the graph. The output dataset will
contain two variables: Node will be similar in size and type to the input variables
and Clust will be a number identifying the clusters.

Macro arguments

arcs : input dataset name
from : first arc variable (optional, default=from)
to : second arc variable (optional, default=to)
out : output dataset name (optional, default=Clusters)
exp : power of two exponent of internal hash size (optional, default=8)

Tested with SAS version 9.3

PGStats

*/

/* Example : From a list of American city pairs separated by less than 800 miles,
we identify two regions where one could get from one city to the next without
ever travelling more than 800 miles. */

/*Example data view: convert city distance matrix to city pairs list.
Keep nearby cities (<800 miles apart) only. */
/*
data Nearby(keep=from to Miles) / view=Nearby;
set Sashelp.Mileages;
length from to $15;
array c{*} _NUMERIC_;
from = upcase(compress(City,". -"));
do i = 1 to dim(c);
	if c{i} > 0 and c{i} < 800 then do;
		to = upcase(vname(c{i}));
		Miles = c{i};
		output;
	end;
end;
run;
*/
	
/* Macro definition */
%macro SubGraphs(arcs,from=from,to=to,out=Clusters,exp=8);
data _null_;
if 0 then set &arcs(keep=&from rename=(&from=node)); /* get node data type */
length clust 8;
declare hash nodes(hashexp:&exp);
nodes.defineKey('node');
nodes.defineData('node', 'clust');
nodes.defineDone();
declare hiter nodeList('nodes');

do newClust = 1 by 1 while(not endLoop);
	set &arcs end=endLoop;
	call missing(clust); node = &from;
	if 0^=nodes.find() then nodes.add(); 
	fromClust = clust; 
	call missing(clust); node = &to;
	if 0^=nodes.find() then nodes.add(); 
	toClust = clust;
	if n(fromClust, toClust) = 0 then do;
		nodes.replace(key:&from, data&colon;&from, data&colon;newClust);
		nodes.replace(key:&to, data&colon;&to, data&colon;newClust);
	end;
	else if missing(toClust) then 
		nodes.replace(key:&to, data&colon;&to, data&colon;fromClust);
	else if missing(fromClust) then 
		nodes.replace(key:&from, data&colon;&from, data&colon;toClust);
	else if fromClust ne toClust then do;
		rc = nodeList.first();
		do while (rc = 0);
			if clust = fromClust then 
				nodes.replace(key:node, data&colon;node, data&colon;toClust);
			rc = nodeList.next();
		end;
	end;
end;
nodes.output(dataset:"&out");
stop;
run;
%mend SubGraphs;

/* Example cont'd. Call the macro with the city pairs list dataview 
and print the results. */
/*
options mprint symbolgen;
%SubGraphs(Nearby,out=Regions,exp=4);

proc sql;
select clust as Region, node as City 
from Regions order by clust, node;
quit; 
*/



data have;

input ID1 $ ID2 $;
cards;
A1 A2
A2 A3
A4 A5
A6 A7
A7 A8
A1 A9
;
run;

%SubGraphs(have,from=id1,to=id2,out=Clusters,exp=8);

proc sort data=clusters;
by clust node;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 24 replies
  • 536 views
  • 5 likes
  • 5 in conversation