BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
munitech4u
Quartz | Level 8

I have a dataset like this:

ID1 ID2

A1 A2
A2 A3
A4 A5
A6 A7
A7 A8
A1 A9

 

I want an output dataset like:

ID Clus

A1 1

A2 1

A3 1

A9 1

A4 2

A5 2

A6 3

A7 3

A8 3

 

Basically I want to cluster all the mapped IDs into one cluster. I tried unsuccessfully with self join.

 

Any ideas?

1 ACCEPTED SOLUTION

Accepted Solutions
munitech4u
Quartz | Level 8
Can you please explain? How does this work?

data t1;
set &dsin.2;
by &v1;
g+first.&v1;
run;

And the purpose of it?

View solution in original post

24 REPLIES 24
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Sorry, I am not clear on your logic here at all.  Your Have dataset doesn't seem to reflect your out dataset in any way.  Why is there only one A1 in your output dataset?  Why is A9 after A3.  What is Clus, and how is it derived?  To get a unique list:

 

data want;  
  set have (keep=id1 rename=(id1=id))
       have (keep=id2 rename=(id2=id));
run;
proc sort data=want nodupkey;
  by id;
run;

Note, I haven't tested the above - post test data in the forma of a datastep so I don't have to type it in to get tested code.  The above will give you a distinct list of id's.

munitech4u
Quartz | Level 8
clus is nothing but a category we need to assing for related IDs
Reeza
Super User

This is a chained or recursive lookup. Do you have SAS OR licensed? If so, PROC BOM can be useful.

FreelanceReinh
Jade | Level 19

Hello @munitech4u,

 

That's interesting! Last week the same sort of question was asked by someone else. Ksharp and I have provided two different solutions on 24th and 25th March, respectively. Please see this thread and don't hesitate if you have questions (regarding my solution, I haven't analyzed Ksharp's yet).

munitech4u
Quartz | Level 8
Mine, is even more complex, as the relationship is not just one way, it can be two way. So like in the example give in other solution, the player id can occur in assignment id and vice versa.
munitech4u
Quartz | Level 8

Yea, Seem like I need to go through it. The solution is based on the lag value which, is not the case with me. It does not necessarily have a relationship with lag.

FreelanceReinh
Jade | Level 19

@munitech4u: I'm already in the process of adapting my macro ITER to your needs. (It was really interesting to develop it.)

munitech4u
Quartz | Level 8
Cool, I am looking forward to it too. I have posted a logic solution, but not sure, as of now, how to implement it!
FreelanceReinh
Jade | Level 19

Here's an adapted version ITER4U of my previous macro ITER. I haven't tested it thoroughly yet, but at least it produces the correct result on your test data. Maybe it's unnecessarily complicated and it definitely could be polished here and there because it's an adaptation of code which was developed for a slightly different task.

 

data have;
input ID1 $ ID2 $;
cards;
A1 A2
A2 A3
A3 A1
A4 A5
A6 A7
A7 A8
A1 A9
;

/* Macro to perform the iterative algorithm */

%macro iter4u(dsin=have  /* input dataset                 */
           , dsout=want  /* output dataset                */
           , v1=id1      /* variable name for first IDs   */
           , v2=id2      /* variable name for second IDs  */
           );
%local i n1 n2;

data &dsin.2;
set &dsin
    &dsin(rename=(&v1=&v2 &v2=&v1));
run;

proc sort data=&dsin.2;
by &v1;
run;

%let i=1;

/* Assign initial group numbers g */

data t1;
set &dsin.2;
by &v1;
g+first.&v1;
run;

proc sql noprint;        /* The sum of all group numbers */
select sum(g) into :n1   /* is used to detect changes    */
from t1;                 /* after reassigning groups.    */
quit;

%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;

%do %until(&n1=&n2);
  %let i=%eval(3-&i); /* toggle i between 1 and 2 */

  proc sql noprint;         /* If one ID2 has been assigned      */
  create table t&i as       /* multiple group numbers, they are  */
  select &v1, &v2, min(g) as g  /* replaced by their minimum.    */
  from t%eval(3-&i)         /* In the next iteration the same    */
  group by &&v&i;           /* procedure is applied to ID1       */
                            /* rather than ID2.                  */
  select sum(g) into :n&i   /* And so on, alternating, until     */
  from t&i;                 /* no reassignments occur, i.e., t1  */
  quit;                     /* and t2 are equal up to sort order */
%end;                       /* and hence &n1=&n2.                */

proc sql;                   /* Bring ID1 and ID2 together        */
create table uni as       
select g, &v1 as ID from t1
union
select g, &v2 as ID from t1
quit;

proc sql noprint;           /* Assign preliminary cluster numbers */
create table &dsout.0 as 
select ID, min(g) as g
from uni
group by ID;  
quit;

/* Create result dataset */

proc sort data=&dsout.0;
by g ID;
run;

data &dsout;
set &dsout.0;
by g;
Clus+first.g;              /* Renumber the clusters as 1, 2, 3, ... */
drop g;
run;
%mend iter4u;

%iter4u;

proc print data=want noobs;
run;
munitech4u
Quartz | Level 8
Can you please explain? How does this work?

data t1;
set &dsin.2;
by &v1;
g+first.&v1;
run;

And the purpose of it?
FreelanceReinh
Jade | Level 19

This step is based on the intermediate dataset HAVE2 (&dsin.2) which contains all ID1-ID2 pairs and ID2-ID1 pairs (but with the variable names swapped) of the original input dataset HAVE (&dsin). I newly introduced this dataset into the macro to reflect the fact that in your application ID1 and ID2 are actually taken from a single set of IDs. This dataset was sorted by ID1 (&v1) in the previous step.

 

Now the first ID1 BY group is assigned g=1, the second is assigned g=2 and so on. (The sum statement g+... increments g by 1 whenever a new ID1 BY group starts, i.e., if FIRST.ID1=1.) These group numbers g can be regarded as preliminary cluster numbers. In the subsequent iterative steps these numbers are changed (replaced by a minimum group number) if it turns out that one ID has been assigned different group numbers (due to links to different other IDs). This process is iterated until no further renumberings occur.

munitech4u
Quartz | Level 8
A1 A2 1
A1 A9 1
A2 A3 2
A2 A1 2
A3 A2 1
A4 A5 4
A5 A4 5
A6 A7 6
A7 A8 7
A7 A6 7
A8 A7 6
A9 A1 2

But in the dataset g is like above. Why the 4th and 5th obs are 1 and 4?
Reeza
Super User

Here's the subgraph macro solution courtesy of @PGStats

 

It uses hash tables, which I'm not even going to pretend to understand. The cluster ID's are not the same, but the clusters are.

 

The forum or Chrome is messing with the code though, you need to replace the ampersandcolonsemicolon with a single colon to make it work. 

 

/*
The SubGraphs macro

This SAS macro finds the disjoint subgraphs in a graph described by a set of arcs.
The input dataset requires only two variables of the same length and type, character
or numeric. Each observation describes an arc in the graph. The output dataset will
contain two variables: Node will be similar in size and type to the input variables
and Clust will be a number identifying the clusters.

Macro arguments

arcs : input dataset name
from : first arc variable (optional, default=from)
to : second arc variable (optional, default=to)
out : output dataset name (optional, default=Clusters)
exp : power of two exponent of internal hash size (optional, default=8)

Tested with SAS version 9.3

PGStats

*/

/* Example : From a list of American city pairs separated by less than 800 miles,
we identify two regions where one could get from one city to the next without
ever travelling more than 800 miles. */

/*Example data view: convert city distance matrix to city pairs list.
Keep nearby cities (<800 miles apart) only. */
/*
data Nearby(keep=from to Miles) / view=Nearby;
set Sashelp.Mileages;
length from to $15;
array c{*} _NUMERIC_;
from = upcase(compress(City,". -"));
do i = 1 to dim(c);
	if c{i} > 0 and c{i} < 800 then do;
		to = upcase(vname(c{i}));
		Miles = c{i};
		output;
	end;
end;
run;
*/
	
/* Macro definition */
%macro SubGraphs(arcs,from=from,to=to,out=Clusters,exp=8);
data _null_;
if 0 then set &arcs(keep=&from rename=(&from=node)); /* get node data type */
length clust 8;
declare hash nodes(hashexp:&exp);
nodes.defineKey('node');
nodes.defineData('node', 'clust');
nodes.defineDone();
declare hiter nodeList('nodes');

do newClust = 1 by 1 while(not endLoop);
	set &arcs end=endLoop;
	call missing(clust); node = &from;
	if 0^=nodes.find() then nodes.add(); 
	fromClust = clust; 
	call missing(clust); node = &to;
	if 0^=nodes.find() then nodes.add(); 
	toClust = clust;
	if n(fromClust, toClust) = 0 then do;
		nodes.replace(key:&from, data&colon;&from, data&colon;newClust);
		nodes.replace(key:&to, data&colon;&to, data&colon;newClust);
	end;
	else if missing(toClust) then 
		nodes.replace(key:&to, data&colon;&to, data&colon;fromClust);
	else if missing(fromClust) then 
		nodes.replace(key:&from, data&colon;&from, data&colon;toClust);
	else if fromClust ne toClust then do;
		rc = nodeList.first();
		do while (rc = 0);
			if clust = fromClust then 
				nodes.replace(key:node, data&colon;node, data&colon;toClust);
			rc = nodeList.next();
		end;
	end;
end;
nodes.output(dataset:"&out");
stop;
run;
%mend SubGraphs;

/* Example cont'd. Call the macro with the city pairs list dataview 
and print the results. */
/*
options mprint symbolgen;
%SubGraphs(Nearby,out=Regions,exp=4);

proc sql;
select clust as Region, node as City 
from Regions order by clust, node;
quit; 
*/



data have;

input ID1 $ ID2 $;
cards;
A1 A2
A2 A3
A4 A5
A6 A7
A7 A8
A1 A9
;
run;

%SubGraphs(have,from=id1,to=id2,out=Clusters,exp=8);

proc sort data=clusters;
by clust node;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 24 replies
  • 3367 views
  • 5 likes
  • 5 in conversation