turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- Recursive lookup for ID's

Topic Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-29-2016 08:38 AM - last edited on 03-29-2016 11:50 AM by Reeza

I have a dataset like this:

**ID1 ID2**

A1 A2

A2 A3

A4 A5

A6 A7

A7 A8

A1 A9

I want an output dataset like:

**ID Clus**

A1 1

A2 1

A3 1

A9 1

A4 2

A5 2

A6 3

A7 3

A8 3

Basically I want to cluster all the mapped IDs into one cluster. I tried unsuccessfully with self join.

Any ideas?

Accepted Solutions

Solution

04-04-2016
03:57 AM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

03-29-2016 12:53 PM

Can you please explain? How does this work?

data t1;

set &dsin.2;

by &v1;

g+first.&v1;

run;

And the purpose of it?

data t1;

set &dsin.2;

by &v1;

g+first.&v1;

run;

And the purpose of it?

All Replies

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to munitech4u

03-29-2016 09:38 AM

Sorry, I am not clear on your logic here at all. Your Have dataset doesn't seem to reflect your out dataset in any way. Why is there only one A1 in your output dataset? Why is A9 after A3. What is Clus, and how is it derived? To get a unique list:

data want; set have (keep=id1 rename=(id1=id)) have (keep=id2 rename=(id2=id)); run; proc sort data=want nodupkey; by id; run;

Note, I haven't tested the above - post test data in the forma of a datastep so I don't have to type it in to get tested code. The above will give you a distinct list of id's.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-29-2016 11:05 AM

clus is nothing but a category we need to assing for related IDs

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to munitech4u

03-29-2016 09:40 AM - edited 03-29-2016 10:44 AM

This is a chained or recursive lookup. Do you have SAS OR licensed? If so, PROC BOM can be useful.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to munitech4u

03-29-2016 09:52 AM

Hello @munitech4u,

That's interesting! Last week the same sort of question was asked by someone else. Ksharp and I have provided two different solutions on 24th and 25th March, respectively. Please see this thread and don't hesitate if you have questions (regarding my solution, I haven't analyzed Ksharp's yet).

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

03-29-2016 11:04 AM

Mine, is even more complex, as the relationship is not just one way, it can be two way. So like in the example give in other solution, the player id can occur in assignment id and vice versa.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to munitech4u

03-29-2016 11:13 AM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Reeza

03-29-2016 11:21 AM - edited 03-29-2016 11:31 AM

Yea, Seem like I need to go through it. The solution is based on the lag value which, is not the case with me. It does not necessarily have a relationship with lag.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to munitech4u

03-29-2016 11:31 AM

@munitech4u: I'm already in the process of adapting my macro ITER to your needs. (It was really interesting to develop it.)

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

03-29-2016 11:33 AM

Cool, I am looking forward to it too. I have posted a logic solution, but not sure, as of now, how to implement it!

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to munitech4u

03-29-2016 12:19 PM

Here's an adapted version ITER4U of my previous macro ITER. I haven't tested it thoroughly yet, but at least it produces the correct result on your test data. Maybe it's unnecessarily complicated and it definitely could be polished here and there because it's an adaptation of code which was developed for a slightly different task.

```
data have;
input ID1 $ ID2 $;
cards;
A1 A2
A2 A3
A3 A1
A4 A5
A6 A7
A7 A8
A1 A9
;
/* Macro to perform the iterative algorithm */
%macro iter4u(dsin=have /* input dataset */
, dsout=want /* output dataset */
, v1=id1 /* variable name for first IDs */
, v2=id2 /* variable name for second IDs */
);
%local i n1 n2;
data &dsin.2;
set &dsin
&dsin(rename=(&v1=&v2 &v2=&v1));
run;
proc sort data=&dsin.2;
by &v1;
run;
%let i=1;
/* Assign initial group numbers g */
data t1;
set &dsin.2;
by &v1;
g+first.&v1;
run;
proc sql noprint; /* The sum of all group numbers */
select sum(g) into :n1 /* is used to detect changes */
from t1; /* after reassigning groups. */
quit;
%if &n1>=%sysfunc(constant(EXACTINT))
%then %put %str(WAR)NING: Sum of group numbers has become too large. Results may be incorrect!;
%do %until(&n1=&n2);
%let i=%eval(3-&i); /* toggle i between 1 and 2 */
proc sql noprint; /* If one ID2 has been assigned */
create table t&i as /* multiple group numbers, they are */
select &v1, &v2, min(g) as g /* replaced by their minimum. */
from t%eval(3-&i) /* In the next iteration the same */
group by &&v&i; /* procedure is applied to ID1 */
/* rather than ID2. */
select sum(g) into :n&i /* And so on, alternating, until */
from t&i; /* no reassignments occur, i.e., t1 */
quit; /* and t2 are equal up to sort order */
%end; /* and hence &n1=&n2. */
proc sql; /* Bring ID1 and ID2 together */
create table uni as
select g, &v1 as ID from t1
union
select g, &v2 as ID from t1
quit;
proc sql noprint; /* Assign preliminary cluster numbers */
create table &dsout.0 as
select ID, min(g) as g
from uni
group by ID;
quit;
/* Create result dataset */
proc sort data=&dsout.0;
by g ID;
run;
data &dsout;
set &dsout.0;
by g;
Clus+first.g; /* Renumber the clusters as 1, 2, 3, ... */
drop g;
run;
%mend iter4u;
%iter4u;
proc print data=want noobs;
run;
```

Solution

04-04-2016
03:57 AM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

03-29-2016 12:53 PM

data t1;

set &dsin.2;

by &v1;

g+first.&v1;

run;

And the purpose of it?

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to munitech4u

03-29-2016 01:23 PM

This step is based on the intermediate dataset HAVE2 (&dsin.2) which contains all ID1-ID2 pairs and ID2-ID1 pairs (but with the variable names swapped) of the original input dataset HAVE (&dsin). I newly introduced this dataset into the macro to reflect the fact that in your application ID1 and ID2 are actually taken from a single set of IDs. This dataset was sorted by ID1 (&v1) in the previous step.

Now the first ID1 BY group is assigned g=1, the second is assigned g=2 and so on. (The sum statement g+... increments g by 1 whenever a new ID1 BY group starts, i.e., if FIRST.ID1=1.) These group numbers g can be regarded as preliminary cluster numbers. In the subsequent iterative steps these numbers are changed (replaced by a minimum group number) if it turns out that one ID has been assigned different group numbers (due to links to different other IDs). This process is iterated until no further renumberings occur.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to FreelanceReinhard

03-29-2016 01:36 PM

A1 A2 1

A1 A9 1

A2 A3 2

A2 A1 2

A3 A2 1

A4 A5 4

A5 A4 5

A6 A7 6

A7 A8 7

A7 A6 7

A8 A7 6

A9 A1 2

But in the dataset g is like above. Why the 4th and 5th obs are 1 and 4?

A1 A9 1

A2 A3 2

A2 A1 2

A3 A2 1

A4 A5 4

A5 A4 5

A6 A7 6

A7 A8 7

A7 A6 7

A8 A7 6

A9 A1 2

But in the dataset g is like above. Why the 4th and 5th obs are 1 and 4?

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to munitech4u

03-29-2016 01:46 PM - edited 03-29-2016 01:49 PM

Here's the subgraph macro solution courtesy of @PGStats

It uses hash tables, which I'm not even going to pretend to understand. The cluster ID's are not the same, but the clusters are.

The forum or Chrome is messing with the code though, you need to replace the __ ampersandcolonsemicolon__ with a single colon to make it work.

```
/*
The SubGraphs macro
This SAS macro finds the disjoint subgraphs in a graph described by a set of arcs.
The input dataset requires only two variables of the same length and type, character
or numeric. Each observation describes an arc in the graph. The output dataset will
contain two variables: Node will be similar in size and type to the input variables
and Clust will be a number identifying the clusters.
Macro arguments
arcs : input dataset name
from : first arc variable (optional, default=from)
to : second arc variable (optional, default=to)
out : output dataset name (optional, default=Clusters)
exp : power of two exponent of internal hash size (optional, default=8)
Tested with SAS version 9.3
PGStats
*/
/* Example : From a list of American city pairs separated by less than 800 miles,
we identify two regions where one could get from one city to the next without
ever travelling more than 800 miles. */
/*Example data view: convert city distance matrix to city pairs list.
Keep nearby cities (<800 miles apart) only. */
/*
data Nearby(keep=from to Miles) / view=Nearby;
set Sashelp.Mileages;
length from to $15;
array c{*} _NUMERIC_;
from = upcase(compress(City,". -"));
do i = 1 to dim(c);
if c{i} > 0 and c{i} < 800 then do;
to = upcase(vname(c{i}));
Miles = c{i};
output;
end;
end;
run;
*/
/* Macro definition */
%macro SubGraphs(arcs,from=from,to=to,out=Clusters,exp=8);
data _null_;
if 0 then set &arcs(keep=&from rename=(&from=node)); /* get node data type */
length clust 8;
declare hash nodes(hashexp:&exp);
nodes.defineKey('node');
nodes.defineData('node', 'clust');
nodes.defineDone();
declare hiter nodeList('nodes');
do newClust = 1 by 1 while(not endLoop);
set &arcs end=endLoop;
call missing(clust); node = &from;
if 0^=nodes.find() then nodes.add();
fromClust = clust;
call missing(clust); node = &to;
if 0^=nodes.find() then nodes.add();
toClust = clust;
if n(fromClust, toClust) = 0 then do;
nodes.replace(key:&from, data:&from, data:newClust);
nodes.replace(key:&to, data:&to, data:newClust);
end;
else if missing(toClust) then
nodes.replace(key:&to, data:&to, data:fromClust);
else if missing(fromClust) then
nodes.replace(key:&from, data:&from, data:toClust);
else if fromClust ne toClust then do;
rc = nodeList.first();
do while (rc = 0);
if clust = fromClust then
nodes.replace(key:node, data:node, data:toClust);
rc = nodeList.next();
end;
end;
end;
nodes.output(dataset:"&out");
stop;
run;
%mend SubGraphs;
/* Example cont'd. Call the macro with the city pairs list dataview
and print the results. */
/*
options mprint symbolgen;
%SubGraphs(Nearby,out=Regions,exp=4);
proc sql;
select clust as Region, node as City
from Regions order by clust, node;
quit;
*/
data have;
input ID1 $ ID2 $;
cards;
A1 A2
A2 A3
A4 A5
A6 A7
A7 A8
A1 A9
;
run;
%SubGraphs(have,from=id1,to=id2,out=Clusters,exp=8);
proc sort data=clusters;
by clust node;
run;
```