I have two datasets which need to be “combined” in a certain way that goes beyond a simple join/merge.
One dataset represents a roadway network consisting of links connected by nodes. Below is a representation of a roadway that splits at a “Y” intersection. A driver on link 100 can continue onto link 101 or 201. 101 is a dead-end, but 201 connects to 202, which connects to 203. The dataset is tens of thousands of rows.
LINK_ID | FROM_NODE | TO_NODE | LINK_TOTAL_LENGTH |
100 | 1 | 2 | 1.1 |
101 | 2 | 3 | 0.8 |
201 | 2 | 90 | 0.2 |
202 | 90 | 91 | 0.4 |
203 | 91 | 92 | 1.0 |
The other dataset represents points on the roadway network. These will serve as starting points for trips. Here is one record that indicates a starting point 0.3 units into the link. This dataset is a few thousand rows.
START_POINT_ID | LINK_ID | DISTANCE_INTO_LINK |
10001 | 100 | 0.3 |
My target deliverable is a third table that lists all links within a certain distance of the starting point. In this case, the distance is 1.5 units. I will need to do a bit of math to calculate the last three columns, in between each iteration.
START_POINT_ID | LINK_ID | START_POSITION_ON_LINK | END_POSITION_ON_LINK | CUMULATIVE_DISTANCE |
10001 | 100 | 0.3 | 1.1 | 0.8 |
10001 | 101 | 0.0 | 0.7 | 1.5 |
10001 | 201 | 0.0 | 0.2 | 1.0 |
10001 | 202 | 0.0 | 0.4 | 1.4 |
10001 | 203 | 0.1 | 0.1 | 1.5 |
Conceptually simple, but I’m not sure of the best way to code this. The logical approach is to just perform datastep merges over and over (using a macro) to keep finding the next LINK(s) based on matching the nodes, until the cumulative distance reaches the specified distance (1.5 units), then start over for the next START_POINT_ID. But that requires a lot of overhead.
Is this "iterative conditional lookup" something that could be done within a single datastep?
Thanks!
In case my example is too specific, this is probably similar to what I'm doing.
https://en.wikipedia.org/wiki/Connected_component_(graph_theory)
Edit: @Reeza to fix link
My initial post shows the exact data and exact results. I could expand to include additional records.
Would you suggest I create dataline code so responders can easily work with actual SAS datasets?
Hi @desertsp2,
Do you know PGStats's SubGraphs macro? Please see How to find all connected components in a graph. (Just as a starting point.) There are also several threads about similar problems in the community archives. "subgraphs" is a suitable keyword to find some of them.
Thanks! I'll do some further research and see if this has already been addressed.
Something like this may work:
proc sql;
create index from_node on links(from_node);
quit;
data want;
merge points(in=ok) links;
by link_id;
if ok;
start_position_on_link=distance_into_link;
end_position_on_link=link_total_length;
cumulative_distance=link_total_length-distance_into_link;
keep start_point_id link_id start_position_on_link end_position_on_link cumulative_distance from_node to_node;
run;
data want;
modify want;
from_node=to_node;
distance_so_far=cumulative_distance;
do from_node=.,to_node;
do until(0);
set links key=from_node;
if _iorc_ then leave;
start_position_on_link=0;
end_position_on_link=link_total_length;
cumulative_distance=distance_so_far+link_total_length;
output;
end;
end;
_error_=0;
run;
- except that I do not understand your calculation of the cumulated distance (nor why the start_position_on_link should be 0.1 on the last obs), so I made that up.
The idea is to make an iterative travel through the tree by adding nodes in the end of the table, which will then be read iteratively. The _error_ variable is set to 0 after the loop ends, because it would create an error message every time there are no more nodes to read.
Edit: I put in the outer loop (do from_node=.,to_node) to make the datastep reread the whole set of nodes, if there happens to be a point lying on the ending link of the last link read from the previous point.
The following code could get connected component.
data have;
infile cards ;
input from $ to $ ;
cards;
1 2
1 3
4 5
5 2
9 4
6 7
8 7
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.