I have two datasets which need to be “combined” in a certain way that goes beyond a simple join/merge.
One dataset represents a roadway network consisting of links connected by nodes. Below is a representation of a roadway that splits at a “Y” intersection. A driver on link 100 can continue onto link 101 or 201. 101 is a dead-end, but 201 connects to 202, which connects to 203. The dataset is tens of thousands of rows.
LINK_ID | FROM_NODE | TO_NODE | LINK_TOTAL_LENGTH |
100 | 1 | 2 | 1.1 |
101 | 2 | 3 | 0.8 |
201 | 2 | 90 | 0.2 |
202 | 90 | 91 | 0.4 |
203 | 91 | 92 | 1.0 |
The other dataset represents points on the roadway network. These will serve as starting points for trips. Here is one record that indicates a starting point 0.3 units into the link. This dataset is a few thousand rows.
START_POINT_ID | LINK_ID | DISTANCE_INTO_LINK |
10001 | 100 | 0.3 |
My target deliverable is a third table that lists all links within a certain distance of the starting point. In this case, the distance is 1.5 units. I will need to do a bit of math to calculate the last three columns, in between each iteration.
START_POINT_ID | LINK_ID | START_POSITION_ON_LINK | END_POSITION_ON_LINK | CUMULATIVE_DISTANCE |
10001 | 100 | 0.3 | 1.1 | 0.8 |
10001 | 101 | 0.0 | 0.7 | 1.5 |
10001 | 201 | 0.0 | 0.2 | 1.0 |
10001 | 202 | 0.0 | 0.4 | 1.4 |
10001 | 203 | 0.1 | 0.1 | 1.5 |
Conceptually simple, but I’m not sure of the best way to code this. The logical approach is to just perform datastep merges over and over (using a macro) to keep finding the next LINK(s) based on matching the nodes, until the cumulative distance reaches the specified distance (1.5 units), then start over for the next START_POINT_ID. But that requires a lot of overhead.
Is this "iterative conditional lookup" something that could be done within a single datastep?
Thanks!
In case my example is too specific, this is probably similar to what I'm doing.
https://en.wikipedia.org/wiki/Connected_component_(graph_theory)
Edit: @Reeza to fix link
My initial post shows the exact data and exact results. I could expand to include additional records.
Would you suggest I create dataline code so responders can easily work with actual SAS datasets?
Hi @desertsp2,
Do you know PGStats's SubGraphs macro? Please see How to find all connected components in a graph. (Just as a starting point.) There are also several threads about similar problems in the community archives. "subgraphs" is a suitable keyword to find some of them.
Thanks! I'll do some further research and see if this has already been addressed.
Something like this may work:
proc sql;
create index from_node on links(from_node);
quit;
data want;
merge points(in=ok) links;
by link_id;
if ok;
start_position_on_link=distance_into_link;
end_position_on_link=link_total_length;
cumulative_distance=link_total_length-distance_into_link;
keep start_point_id link_id start_position_on_link end_position_on_link cumulative_distance from_node to_node;
run;
data want;
modify want;
from_node=to_node;
distance_so_far=cumulative_distance;
do from_node=.,to_node;
do until(0);
set links key=from_node;
if _iorc_ then leave;
start_position_on_link=0;
end_position_on_link=link_total_length;
cumulative_distance=distance_so_far+link_total_length;
output;
end;
end;
_error_=0;
run;
- except that I do not understand your calculation of the cumulated distance (nor why the start_position_on_link should be 0.1 on the last obs), so I made that up.
The idea is to make an iterative travel through the tree by adding nodes in the end of the table, which will then be read iteratively. The _error_ variable is set to 0 after the loop ends, because it would create an error message every time there are no more nodes to read.
Edit: I put in the outer loop (do from_node=.,to_node) to make the datastep reread the whole set of nodes, if there happens to be a point lying on the ending link of the last link read from the previous point.
The following code could get connected component.
data have;
infile cards ;
input from $ to $ ;
cards;
1 2
1 3
4 5
5 2
9 4
6 7
8 7
;
run;
data full;
set have end=last;
if _n_ eq 1 then do;
declare hash h();
h.definekey('node');
h.definedata('node');
h.definedone();
end;
output;
node=from; h.replace();
from=to; to=node;
output;
node=from; h.replace();
if last then h.output(dataset:'node');
drop node;
run;
data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();
if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
from_to.definekey('from');
from_to.definedata('to');
from_to.definedone();
if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
no.definekey('node');
no.definedata('node');
no.definedone();
do while(hi_no.next()=0);
household+1; output;
count=1;
key=node;_ha.add();
last=node;ha.add();
rc=hi.first();
do while(rc=0);
from=last;rx=from_to.find();
do while(rx=0);
key=to;ry=_ha.check();
if ry ne 0 then do;
node=to;output;rr=no.remove(key:node);
key=to;_ha.add();
count+1;
last=to;ha.add();
end;
rx=from_to.find_next();
end;
rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.