Hi there,
I have around 15 million observations, where each observation is a 1 metre length of road with the variables being various attributes.
I want to create an aggregate data set, so that each observation is a longer length (with a upper limit of 1000 metres, but no lower limit). I would like this length to be determined by having the same attributes (homogeneous lengths of road). Where anyone of the attributes changes, a new length is created. Now, each observation has a start point (metres along a road) and an end point (e.g. 1 to 2) and the next contiguous length will have the previous end point as the start point and so on (think of a spatial file with lines). There will be breaks in the lengths, the observations will not always be contiguous so this is another break point I want to use.
And on top of that I want to use the first start point as the new start point and take the end point of the last record in that length and use that as the end point in the aggregate data set. The start and end points also need to be in order when aggregating is occurring so that you don't end up with a start point that exists after your end point.
Ideally I would also calculate the new length by summing the length_m variable of the observations aggregated into a length, or simply by subtracting the new end point from the new start point.
I get the logic of what I need to do I think, but I'm struggling to turn this into code. Note that I'm ok with the fact that in this way the roads all exist in parallel (i.e. it doesn't take into account the real intersecting nature of a road network).
I am using Base SAS 9.4.
The following is a small example of the data that I have. It has 46 metres of road, with 46 observations/rows:
data have;
input road $ carriage $ start end attribute1 $ length_m;
length road $ 2 carriage $ 2 start 3 end 3 attribute1 $ 4 length_m 4;
datalines;
P1 L 0 1 50PD 1
P1 L 1 2 50PD 1
P1 L 2 3 50PD 1
P1 L 7 8 50PD 1
P1 L 8 9 50PD 1
P1 L 3 4 60PD 1
P1 L 4 5 60PD 1
P1 L 6 7 60PD 1
P1 L 11 12 70AS 1
P1 L 9 10 70PD 1
P1 L 10 11 70PD 1
P1 R 0 1 50PD 1
P1 R 1 2 50PD 1
P1 R 2 3 50PD 1
P1 R 7 8 50PD 1
P1 R 8 9 50PD 1
P1 R 9 10 50PD 1
P1 R 3 4 60PD 1
P1 R 4 5 60PD 1
P1 R 6 7 60PD 1
P1 R 10 11 70PD 1
P1 R 11 12 70PD 1
P1 S 5 6 60PD 1
P2 L 0 1 50PD 1
P2 L 1 2 50PD 1
P2 L 2 3 50PD 1
P2 L 7 8 50PD 1
P2 L 8 9 50PD 1
P2 L 3 4 60PD 1
P2 L 4 5 60PD 1
P2 L 6 7 60PD 1
P2 L 11 12 70AS 1
P2 L 9 10 70PD 1
P2 L 10 11 70PD 1
P2 R 0 1 50PD 1
P2 R 1 2 50PD 1
P2 R 2 3 50PD 1
P2 R 7 8 50PD 1
P2 R 8 9 50PD 1
P2 R 9 10 50PD 1
P2 R 3 4 60PD 1
P2 R 4 5 60PD 1
P2 R 6 7 60PD 1
P2 R 10 11 70PD 1
P2 R 11 12 70PD 1
P2 S 5 6 60PD 1
run;
The resulting table I'm hoping to get is as follows. It has 46 metres of road, broken up into 24 observations (unique, homogenous lengths of road of varying length):
data want;
input road $ carriage $ start end attribute1 $ length_m;
length road $ 2 carriage $ 2 start 3 end 3 attribute1 $ 4 length_m 4;
datalines;
P1 L 0 3 50PD 3
P1 L 7 9 50PD 2
P1 L 3 5 60PD 2
P1 L 6 7 60PD 1
P1 L 11 12 70AS 1
P1 L 9 11 70PD 2
P1 R 0 3 50PD 3
P1 R 7 10 50PD 3
P1 R 3 5 60PD 2
P1 R 6 7 60PD 1
P1 R 10 12 70PD 2
P1 S 5 6 60PD 1
P2 L 0 3 50PD 3
P2 L 7 9 50PD 2
P2 L 3 5 60PD 2
P2 L 6 7 60PD 1
P2 L 11 12 70AS 1
P2 L 9 11 70PD 2
P2 R 0 3 50PD 3
P2 R 7 10 50PD 3
P2 R 3 5 60PD 2
P2 R 6 7 60PD 1
P2 R 10 12 70PD 2
P2 S 5 6 60PD 1
run;
Can anyone help with an efficient and effective solution?
Thanks for your time!
This has been tested with your data, but your sample doesn't include a length over 1,000, so it's not a complete test:
data have;
input road $ carriage $ start end attribute1 $ length_m;
length /*road $ 2 carriage $ 2 */ start 3 end 3 /*attribute1 $ 4*/ length_m 4;
datalines;
P1 L 0 1 50PD 1
P1 L 1 2 50PD 1
P1 L 2 3 50PD 1
P1 L 7 8 50PD 1
P1 L 8 9 50PD 1
P1 L 3 4 60PD 1
P1 L 4 5 60PD 1
P1 L 6 7 60PD 1
P1 L 11 12 70AS 1
P1 L 9 10 70PD 1
P1 L 10 11 70PD 1
P1 R 0 1 50PD 1
P1 R 1 2 50PD 1
P1 R 2 3 50PD 1
P1 R 7 8 50PD 1
P1 R 8 9 50PD 1
P1 R 9 10 50PD 1
P1 R 3 4 60PD 1
P1 R 4 5 60PD 1
P1 R 6 7 60PD 1
P1 R 10 11 70PD 1
P1 R 11 12 70PD 1
P1 S 5 6 60PD 1
P2 L 0 1 50PD 1
P2 L 1 2 50PD 1
P2 L 2 3 50PD 1
P2 L 7 8 50PD 1
P2 L 8 9 50PD 1
P2 L 3 4 60PD 1
P2 L 4 5 60PD 1
P2 L 6 7 60PD 1
P2 L 11 12 70AS 1
P2 L 9 10 70PD 1
P2 L 10 11 70PD 1
P2 R 0 1 50PD 1
P2 R 1 2 50PD 1
P2 R 2 3 50PD 1
P2 R 7 8 50PD 1
P2 R 8 9 50PD 1
P2 R 9 10 50PD 1
P2 R 3 4 60PD 1
P2 R 4 5 60PD 1
P2 R 6 7 60PD 1
P2 R 10 11 70PD 1
P2 R 11 12 70PD 1
P2 S 5 6 60PD 1
run;
data want (drop=nxt_: first_:);
if 0 then set have;
do until (nxt_start>end or nxt_length>1000 or last.attribute1=1);
set have (keep=road carriage attribute1);
by road carriage attribute1 notsorted;
merge have have (firstobs=2 keep=start end rename=(start=nxt_start end=nxt_end));
if first_start=. then first_start=start;
nxt_length=(nxt_end-first_start);
end;
start=first_start;
length_m=(end-start);
run;
This has been tested with your data, but your sample doesn't include a length over 1,000, so it's not a complete test:
data have;
input road $ carriage $ start end attribute1 $ length_m;
length /*road $ 2 carriage $ 2 */ start 3 end 3 /*attribute1 $ 4*/ length_m 4;
datalines;
P1 L 0 1 50PD 1
P1 L 1 2 50PD 1
P1 L 2 3 50PD 1
P1 L 7 8 50PD 1
P1 L 8 9 50PD 1
P1 L 3 4 60PD 1
P1 L 4 5 60PD 1
P1 L 6 7 60PD 1
P1 L 11 12 70AS 1
P1 L 9 10 70PD 1
P1 L 10 11 70PD 1
P1 R 0 1 50PD 1
P1 R 1 2 50PD 1
P1 R 2 3 50PD 1
P1 R 7 8 50PD 1
P1 R 8 9 50PD 1
P1 R 9 10 50PD 1
P1 R 3 4 60PD 1
P1 R 4 5 60PD 1
P1 R 6 7 60PD 1
P1 R 10 11 70PD 1
P1 R 11 12 70PD 1
P1 S 5 6 60PD 1
P2 L 0 1 50PD 1
P2 L 1 2 50PD 1
P2 L 2 3 50PD 1
P2 L 7 8 50PD 1
P2 L 8 9 50PD 1
P2 L 3 4 60PD 1
P2 L 4 5 60PD 1
P2 L 6 7 60PD 1
P2 L 11 12 70AS 1
P2 L 9 10 70PD 1
P2 L 10 11 70PD 1
P2 R 0 1 50PD 1
P2 R 1 2 50PD 1
P2 R 2 3 50PD 1
P2 R 7 8 50PD 1
P2 R 8 9 50PD 1
P2 R 9 10 50PD 1
P2 R 3 4 60PD 1
P2 R 4 5 60PD 1
P2 R 6 7 60PD 1
P2 R 10 11 70PD 1
P2 R 11 12 70PD 1
P2 S 5 6 60PD 1
run;
data want (drop=nxt_: first_:);
if 0 then set have;
do until (nxt_start>end or nxt_length>1000 or last.attribute1=1);
set have (keep=road carriage attribute1);
by road carriage attribute1 notsorted;
merge have have (firstobs=2 keep=start end rename=(start=nxt_start end=nxt_end));
if first_start=. then first_start=start;
nxt_length=(nxt_end-first_start);
end;
start=first_start;
length_m=(end-start);
run;
Thank you @mkeintz - this looks great. I'll test this on the real data today. I'd got about as far as creating all the break points manually with flags, using lags and was trying to tackle a do loop so I'm so glad I was somewhat on the right, albeit much clunkier track. Practice, practice, practice.
Thank you very much, this worked perfectly on the sample and my real data. I modified slightly by removing the notsorted option in the data step and including an explicit proc sort as below.
proc sort data = have;
by road carriage attribute1 start;
run;
For validation, I summed the lengths by various features in the original data set and compared to those in the end result, as well as a spot check of a random selection of roads.
Thanks again for your time!
Once upon a time I worked with some similar data. There may be a few things that you need to address, possibly before the accumulation.
One question: Is there some specific reason that your data is not sorted by start/end within the Road/Carriage settings?
The road data I worked with had roads that overlapped, i.e. the same road section belonged to multiple road identifiers. In my case those were US Interstate , US Highway, and State Highway most often. Does the same road section in your data ever get associated with multiple road identifiers? For your purpose would want/need each road (never a gap in the accumulation) or some treatment by priority (if this section belongs to US and State highway only include for US highway).
And for added fun, on one section of these overlaps since it was sharing roadbed with one road that was mile marked (think Start/End section) for east-west direction and another that was north/south so that the actual "mile" directions in effect changed.
Since we were using this data to create maps this was a non-trivial exercise. Also when traffic crashes were recorded there were rules involved as to which highway system was reported in the overlapping sections.
Thanks for replying! You've got to love the 'added fun'. I'm reasonably new to this network data and not from a spatial background, so it has been eye opening. I'm hoping that I've dealt with most issues in pre-processing, but it's an iterative process. I do not have the same section belonging to multiple identifiers and do not expect the direction of mile markers to be a problem. I'm doing some risk-based modelling, so my interest is in the features of the sections and how well they predict various crash outcomes. I do have the issue of crashes at intersections being allocated according to a hierarchy and plan to compare a couple of simplistic ways of dealing with this to see how it impacts results.
At this point I'm happy with gaps in accumulation if there are gaps in the data. There are a number of contextual reasons for gaps, but at this stage it's also a good checkpoint for if something is going wrong in my manipulation and merging of large data sets I'm unfamiliar with. I will collapse sections after inspection, if appropriate.
There is no specific reason the data wouldn't be sorted that way in Base SAS as I have to PROC SORT it that way for a lot of the prep before this aggregation phase. There's no context specific reason that I'm aware of either. The only reason it wouldn't be would be because I have access to SAS Viya as well, so I did it in there I would require some code modifications as my understanding is that that 'sorts on the fly'. However, that's a whole separate kettle of fish.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.