I have a dataset of operating data as measured on three variables (e.g. speed, temperature, weight). I would like to identify which variables lie outside a boundary. The boundary are values for the three variables. The boundary is made up of about 200 observations in another table, and can be made to be contiguous, though operationally, only the max values are of consideration.
The only analogy I can find to this is gps related such as 'proc ginside'. My need is to compare numeric variables, but more than 2, and i don't know if my application would match a mapping procedure.
Is this just a matter of testing an operating point against each point in the boundary table to see if it falls outside the limits? In which case construct a macro or loop?
Is there an existing method that streamlines this?
example operating data from field
unit date temperature speed weight g 01JAN2021:00:00:00.000000 12 1,231.40 248.93 g 01JAN2021:00:10:00.000000 12.5 1,480.82 250.74 g 01JAN2021:00:20:00.000000 15 1,134.39 282.13 g 01JAN2021:00:30:00.000000 22 1,090.64 275.02
example limits table
point Temperature Speed_limit Weight_limit 1 30 0 601.5029396 2 30 50 601.5029334 3 30 50.00000055 765.8339735 4 30 100.0000011 765.8339736 5 30 150.0000017 765.8339737 6 30 200.0000023 765.8339737 7 30 250.0000028 765.8339738 8 30 300.0000034 765.8339739 9 30 350.000004 765.8339739 10 30 400.0000047 765.833974
IF your polygons are convex and possibly concentric then use Ginside with concentric polygons with Temperature as the ID variable. Use the other two coordinates as the map variables. Ginside would return the innermost polygon id value that a point is inside. If the data is ordered correctly. At least it does for a very small example.
/* create concentric polygon set with the ID of the innermost smallest first (Not sure if the order is critical) note the repeat of first vertex as last to close the polygon */ data mapset; input x y idvar; datalines; 5 5 1 5 10 1 10 10 1 10 5 1 5 5 1 0 0 2 0 15 2 15 15 2 15 0 2 0 0 2 ; /* this shows appearance of the polygons*/ pattern value=empty; proc gmap map=mapset data=mapset; id idvar; choro idvar /coutline=black; run; quit; /* create some points to test ginside*/ data testvalues; input x y; datalines; 6 6 6 13 1 1 ; proc ginside map=mapset data=testvalues out=resultset includeborder ; id idvar; run;
The "trick" may be getting the map data set created as I cannot tell if you have the values in proper graphing order which is not a trivial exercise.
I have an example that closes each polygon to make sure the entire boundary is as expected. This could be accomplished if your data is in a good order by using first. and last processing to set the starting coordinates of the polygon on the First.temperature and then add them as the closing point at the Last.temperature from boundaries set.
This is one way:
data limits; input cat Temperature speed weight; datalines; 0.9 30 0 602 0.9 30 50 602 0.9 30 50 1128 0.9 30 1953 1128 0.95 30 2161 395 0.95 30 2162 339 0.95 30 2162 282 0.95 30 2163 226 0.95 30 2164 169 0.95 30 2164 113 0.9 35 1978 -564 0.9 35 1974 -621 0.9 35 1968 -677 0.95 35 2037 1016 0.95 35 2050 959 0.95 35 2062 903 0.95 35 2071 846 0.95 35 2079 790 0.95 35 2085 734 0.95 35 2090 677 0.95 35 2092 621 ; data mapset; /*map set requires specific variables*/ set limits (rename=(temperature=id speed=x weight=y)); /* if your id variable is grouped but not actually sorted*/ by id notsorted; retain tempx tempy; if first.id then do; tempx=x; tempy=y; end; output; if last.id then do; x=tempx; y=tempy; output; end; drop tempx tempy; run;
No clue as to the behavior of how overlapping non-concentric polygons will react. I think the first ID value found would be the result but not interested enough to try to create such data and test.
Note that depending on GINSIDE options chosen you can restrict the output data set to only inside, only outside or all the data. If "outside" of all polygons the ID variable should be missing.
Can you show what you expect for the result?
This is likely a fairly straight-forward Proc SQL call as that is one way to combine and compare all records in one data set with another. This would be called a Cartesian Join. One example that selects records where ALL three limits are exceeded would look like this. Note that you didn't very clearly specify if you want when ALL three exceed or two of three or if any one of the three exceed their limit. So that might mean some additional logic involving OR instead of AND for the comparisons.
proc sql; create table compared as select a.*, b.* from operatingdata as a ,limitdata as b where a.temperature > b.temperature and a.speed > b.speed_limit and a.weight > b.weight_limit ; quit;
The output of the above would have all of the values involved so you could process data further for some process.
I suspect somewhere you are going to get to a "but I only wanted the lowest set of limits that were exceeded". If so, you should state that but the output set above could be sorted and processed to get that.
Thanks @ballardw, I think a simple compare on all three variables would show some points as out of bounds when they are in fact in bounds.
A math method I have found that comes close is 'ray casting'. (in the case of 2 dimensions) take any operating point's y value and extend to infinite. If the line intersects the limits boundary an odd number of times, the point is inside. If an even number of times, or 0, then it is outside the boundary.
So some time of ray casting in SAS for more than 2 dimensions. I might be able to remove a dimension though.
@eh51 wrote:
Thanks @ballardw, I think a simple compare on all three variables would show some points as out of bounds when they are in fact in bounds.
A math method I have found that comes close is 'ray casting'. (in the case of 2 dimensions) take any operating point's y value and extend to infinite. If the line intersects the limits boundary an odd number of times, the point is inside. If an even number of times, or 0, then it is outside the boundary.
So some time of ray casting in SAS for more than 2 dimensions. I might be able to remove a dimension though.
I think you may be understating your problem by a lot. If the points in your limits are vertices of solution polygons or areas, then you need to state so. If so you may be looking for SAS/IML or SAS/OR depending on the interpretation of your complete data.
A reasonable two-dimension solutions set such as just speed and weight to construct solution polygons then Proc Ginside would work. Which may well be an implementation of the "ray casting" as I have no idea of the internal coding.
But this is three dimensions and I'm not sure of a 3d approach that is quite equivalent. Something related to solid modelling perhaps.
I am really working from scratch here.
My limits can be vertices of a polygon, and i believe I can further assume they are always convex.
I am looking at SAS/IML right now and might try to implement the equation (y-y0)(x1-x0)-(x-x0)(y1-y0)
if the result is less than 0, then the point is to the right of the segment, greater than is to the left, and 0 is on the line segment. To be inside or outside, the point has to be to the right of every segment, or left of every segment. Any mix means it is outside the polygon.
I might give ginside a try though given your input. Assuming the map can be anything.
In either case, I will have determine which of 10 different speed and weight polygons to use based on the operating temperature. There is some error in this, but I can make it conservative.
Ignoring the fact you have three variables for the moment, see this example. I'd try and get it working for one first and then figure out how to scale it.
Looking further at Ginside, I don't see a way to switch between different polygons based on the operating point. Ginside wants adjoining polygons, but my polygons will enclose each other since the third variable expands the polygon that should be used.
For this reason, I am creating a set of arrays that define the polygon, and using the third variable from the operating data to choose which polygon to compare. I will determine if the op point lies always to one side of all of the polygon segments.
You can use arrays here most likely but it depends a bit on how your limits are structured. GINSIDE is likely not the right approach unless you're dealing with spatial limits.
Can you post a small reproducible example and someone can help you with the code?
And here's a tutorial on using Arrays in SAS
https://stats.idre.ucla.edu/sas/seminars/sas-arrays/
I will try to address all of the comments from my Array thread, which are essentially asking me to 'start over'. 🙂
The data set I have is a set of field measurements. Operating data.
Obs | Group | local_dttm | Speed | weight | Temperature | Cat |
1 | 8784 | 04JAN2021:17:20:00.000000 | 2,329.65 | -426.06 | 3.35 | 0.97 |
2 | 8784 | 04JAN2021:17:30:00.000000 | 2,330.98 | -480.73 | 3.74 | 0.97 |
3 | 8784 | 04JAN2021:17:40:00.000000 | 2,329.38 | -447.44 | 3.88 | 0.97 |
4 | 8784 | 04JAN2021:17:50:00.000000 | 2,327.91 | -365.98 | 3.74 | 0.98 |
5 | 8784 | 04JAN2021:18:00:00.000000 | 2,329.30 | -225.05 | 3.62 | 0.99 |
6 | 8784 | 04JAN2021:18:10:00.000000 | 2,329.98 | -234.77 | 3.37 | 0.99 |
7 | 8784 | 04JAN2021:21:10:00.000000 | 2,331.31 | -87.29 | 2.01 | 0.99 |
8 | 8784 | 04JAN2021:21:20:00.000000 | 2,329.99 | -77.5 | 1.81 | 0.99 |
9 | 8784 | 04JAN2021:21:30:00.000000 | 2,330.17 | 12.78 | 1.83 | 1 |
10 | 8784 | 04JAN2021:21:40:00.000000 | 2,330.07 | 16.9 | 1.68 | 1 |
11 | 8784 | 04JAN2021:21:50:00.000000 | 2,330.94 | 2.36 | 1.94 | 1 |
12 | 8784 | 04JAN2021:22:00:00.000000 | 2,329.89 | -33.22 | 2.03 | 0.99 |
13 | 8784 | 04JAN2021:22:10:00.000000 | 2,329.93 | -142.8 | 2 | 0.98 |
14 | 8784 | 04JAN2021:22:20:00.000000 | 2,330.20 | -141.09 | 1.95 | 0.98 |
15 | 8784 | 04JAN2021:22:30:00.000000 | 2,329.50 | -129.25 | 1.78 | 0.98 |
16 | 8784 | 04JAN2021:22:40:00.000000 | 2,330.17 | -133.56 | 1.69 | 0.98 |
17 | 8784 | 04JAN2021:22:50:00.000000 | 2,329.39 | -152.02 | 1.56 | 0.98 |
18 | 8784 | 04JAN2021:23:00:00.000000 | 2,330.09 | -182.97 | 1.52 | 0.98 |
19 | 8784 | 04JAN2021:23:10:00.000000 | 2,329.59 | -169.54 | 1.29 | 0.98 |
20 | 8784 | 04JAN2021:23:20:00.000000 | 2,330.77 | -221.52 | 1.31 | 0.98 |
21 | 8784 | 04JAN2021:23:30:00.000000 | 2,329.89 | -236.03 | 1.61 | 0.98 |
22 | 8784 | 04JAN2021:23:40:00.000000 | 2,329.93 | -243.14 | 1.7 | 0.97 |
23 | 8784 | 04JAN2021:23:50:00.000000 | 2,329.21 | -234.27 | 1.6 | 0.97 |
24 | 8784 | 05JAN2021:00:00:00.000000 | 2,328.60 | -229.13 | . | 0.97 |
25 | 8784 | 05JAN2021:00:10:00.000000 | 2,330.11 | -188.98 | 1.49 | 0.98 |
26 | 8784 | 05JAN2021:00:20:00.000000 | 2,328.01 | -216.42 | 1.34 | 0.98 |
27 | 8784 | 05JAN2021:00:30:00.000000 | 2,308.17 | -213.41 | 1.23 | 0.98 |
28 | 8784 | 05JAN2021:00:40:00.000000 | 2,327.50 | -212.38 | 1.39 | . |
My desired data set would show if the operating point is over a limit (yes/no could be inside/outside or 1/0, etc.)
Obs | Group | local_dttm | Speed | weight | Temperature | Cat | Exceeds_limits |
1 | 8784 | 04JAN2021:17:20:00.000000 | 2,329.65 | -426.06 | 3.35 | 0.97 | yes |
2 | 8784 | 04JAN2021:17:30:00.000000 | 2,330.98 | -480.73 | 3.74 | 0.97 | No |
3 | 8784 | 04JAN2021:17:40:00.000000 | 2,329.38 | -447.44 | 3.88 | 0.97 | No |
4 | 8784 | 04JAN2021:17:50:00.000000 | 2,327.91 | -365.98 | 3.74 | 0.98 | No |
5 | 8784 | 04JAN2021:18:00:00.000000 | 2,329.30 | -225.05 | 3.62 | 0.99 | No |
6 | 8784 | 04JAN2021:18:10:00.000000 | 2,329.98 | -234.77 | 3.37 | 0.99 | No |
7 | 8784 | 04JAN2021:21:10:00.000000 | 2,331.31 | -87.29 | 2.01 | 0.99 | No |
8 | 8784 | 04JAN2021:21:20:00.000000 | 2,329.99 | -77.5 | 1.81 | 0.99 | No |
9 | 8784 | 04JAN2021:21:30:00.000000 | 2,330.17 | 12.78 | 1.83 | 1 | No |
10 | 8784 | 04JAN2021:21:40:00.000000 | 2,330.07 | 16.9 | 1.68 | 1 | Yes |
11 | 8784 | 04JAN2021:21:50:00.000000 | 2,330.94 | 2.36 | 1.94 | 1 | No |
12 | 8784 | 04JAN2021:22:00:00.000000 | 2,329.89 | -33.22 | 2.03 | 0.99 | No |
13 | 8784 | 04JAN2021:22:10:00.000000 | 2,329.93 | -142.8 | 2 | 0.98 | No |
14 | 8784 | 04JAN2021:22:20:00.000000 | 2,330.20 | -141.09 | 1.95 | 0.98 | No |
15 | 8784 | 04JAN2021:22:30:00.000000 | 2,329.50 | -129.25 | 1.78 | 0.98 | No |
16 | 8784 | 04JAN2021:22:40:00.000000 | 2,330.17 | -133.56 | 1.69 | 0.98 | No |
17 | 8784 | 04JAN2021:22:50:00.000000 | 2,329.39 | -152.02 | 1.56 | 0.98 | Yes |
18 | 8784 | 04JAN2021:23:00:00.000000 | 2,330.09 | -182.97 | 1.52 | 0.98 | No |
19 | 8784 | 04JAN2021:23:10:00.000000 | 2,329.59 | -169.54 | 1.29 | 0.98 | No |
20 | 8784 | 04JAN2021:23:20:00.000000 | 2,330.77 | -221.52 | 1.31 | 0.98 | No |
21 | 8784 | 04JAN2021:23:30:00.000000 | 2,329.89 | -236.03 | 1.61 | 0.98 | Yes |
22 | 8784 | 04JAN2021:23:40:00.000000 | 2,329.93 | -243.14 | 1.7 | 0.97 | No |
23 | 8784 | 04JAN2021:23:50:00.000000 | 2,329.21 | -234.27 | 1.6 | 0.97 | No |
24 | 8784 | 05JAN2021:00:00:00.000000 | 2,328.60 | -229.13 | . | 0.97 | No |
25 | 8784 | 05JAN2021:00:10:00.000000 | 2,330.11 | -188.98 | 1.49 | 0.98 | No |
26 | 8784 | 05JAN2021:00:20:00.000000 | 2,328.01 | -216.42 | 1.34 | 0.98 | No |
27 | 8784 | 05JAN2021:00:30:00.000000 | 2,308.17 | -213.41 | 1.23 | 0.98 | No |
28 | 8784 | 05JAN2021:00:40:00.000000 | 2,327.50 | -212.38 | 1.39 | . | Yes |
The final report by group, which is simple once the above is available.
Group | Percent exceeds |
8784 | 1.00% |
5694 | 1.50% |
5289 | 3.00% |
4535 | 0.10% |
2489 | 8.00% |
3527 | 4.00% |
The limits data set defines convex polygons. I cannot use a simple "is speed greater than the speed limit". I need to use something like Ginside, but my polygons are not adjacent. They lay on top of each other. A lower temperature provides a larger polygon with the same center and some identical vertices.
I can reduce my problem to two dimensions (two variables). The other variable (actually 2), I would use to choose the correct set of limits (speed/weight).
Here is a look at the limits data set, from which 10 different polygon vertices are obtained.
each cat/temperature combination represents one array.
cat | Temperature | speed | weight |
0.9 | 30 | 0 | 602 |
0.9 | 30 | 50 | 602 |
0.9 | 30 | 50 | 1128 |
0.9 | 30 | 1953 | 1128 |
0.95 | 30 | 2161 | 395 |
0.95 | 30 | 2162 | 339 |
0.95 | 30 | 2162 | 282 |
0.95 | 30 | 2163 | 226 |
0.95 | 30 | 2164 | 169 |
0.95 | 30 | 2164 | 113 |
0.9 | 35 | 1978 | -564 |
0.9 | 35 | 1974 | -621 |
0.9 | 35 | 1968 | -677 |
0.95 | 35 | 2037 | 1016 |
0.95 | 35 | 2050 | 959 |
0.95 | 35 | 2062 | 903 |
0.95 | 35 | 2071 | 846 |
0.95 | 35 | 2079 | 790 |
0.95 | 35 | 2085 | 734 |
0.95 | 35 | 2090 | 677 |
0.95 | 35 | 2092 | 621 |
My thought was to step through each operating data set observation,
Determine which set of limits to use (the rows where op-cat and op-temperature fall within a cat and temp in the limits data set)
Make the calculation for each operating set observation
Write the result to a new variable in the operating data set
This is what I have. Any recommendations on how to accomplish the above is appreciated.
data op_results;
set op_base;
/* original idea was to load a set of arrays that I can step through later in this data step but am not seeing a path */
create set of applicable_array (plural) here
/* initialize some variables */
negatives = 0;
positives = 0;
zeroes = 0;
/* step through the limits and determine if the operating point is to the right or left SIDE of each line drawn by each vertices pair in the limits data set/array */
do j = 2 to numrow(applicable_array,1);
side = sign(
(speed - applicable_array[i-1,1])*(applicable_array[i,2]-applicable_array[i-1,2]) -
(weight - applicable_array[i-1,2])*(applicable_array[i,1]-applicable_array[i-1,1])
);
if side = -1 then negatives++;
else if side = 1 then positives++;
else zeroes++;
end;
IF negatives = 0 and positives > 0 then exceedance = 'inside'; /* inside/outside could be yes/no or 1/0, etc. */
else exceedance = 'outside';
run;
The limits data set defines convex polygons. I cannot use a simple "is speed greater than the speed limit". I need to use something like Ginside, but my polygons are not adjacent. They lay on top of each other. A lower temperature provides a larger polygon with the same center and some identical vertices.
How do you know which polygon goes with which point?
The value of 'cat' and 'temperature' in the operations data set will fall between one of the limits. I would use the more conservative limit. This is the code I planned to use to choose the applicable limit array.
I replaced the code with --use...-- to give the idea.
/* determine which boundary to use */
if temperature <= 30 then do;
if cat <= 0.90 then --use limits 30090-- ;
else if (0.90 < cat <= 0.95) then --use limits 30095-- end;
else if (0.95 < cat <= 1) then --use limits 30100-- end;
else if (1 < cat <= 1.05) then --use limits 30105-- end;
else --use limits 30110;
end;
else do;
if cat <= 0.90 then --use limits 35090-- ;
else if (0.90 < cat <= 0.95) then --use limits 35095-- ;
else if (0.95 < cat <= 1) then --use limits 35100-- ;
else if (1 < cat <= 1.05) then --use limits 35105-- ;
else --use limits 35110-- ;
end;
IF your polygons are convex and possibly concentric then use Ginside with concentric polygons with Temperature as the ID variable. Use the other two coordinates as the map variables. Ginside would return the innermost polygon id value that a point is inside. If the data is ordered correctly. At least it does for a very small example.
/* create concentric polygon set with the ID of the innermost smallest first (Not sure if the order is critical) note the repeat of first vertex as last to close the polygon */ data mapset; input x y idvar; datalines; 5 5 1 5 10 1 10 10 1 10 5 1 5 5 1 0 0 2 0 15 2 15 15 2 15 0 2 0 0 2 ; /* this shows appearance of the polygons*/ pattern value=empty; proc gmap map=mapset data=mapset; id idvar; choro idvar /coutline=black; run; quit; /* create some points to test ginside*/ data testvalues; input x y; datalines; 6 6 6 13 1 1 ; proc ginside map=mapset data=testvalues out=resultset includeborder ; id idvar; run;
The "trick" may be getting the map data set created as I cannot tell if you have the values in proper graphing order which is not a trivial exercise.
I have an example that closes each polygon to make sure the entire boundary is as expected. This could be accomplished if your data is in a good order by using first. and last processing to set the starting coordinates of the polygon on the First.temperature and then add them as the closing point at the Last.temperature from boundaries set.
This is one way:
data limits; input cat Temperature speed weight; datalines; 0.9 30 0 602 0.9 30 50 602 0.9 30 50 1128 0.9 30 1953 1128 0.95 30 2161 395 0.95 30 2162 339 0.95 30 2162 282 0.95 30 2163 226 0.95 30 2164 169 0.95 30 2164 113 0.9 35 1978 -564 0.9 35 1974 -621 0.9 35 1968 -677 0.95 35 2037 1016 0.95 35 2050 959 0.95 35 2062 903 0.95 35 2071 846 0.95 35 2079 790 0.95 35 2085 734 0.95 35 2090 677 0.95 35 2092 621 ; data mapset; /*map set requires specific variables*/ set limits (rename=(temperature=id speed=x weight=y)); /* if your id variable is grouped but not actually sorted*/ by id notsorted; retain tempx tempy; if first.id then do; tempx=x; tempy=y; end; output; if last.id then do; x=tempx; y=tempy; output; end; drop tempx tempy; run;
No clue as to the behavior of how overlapping non-concentric polygons will react. I think the first ID value found would be the result but not interested enough to try to create such data and test.
Note that depending on GINSIDE options chosen you can restrict the output data set to only inside, only outside or all the data. If "outside" of all polygons the ID variable should be missing.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.