I know that I've been asking a lot of questions. I have been pushing my SAS envelope quite a bit lately.
I have an interesting dataset of GPS data tracking a machine as it travels on a field. It moves from the field to headrows at each end. We have tens of thousands of points taken from GPS and GIS. I am basically trying to quantify the time spent on each leg (up a row, turn around in a headrow, down another row, turn around in a headrow, up another row), and in the future do some other calculations.
My current problem
Sometimes the machine doesn't cleanly re-enter the field and has to back out, reset, and restart. My approach has been to assign leg numbers, incrementing whenever the machine crosses the boundary into another zone. However, I wind up with tiny leg segments that should be lumped in with the leg where the machine first exited the field. I anticipated this problem to a degree: using a small artificial dataset, I used PROC MEANS to find these tiny legs, reassigned their leg numbers using the LAG function, merged the result back, and recalculated the legs. It worked great.
Real life is not as simple. First off, I get multiple tiny leg segments (sometimes close to a dozen, sometimes just a couple).
So: can anyone think of a way I might automate this? I am not generally opposed to doing it by hand, but I have at least 10 of these datasets to go through, and I'd like to automate it if I can.
This table is a subset of the output generated by my PROC MEANS, minus the New_PLT column, which is the column I'd like to find a way to generate automatically. You can see that the machine proceeds from headrow "HR1" into the field, into "HR2", and re-enters the field. The common attribute of the legs I want to lump in with the previous type is a series of legs (PLeg) with low _FREQ_, usually in a repeating pattern: HR1 FLD HR1 FLD ... or HR2 FLD HR2 FLD ...
Any ideas? I had some good ones, but I have a headache now.
Obs | PLeg | PLT | _TYPE_ | _FREQ_ | speed | New_PLT
51 | 51 | HR1 | 0 | 44 | 2.69545 | HR1
52 | 52 | FLD | 0 | 295 | 2.03322 | FLD
53 | 53 | HR2 | 0 | 145 | 0.68828 | HR2
54 | 54 | FLD | 0 | 216 | 2.54676 | FLD
55 | 55 | HR1 | 0 | 43 | 3.08837 | HR1
56 | 56 | FLD | 0 | 301 | 2.04751 | FLD
57 | 57 | HR2 | 0 | 119 | 0.73277 | HR2
58 | 58 | FLD | 0 | 21 | 0.0619 | HR2
59 | 59 | HR2 | 0 | 3 | 0 | HR2
60 | 60 | FLD | 0 | 1 | 0.1 | HR2
61 | 61 | HR2 | 0 | 2 | 0.15 | HR2
62 | 62 | FLD | 0 | 5 | 0.12 | HR2
63 | 63 | HR2 | 0 | 43 | 0.1 | HR2
64 | 64 | FLD | 0 | 211 | 2.61374 | FLD
65 | 65 | HR1 | 0 | 35 | 3.22857 | HR1
66 | 66 | FLD | 0 | 317 | 1.83533 | FLD
67 | 67 | HR2 | 0 | 331 | 0.37583 | HR2
68 | 68 | FLD | 0 | 243 | 2.27037 | FLD
69 | 69 | HR1 | 0 | 86 | 2.48023 | HR1
70 | 70 | FLD | 0 | 22 | 0.76364 | FLD
71 | 71 | HR1 | 0 | 43 | 2.3814 | HR1
72 | 72 | FLD | 0 | 255 | 2.07961 | FLD
73 | 73 | HR2 | 0 | 232 | 0.80819 | HR2
74 | 74 | FLD | 0 | 208 | 2.52837 | FLD
75 | 75 | HR1 | 0 | 2 | 0.5 | HR1
76 | 76 | FLD | 0 | 14 | 0.70714 | HR1
77 | 77 | HR1 | 0 | 37 | 3.94865 | HR1
78 | 78 | FLD | 0 | 171 | 2.84971 | FLD
79 | 79 | HR2 | 0 | 299 | 0.64482 | HR2
80 | 80 | FLD | 0 | 320 | 1.83125 | FLD
81 | 81 | HR1 | 0 | 35 | 4.18286 | HR1
82 | 82 | FLD | 0 | 193 | 2.51244 | FLD
83 | 83 | HR2 | 0 | 778 | 0.31427 | HR2
84 | 84 | FLD | 0 | 310 | 1.80097 | FLD
119 | 119 | HR1 | 0 | 39 | 3.76154 | HR1
120 | 120 | FLD | 0 | 213 | 2.11549 | FLD
121 | 121 | HR2 | 0 | 2 | 3.45 | HR2
122 | 122 | FLD | 0 | 2 | 2.7 | HR2
123 | 123 | HR2 | 0 | 57 | 3.38246 | HR2
124 | 124 | FLD | 0 | 570 | 1.02649 | FLD
125 | 125 | HR1 | 0 | 428 | 0.39556 | HR1
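To state the rule I'm after more precisely: a FLD leg with a small _FREQ_ that is sandwiched between two legs of the same headrow should be relabeled to that headrow. A rough sketch of that rule (in Python rather than SAS, purely for illustration; the _FREQ_ cutoff of 25 is a made-up number that happens to fit this subset):

```python
# Hypothetical sketch of the relabeling rule (Python, not SAS, for
# illustration only). legs is a list of (PLT, _FREQ_) pairs in PLeg
# order; freq_cutoff=25 is an invented threshold for this subset.
def relabel(legs, freq_cutoff=25):
    new_plt = [plt for plt, _ in legs]
    for i in range(1, len(legs) - 1):
        plt, freq = legs[i]
        prev_plt = new_plt[i - 1]
        next_plt = legs[i + 1][0]
        # a short FLD leg sandwiched between the same headrow on both sides
        if plt == "FLD" and freq <= freq_cutoff and prev_plt == next_plt != "FLD":
            new_plt[i] = prev_plt
    return new_plt
```

Run over legs 57-64 above, this relabels the short FLD legs (58, 60, 62) to HR2 and leaves the big FLD legs (like 64, with _FREQ_ 211) alone.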
You're right. It is an interesting problem. And I understand the headache.
Here's an approach. Since you're working with summary data sets, it looks like you can process them multiple times without having efficiency come into play.
For the summary data set, take a subset by removing the FLD records. That lets you determine the starting and ending legs for each consecutive headrow segment:
data subset;
   set original (keep=Pleg PLT) end=done;
   where PLT ne 'FLD';
   by PLT notsorted;
   /* with FLD removed, churned headrow legs collapse into one BY group */
   if first.PLT then start = Pleg;
   retain start;
   if last.PLT;
   end = Pleg;
   fmtname = 'hdr';
   label = PLT;
   output;
   /* catch-all record: any leg outside a headrow range maps to FLD */
   if done then do;
      hlo = 'O';
      label = 'FLD';
      output;
   end;
   keep start end fmtname label hlo;
run;

proc format cntlin=subset;
run;
In a way, this mimics how you would do it if you were looking at the data on paper. Use the format to assign new_PLT:
data add_new_PLT;
   set original;
   new_PLT = put(Pleg, hdr.);
run;
The code is untested, so you may have to tweak it. But the approach should work.
Good luck.
I will try that.
I have been mulling over another idea that I'm pretty sure will work, but it is far less elegant looking, more of a square peg, round hole, big robot hammer approach.
I will try yours in the next day or two and let you know.
That's a really neat solution! Like it a lot.
I probably don't understand what you are trying to do, but wouldn't the following accomplish what you want?
data want (drop=hold:);
   set have;
   retain hold_plt;
   if _n_ eq 1 or speed gt 0.15 then hold_PLT = PLT;
   new_plt = hold_plt;
run;
Art,
It looks like you understand the problem, but I think your solution relies on the speed values too much. For example, look at observations 75 through 77 where your logic would switch back and forth instead of grouping all three. The same issue would exist for observations 121 through 123.
Just for the record, iiibbb, shouldn't observation 70 have new_PLT=HR1?
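To make the switching concrete, here is that speed rule rendered in Python (an illustration only, not the SAS itself): for obs 75-77 every speed exceeds 0.15, so hold_PLT updates on every row and the FLD leg never joins the HR1 group.

```python
# Python rendering (for illustration) of the speed-threshold rule:
# hold_PLT updates whenever speed > 0.15, and new_PLT copies it.
def speed_rule(rows):
    """rows: list of (PLT, speed) pairs in PLeg order; returns new_PLT per row."""
    new_plt = []
    hold = None
    for i, (plt, speed) in enumerate(rows):
        if i == 0 or speed > 0.15:
            hold = plt
        new_plt.append(hold)
    return new_plt

# Obs 75-77: all speeds exceed 0.15, so the FLD leg is NOT regrouped.
print(speed_rule([("HR1", 0.5), ("FLD", 0.70714), ("HR1", 3.94865)]))
# -> ['HR1', 'FLD', 'HR1']  (desired: all 'HR1')
```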
122 was the only one I questioned but, yes, my code relied entirely on speed.
I figured out something that worked....
It is not elegant. Get rid of the DROP statement toward the end and you can see the progression.
data a;
   input Obs PLeg PLT $ _TYPE_ _FREQ_ speed Desired_PLT $;
   cards;
51 51 HR1 0 44 2.69545 HR1
52 52 FLD 0 295 2.03322 FLD
53 53 HR2 0 145 0.68828 HR2
54 54 FLD 0 216 2.54676 FLD
55 55 HR1 0 43 3.08837 HR1
56 56 FLD 0 301 2.04751 FLD
57 57 HR2 0 119 0.73277 HR2
58 58 FLD 0 21 0.0619 HR2
59 59 HR2 0 3 0 HR2
60 60 FLD 0 1 0.1 HR2
61 61 HR2 0 2 0.15 HR2
62 62 FLD 0 5 0.12 HR2
63 63 HR2 0 43 0.1 HR2
64 64 FLD 0 211 2.61374 FLD
65 65 HR1 0 35 3.22857 HR1
66 66 FLD 0 317 1.83533 FLD
67 67 HR2 0 331 0.37583 HR2
68 68 FLD 0 243 2.27037 FLD
69 69 HR1 0 86 2.48023 HR1
70 70 FLD 0 22 0.76364 HR1
71 71 HR1 0 43 2.3814 HR1
72 72 FLD 0 255 2.07961 FLD
73 73 HR2 0 232 0.80819 HR2
74 74 FLD 0 208 2.52837 FLD
75 75 HR1 0 2 0.5 HR1
76 76 FLD 0 14 0.70714 HR1
77 77 HR1 0 37 3.94865 HR1
78 78 FLD 0 171 2.84971 FLD
79 79 HR2 0 299 0.64482 HR2
80 80 FLD 0 320 1.83125 FLD
81 81 HR1 0 35 4.18286 HR1
82 82 FLD 0 193 2.51244 FLD
83 83 HR2 0 778 0.31427 HR2
84 84 FLD 0 310 1.80097 FLD
119 119 HR1 0 39 3.76154 HR1
120 120 FLD 0 213 2.11549 FLD
121 121 HR2 0 2 3.45 HR2
122 122 FLD 0 2 2.7 HR2
123 123 HR2 0 57 3.38246 HR2
124 124 FLD 0 570 1.02649 FLD
125 125 HR1 0 428 0.39556 HR1
;;
run;
data b;
   set a;
   /* encode the plot type: HR1=-1, FLD=0, HR2=+1 */
   if PLT="FLD" then a=0;
   else if PLT="HR1" then a=-1;
   else if PLT="HR2" then a=1;
   b = lag2(a);          /* code from two legs back */
   sumab = abs(a + b);   /* 2 when this leg and the one two back are the same headrow */
   c = lag1(sumab);
   /* carry the "same headrow on both sides" flag through a leg where sumab is 0 */
   if sumab = 0 then d = c;
   else d = sumab;
run;
proc sort;
   by descending PLeg;
run;

data c;
   drop sumab a b c d;
   set b;
   /* after the descending sort, lag1() looks one leg AHEAD in PLeg order */
   e = lag1(d);
   LagPLT = lag1(PLT);
   /* a FLD leg followed by a matching headrow pair takes that headrow's PLT */
   if e=2 and PLT="FLD" then NewPLT=LagPLT;
   else NewPLT=PLT;
run;

proc sort;
   by PLeg;
run;

proc print;
run;
Obs | Obs | PLeg | PLT | _TYPE_ | _FREQ_ | Speed | Desired_PLT | e | LagPLT | NewPLT
1 | 51 | 51 | HR1 | 0 | 44 | 2.69545 | HR1 | . | FLD | HR1
2 | 52 | 52 | FLD | 0 | 295 | 2.03322 | FLD | . | HR2 | FLD
3 | 53 | 53 | HR2 | 0 | 145 | 0.68828 | HR2 | 0 | FLD | HR2
4 | 54 | 54 | FLD | 0 | 216 | 2.54676 | FLD | 0 | HR1 | FLD
5 | 55 | 55 | HR1 | 0 | 43 | 3.08837 | HR1 | 0 | FLD | HR1
6 | 56 | 56 | FLD | 0 | 301 | 2.04751 | FLD | 0 | HR2 | FLD
7 | 57 | 57 | HR2 | 0 | 119 | 0.73277 | HR2 | 0 | FLD | HR2
8 | 58 | 58 | FLD | 0 | 21 | 0.0619 | HR2 | 2 | HR2 | HR2
9 | 59 | 59 | HR2 | 0 | 3 | 0 | HR2 | 2 | FLD | HR2
10 | 60 | 60 | FLD | 0 | 1 | 0.1 | HR2 | 2 | HR2 | HR2
11 | 61 | 61 | HR2 | 0 | 2 | 0.15 | HR2 | 2 | FLD | HR2
12 | 62 | 62 | FLD | 0 | 5 | 0.12 | HR2 | 2 | HR2 | HR2
13 | 63 | 63 | HR2 | 0 | 43 | 0.1 | HR2 | 2 | FLD | HR2
14 | 64 | 64 | FLD | 0 | 211 | 2.61374 | FLD | 0 | HR1 | FLD
15 | 65 | 65 | HR1 | 0 | 35 | 3.22857 | HR1 | 0 | FLD | HR1
16 | 66 | 66 | FLD | 0 | 317 | 1.83533 | FLD | 0 | HR2 | FLD
17 | 67 | 67 | HR2 | 0 | 331 | 0.37583 | HR2 | 0 | FLD | HR2
18 | 68 | 68 | FLD | 0 | 243 | 2.27037 | FLD | 0 | HR1 | FLD
19 | 69 | 69 | HR1 | 0 | 86 | 2.48023 | HR1 | 0 | FLD | HR1
20 | 70 | 70 | FLD | 0 | 22 | 0.76364 | HR1 | 2 | HR1 | HR1
21 | 71 | 71 | HR1 | 0 | 43 | 2.3814 | HR1 | 2 | FLD | HR1
22 | 72 | 72 | FLD | 0 | 255 | 2.07961 | FLD | 0 | HR2 | FLD
23 | 73 | 73 | HR2 | 0 | 232 | 0.80819 | HR2 | 0 | FLD | HR2
24 | 74 | 74 | FLD | 0 | 208 | 2.52837 | FLD | 0 | HR1 | FLD
25 | 75 | 75 | HR1 | 0 | 2 | 0.5 | HR1 | 0 | FLD | HR1
26 | 76 | 76 | FLD | 0 | 14 | 0.70714 | HR1 | 2 | HR1 | HR1
27 | 77 | 77 | HR1 | 0 | 37 | 3.94865 | HR1 | 2 | FLD | HR1
28 | 78 | 78 | FLD | 0 | 171 | 2.84971 | FLD | 0 | HR2 | FLD
29 | 79 | 79 | HR2 | 0 | 299 | 0.64482 | HR2 | 0 | FLD | HR2
30 | 80 | 80 | FLD | 0 | 320 | 1.83125 | FLD | 0 | HR1 | FLD
31 | 81 | 81 | HR1 | 0 | 35 | 4.18286 | HR1 | 0 | FLD | HR1
32 | 82 | 82 | FLD | 0 | 193 | 2.51244 | FLD | 0 | HR2 | FLD
33 | 83 | 83 | HR2 | 0 | 778 | 0.31427 | HR2 | 0 | FLD | HR2
34 | 84 | 84 | FLD | 0 | 310 | 1.80097 | FLD | 0 | HR1 | FLD
35 | 119 | 119 | HR1 | 0 | 39 | 3.76154 | HR1 | 0 | FLD | HR1
36 | 120 | 120 | FLD | 0 | 213 | 2.11549 | FLD | 0 | HR2 | FLD
37 | 121 | 121 | HR2 | 0 | 2 | 3.45 | HR2 | 0 | FLD | HR2
38 | 122 | 122 | FLD | 0 | 2 | 2.7 | HR2 | 2 | HR2 | HR2
39 | 123 | 123 | HR2 | 0 | 57 | 3.38246 | HR2 | 2 | FLD | HR2
40 | 124 | 124 | FLD | 0 | 570 | 1.02649 | FLD | 0 | HR1 | FLD
41 | 125 | 125 | HR1 | 0 | 428 | 0.39556 | HR1 | . |  | HR1
Art I think I owe you a beer by now.
Thanks everyone.
iiibbb,
It looks like you're comfortable with a DATA step approach. Here's an intuitive one (or should I say, intuitive to me):
data boundaries;
   set original (keep=Pleg PLT);
   by PLT notsorted;
   where PLT ne 'FLD';
   /* flag the first and last leg of each consecutive headrow group */
   first_PLT = first.PLT;
   last_PLT = last.PLT;
   keep Pleg first_PLT last_PLT;
run;

data final;
   merge original boundaries;
   by Pleg;
   if first_PLT then status = 'Header Group Begins';
   if last_PLT then status = 'Header Group Ends';
   retain status new_PLT;
   /* FLD legs inside an open headrow group keep the retained headrow value */
   if PLT =: 'HR' then new_PLT = PLT;
   else if status ne 'Header Group Begins' then new_PLT = 'FLD';
run;
Good luck.
I wish it were intuitive, but it's not.
I'm willing to learn though.
Try adding some proc prints. Examine each of the data sets, especially the new variables first_PLT, last_PLT, and status.
Good luck.