Re: DOW LOOP HELP

robertrao · Posted 10-09-2013 05:00 AM

Hi...

Is the folllwing two codes doing the SAME for the below data?

if true can someone explain the do until code???????why two SET stmnts used etc etc???

1)IF we have two or more consecutive INs then we need to take the first IN

2)if we have two or more consecutive OUTS then we need to take the LAST OUT

data want;

do until(last.flag);

set have;

by id flag notsorted;

end;

do until(last.flag);

set have;

by id flag notsorted;

if flag='IN' and first.flag then output;

if flag='OUT' and last.flag then output;

end;

run;

set have;

by id flag notsorted;

if flag='IN' and first.flag then output;

if flag='OUT' and last.flag then output;

run;

HAVE

ID recorded_time flag

101   01AUG2013:10:00     IN
101   01AUG2013:14:15     IN
101   01AUG2013:17:15     OUT

102   02AUG2013:19:23     OUT
102   02AUG2013:21:04     OUT
102   11AUG2013:01:01     IN
102   11AUG2013:01:30     OUT

103 09AUG2013:19:25     IN
103 14AUG2013:14:35     OUT
103 14AUG2013:23:01     IN
103 14AUG2013:23:12     IN
103 17AUG2013:11:20     OUT

104 02AUG2013:04:02    OUT
104 08AUG2013:15:00     IN
104 20AUG2013:14:11     IN
104 20AUG2013:17:12     OUT

105 07AUG2013:02:50    IN
105 07AUG2013:02:55    IN
105 11AUG2013:14:00    OUT
105 16AUG2013:14:10    OUT

106 02AUG2013:08:15 OUT

107 20AUG2013:13:05 IN

108 12AUG2013:11:00   IN
108 12AUG2013:18:37   OUT
108 17AUG2013:03:55   OUT
108 22AUG2013:11:25   OUT

109 06AUG2013:18:55   IN
109 07AUG2013:11:20   OUT
109 13AUG2013:00:09   IN
109 14AUG2013:10:35   OUT

WANT

ID recorded_time flag

101 01AUG2013:10:00 IN

101 01AUG2013:17:15 OUT

102   02AUG2013:21:04     OUT
102   11AUG2013:01:01     IN
102   11AUG2013:01:30     OUT

103 09AUG2013:19:25     IN
103 14AUG2013:14:35     OUT
103 14AUG2013:23:01     IN

103 17AUG2013:11:20 OUT

104 02AUG2013:04:02 OUT
104 08AUG2013:15:00 IN

104 20AUG2013:17:12 OUT

105 07AUG2013:02:50 IN

105 16AUG2013:14:10 OUT

106 02AUG2013:08:15 OUT

107 20AUG2013:13:05 IN

108 12AUG2013:11:00 IN

108 22AUG2013:11:25 OUT

109 06AUG2013:18:55   IN
109 07AUG2013:11:20   OUT
109 13AUG2013:00:09   IN
109 14AUG2013:10:35   OUT

Vince28_Statcan · Posted 10-09-2013 08:13 AM

Hi Robert,

At first glance, both codes do produce the same results. The only distinction is that in the first approach doesn't increment _N_ the same way as it uses do blocks. However, the first do until block is redundant with the second. You are essentially reading however many rows there are in your by group, doing nothing with it and then reading them all again to do the desired conditionnal processing in the second do block.

It should be simplified as

data want;

do until(last.flag);

set have;

by id flag notsorted;

if flag='IN' and first.flag then output;

if flag='OUT' and last.flag then output;

end;

As an experiment, try the following:

data want;
do until(last.flag);
i=_N_;
set have;
by id flag notsorted;
if flag='IN' and first.flag then output;
if flag='OUT' and last.flag then output;
end;
run;

data want2;
set have;
i=_N_;
by id flag notsorted;
if flag='IN' and first.flag then output;
if flag='OUT' and last.flag then output;
run;

This will depict the slight subtlety between the 2. Please note also that because the first code is enclosed within DO block, it means that the PDV only gets reset to missing each time _N_ iterates. Thus, if you calculated any new column from your data conditionnally, the values would be kept as though you had a retain statement by group. However, this is a non-issue with your example.

Vince

robertrao · Posted 10-09-2013 09:49 AM

Hi,

Thanks for the response. If you take 109 for example

109 06AUG2013:18:55 IN

109 07AUG2013:11:20 OUT

109 13AUG2013:00:09 IN

109 14AUG2013:10:35 OUT

data want2;

set have;

i=_N_;

by id flag notsorted;

if flag='IN' and first.flag then output;

if flag='OUT' and last.flag then output;

run;

because of the non sorted options within the same ID we have many first.'s and last.'s instead of having

one first. and one last.??????

Thanks

Vince28_Statcan · Posted 10-09-2013 10:15 AM

The not sorted allows you not to receive an error message because your data is not sorted according to your BY variables. I've assumed from your example that it was intended to get each pair of in/out and that all you wanted to remove from your data were the successive INs or OUTs.

The not sorted options allows you to retain the chronological ordering of your data and creates new by groups each time there is a change in a by group variable even if the same group existed before. So yes, it allows you to have many firsts and lasts for each by group for so long as they are separated by a different by group.

If you wanted to get strictly the earliest IN and lastest OUT for each ID rather than each pair of alternate, you would've done

proc sort data=have;

by id flag recorded_time;

run;

data want;

set have;

by id flag;

if first.flag and flag='IN' then output;

if last.flag and flag='OUT' then output;

run;

The re-sorting by flag before the chronological series allows you to retain strictly one 'IN' and one 'OUT' value. Since your data is sorted according to your by variables, the notsorted option would have no effect.

Not sure that answers your question - hope it helps.

Vince

robertrao · Posted 10-09-2013 10:57 AM

Hi Vince,

I get what you are trying to explain...

My only concern and thought is

wont there be duplicates because of using the SET statement twice???

data want;

do until(last.flag);

set have;

by id flag notsorted;

end;

do until(last.flag);

set have;

by id flag notsorted;

if flag='IN' and first.flag then output;

if flag='OUT' and last.flag then output;

end;

run;

/*ALSO CAN I USE THE ELSE IF instead of IF condition in the second IF condition???

Vince28_Statcan · Posted 10-09-2013 11:09 AM

No. Your output is controlled by the OUTPUT statements in the two conditionnal statements in the second DO loop. Thus, as I mentioned in my first post, the first DO block is useless code. It only slows your program down.

Since both set statement use a different copy of the same dataset, there is a single PDV with the variables in HAVE. The PDV gets overwritten for each set statement in the first do block nothing gets output. The second do block begins, reads from "another copy" of HAVE and overwrites the PDV until one of the two IF conditions are met in which case there is an output (another copy conceptually that is not exactly how it works in practice - I believe that each set statement uses a different pointer, think like an internal _N_ to the set statement so they both read through the entire dataset and the rows read in the first block have no impact on the rows read in the second block).

I'm not sure my explanation is great above but anyway it gives some insight on how the set statements apply to the PDV and why the first DO block is useless to the desired output. DO UNTIL blocks are always entered because the until condition only applies at the end of the do block. Thus, even if you are at last.flag from the first block, you still enter the second block and then last.flag is set to 0 and first.flag is set to 1 from the new row read from set have.

Vince28_Statcan · Posted 10-09-2013 11:14 AM

Yes, you could use an ELSE IF because the conditions are mutually exclusive. In fact, it would save you from testing the second condition for each first.flag thus slightly improving the program efficiency.

Be aware though, first.flag and last.flag are not mutually exclusive, the empty intersection is that of flag='IN' and flag='OUT'.

For example

if first.flag then output;

if last.flag then output;

behaves differently from

if first.flag then output;

else if last.flag then output;

In the former, anytime there is a single element in a by flag group will be output twice as it is both the first and the last flag of the group so you will absolutely always have 2 rows produced for each by flag group.

In the later, the else removes double printing whenever a group has a single element.

robertrao · Posted 10-09-2013 11:21 AM

Hi,

Thanks for the detailed explanation....

In my case even if there is a single element "I am saying

if flag='IN' and first.flag then output;

if flag='OUT' and last.flag then output;

I am specifying what the flag is so i think it would not pull any duplicates?

Thanks

Vince28_Statcan · Posted 10-09-2013 11:37 AM

Indeed I said that in the previous post. Two ifs with mutually exclusive conditions cannot both be true at the same time so you will not have duplicate output from not using an ELSE statement. You simply slightly reduce program efficiency

Registration is open

SAS Training: Just a Click Away