BookmarkSubscribeRSS Feed
robertrao
Quartz | Level 8

Hi...

Is the folllwing two codes doing the SAME for the below data?

if true can someone explain the do until code???????why two SET stmnts used etc etc???

1)IF we have two or more consecutive INs then we need to take the first IN

2)if we have two or more consecutive OUTS then we need to take the LAST OUT

data want;

do until(last.flag);

  set have;

  by id flag notsorted;

end;

  do until(last.flag);

  set have;

  by id flag notsorted;

  if flag='IN' and first.flag then output;

  if flag='OUT' and last.flag then output;

end;

run;

set have;

  by id flag notsorted;

  if flag='IN' and first.flag then output;

  if flag='OUT' and last.flag then output;

run;

HAVE

ID        recorded_time       flag

101   01AUG2013:10:00     IN
101   01AUG2013:14:15     IN
101   01AUG2013:17:15     OUT

102   02AUG2013:19:23     OUT
102   02AUG2013:21:04     OUT
102   11AUG2013:01:01     IN
102   11AUG2013:01:30     OUT


103  09AUG2013:19:25     IN
103  14AUG2013:14:35     OUT
103  14AUG2013:23:01     IN    
103  14AUG2013:23:12     IN
103  17AUG2013:11:20     OUT

104  02AUG2013:04:02    OUT
104  08AUG2013:15:00     IN  
104  20AUG2013:14:11     IN
104  20AUG2013:17:12     OUT

105  07AUG2013:02:50    IN
105  07AUG2013:02:55    IN
105  11AUG2013:14:00    OUT
105  16AUG2013:14:10    OUT

106  02AUG2013:08:15   OUT

107  20AUG2013:13:05   IN


108  12AUG2013:11:00   IN
108  12AUG2013:18:37   OUT
108  17AUG2013:03:55   OUT
108  22AUG2013:11:25   OUT

109  06AUG2013:18:55   IN
109  07AUG2013:11:20   OUT
109  13AUG2013:00:09   IN
109  14AUG2013:10:35   OUT

WANT

ID        recorded_time       flag


101   01AUG2013:10:00     IN

101   01AUG2013:17:15     OUT

102   02AUG2013:21:04     OUT
102   11AUG2013:01:01     IN
102   11AUG2013:01:30     OUT


103  09AUG2013:19:25     IN
103  14AUG2013:14:35     OUT
103  14AUG2013:23:01     IN   

103  17AUG2013:11:20     OUT

104  02AUG2013:04:02    OUT
104  08AUG2013:15:00     IN 

104  20AUG2013:17:12     OUT

105  07AUG2013:02:50    IN

105  16AUG2013:14:10    OUT

106  02AUG2013:08:15   OUT

107  20AUG2013:13:05   IN

108  12AUG2013:11:00   IN

108  22AUG2013:11:25   OUT

109  06AUG2013:18:55   IN
109  07AUG2013:11:20   OUT
109  13AUG2013:00:09   IN
109  14AUG2013:10:35   OUT

8 REPLIES 8
Vince28_Statcan
Quartz | Level 8

Hi Robert,

At first glance, both codes do produce the same results. The only distinction is that in the first approach doesn't increment _N_ the same way as it uses do blocks. However, the first do until block is redundant with the second. You are essentially reading however many rows there are in your by group, doing nothing with it and then reading them all again to do the desired conditionnal processing in the second do block.

It should be simplified as

data want;

  do until(last.flag);

  set have;

  by id flag notsorted;

  if flag='IN' and first.flag then output;

  if flag='OUT' and last.flag then output;

end;

As an experiment, try the following:

data want;
  do until(last.flag);
  i=_N_;
  set have;
  by id flag notsorted;
  if flag='IN' and first.flag then output;
  if flag='OUT' and last.flag then output;
end;
run;

data want2;
set have;
  i=_N_;
  by id flag notsorted;
  if flag='IN' and first.flag then output;
  if flag='OUT' and last.flag then output;
run;

This will depict the slight subtlety between the 2. Please note also that because the first code is enclosed within DO block, it means that the PDV only gets reset to missing each time _N_ iterates. Thus, if you calculated any new column from your data conditionnally, the values would be kept as though you had a retain statement by group. However, this is a non-issue with your example.

Vince

robertrao
Quartz | Level 8

Hi,

Thanks for the response. If you take 109 for example

109  06AUG2013:18:55   IN

109  07AUG2013:11:20   OUT

109  13AUG2013:00:09   IN

109  14AUG2013:10:35   OUT

data want2;

set have;

  i=_N_;

  by id flag notsorted;

  if flag='IN' and first.flag then output;

  if flag='OUT' and last.flag then output;

run;

because of the non sorted options within the same ID we have many first.'s and last.'s instead of having

one first. and one last.??????

Thanks

Vince28_Statcan
Quartz | Level 8

The not sorted allows you not to receive an error message because your data is not sorted according to your BY variables. I've assumed from your example that it was intended to get each pair of in/out and that all you wanted to remove from your data were the successive INs or OUTs.

The not sorted options allows you to retain the chronological ordering of your data and creates new by groups each time there is a change in a by group variable even if the same group existed before. So yes, it allows you to have many firsts and lasts for each by group for so long as they are separated by a different by group.

If you wanted to get strictly the earliest IN and lastest OUT for each ID rather than each pair of alternate, you would've done

proc sort data=have;

     by id flag recorded_time;

run;

data want;

     set have;

     by id flag;

     if first.flag and flag='IN' then output;

     if last.flag and flag='OUT' then output;

run;

The re-sorting by flag before the chronological series allows you to retain strictly one 'IN' and one 'OUT' value. Since your data is sorted according to your by variables, the notsorted option would have no effect.

Not sure that answers your question - hope it helps.

Vince

robertrao
Quartz | Level 8

Hi Vince,

I get what you are trying to explain...

My only concern and thought is

wont there be duplicates because of using the SET statement twice???

data want;

do until(last.flag);

  set have;

  by id flag notsorted;

end;

do until(last.flag);

  set have;

  by id flag notsorted;

  if flag='IN' and first.flag then output;

if flag='OUT' and last.flag then output;

end;

run;

/*ALSO CAN I USE THE ELSE IF instead of IF condition in the second IF condition???

Vince28_Statcan
Quartz | Level 8

No. Your output is controlled by the OUTPUT statements in the two conditionnal statements in the second DO loop. Thus, as I mentioned in my first post, the first DO block is useless code. It only slows your program down.

Since both set statement use a different copy of the same dataset, there is a single PDV with the variables in HAVE. The PDV gets overwritten for each set statement in the first do block nothing gets output. The second do block begins, reads from "another copy" of HAVE and overwrites the PDV until one of the two IF conditions are met in which case there is an output (another copy conceptually that is not exactly how it works in practice - I believe that each set statement uses a different pointer, think like an internal _N_ to the set statement so they both read through the entire dataset and the rows read in the first block have no impact on the rows read in the second block).

I'm not sure my explanation is great above but anyway it gives some insight on how the set statements apply to the PDV and why the first DO block is useless to the desired output. DO UNTIL blocks are always entered because the until condition only applies at the end of the do block. Thus, even if you are at last.flag from the first block, you still enter the second block and then last.flag is set to 0 and first.flag is set to 1 from the new row read from set have.

Vince28_Statcan
Quartz | Level 8

Yes, you could use an ELSE IF because the conditions are mutually exclusive. In fact, it would save you from testing the second condition for each first.flag thus slightly improving the program efficiency.

Be aware though, first.flag and last.flag are not mutually exclusive, the empty intersection is that of flag='IN' and flag='OUT'.

For example

if first.flag then output;

if last.flag then output;

behaves differently from

if first.flag then output;

else if last.flag then output;

In the former, anytime there is a single element in a by flag group will be output twice as it is both the first and the last flag of the group so you will absolutely always have 2 rows produced for each by flag group.

In the later, the else removes double printing whenever a group has a single element.

robertrao
Quartz | Level 8

Hi,

Thanks for the detailed explanation....

In my case even if there is a single element "I am saying

if flag='IN' and first.flag then output;

if flag='OUT' and last.flag then output;

I am specifying what the flag is so i think it would not pull any duplicates?

Thanks

Vince28_Statcan
Quartz | Level 8

Indeed I said that in the previous post. Two ifs with mutually exclusive conditions cannot both be true at the same time so you will not have duplicate output from not using an ELSE statement. You simply slightly reduce program efficiency

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 852 views
  • 0 likes
  • 2 in conversation