Solved: Proc Transpose - Page 2

Tom · Posted 08-22-2013 03:03 PM

IF first.id then GROUP=1;

ELSE IF .... THEN GROUP+1;

If you leave out the ELSE then you could cause a double increment.

robertrao · Posted 08-22-2013 03:19 PM

Thanks for the reply

So

if first.id then group=0;

if first.id or flag='IN' or (flag='OUT' and flag=lag(flag)) then group+1 ;

IS SAME AS BELOW?????

IF first.id then GROUP=1;

ELSE IF flag='IN' or (flag='OUT' and flag=lag(flag)) then group+1 ;

Also while doing the imputations we do it on the final transposed dataset??

when we say out(i) - 1; how can we guarentee that its subtracting minutes since out is a time value???

Tom · Posted 08-22-2013 04:48 PM

You can use the INTNX() function to make sure that you are converting datetime values properly.

SAS(R) 9.3 Functions and CALL Routines: Reference

Or since datetime values are stored as a number of seconds, you could just add or subtract 60 to represent 1 minute instead of 1 which would represent one second.

Tom · Posted 08-23-2013 11:03 PM

BUT you do NOT want to use the ELSE clause because of the inclusion of the LAG() function in the IF conditional.

Read up on how the LAG function works. It does NOT retrieve the value from the previous observation, but instead retrieves it from a stack of values created be the previous executions of that LAG() function. So it is normally a mistake to place a LAG() function call were it will not execute every time through the data step (such as an ELSE clause) .

robertrao · Posted 08-22-2013 05:03 PM

Also TOM

I am pasting the log

If i understand correctly the highlighted line in the log for Array method should have i=4 BUT WHY DOES IT SHOW AS I=3 ??????????????

last.id=0 ID=101 flag=IN time=04SEP89:07:30:00 FIRST.ID=1 in1=04SEP89:07:30:00 in2=. in3=. in4=. in5=. in6=. in7=. out1=. out2=. out3=. out4=. out5=. out6=. out7=.
i=1 _ERROR_=0 _N_=1

last.id=0 ID=101 flag=IN time=04SEP89:13:45:00 FIRST.ID=0 in1=04SEP89:07:30:00 in2=04SEP89:13:45:00 in3=. in4=. in5=. in6=. in7=. out1=. out2=. out3=. out4=.
out5=. out6=. out7=. i=2 _ERROR_=0 _N_=1

last.id=0 ID=101 flag=OUT time=05SEP89:07:15:00 FIRST.ID=0 in1=04SEP89:07:30:00 in2=04SEP89:13:45:00 in3=. in4=. in5=. in6=. in7=. out1=. out2=. out3=. out4=.
out5=. out6=. out7=. i=2 _ERROR_=0 _N_=1

last.id=0 ID=101 flag=IN time=21SEP89:17:55:00 FIRST.ID=0 in1=04SEP89:07:30:00 in2=04SEP89:13:45:00 in3=21SEP89:17:55:00 in4=. in5=. in6=. in7=. out1=.
out2=05SEP89:07:15:00 out3=. out4=. out5=. out6=. out7=. i=3 _ERROR_=0 _N_=1

last.id=0 ID=101 flag=OUT time=22SEP89:06:00:00 FIRST.ID=0 in1=04SEP89:07:30:00 in2=04SEP89:13:45:00 in3=21SEP89:17:55:00 in4=. in5=. in6=. in7=. out1=.
out2=05SEP89:07:15:00 out3=. out4=. out5=. out6=. out7=. i=3 _ERROR_=0 _N_=1

last.id=0 ID=101 flag=OUT time=23SEP89:06:00:00 FIRST.ID=0 in1=04SEP89:07:30:00 in2=04SEP89:13:45:00 in3=21SEP89:17:55:00 in4=. in5=. in6=. in7=. out1=.
out2=05SEP89:07:15:00 out3=22SEP89:06:00:00 out4=. out5=. out6=. out7=. i=3 _ERROR_=0 _N_=1

last.id=1 ID=101 flag=IN time=24SEP89:06:00:00 FIRST.ID=0 in1=04SEP89:07:30:00 in2=04SEP89:13:45:00 in3=21SEP89:17:55:00 in4=. in5=24SEP89:06:00:00 in6=. in7=.
out1=. out2=05SEP89:07:15:00 out3=22SEP89:06:00:00 out4=23SEP89:06:00:00 out5=. out6=. out7=. i=5 _ERROR_=0 _N_=1

robertrao · Posted 08-27-2013 02:35 PM

Hi Tom,

I understood the code you suggested

But when I apply it on my data it gives me an error like this.

IT SEEMS TO NE NOT WORKING IF THE FIRST OF ID STARTS WITH AN OUT!!!!

Could you point me where i am doing a mistake????

478 data final ;
479    do until (last.id);
480       set &month2 ;
481         by id ;
482
483 array in (6);
NOTE: The array in has the same name as a SAS-supplied or user-defined function. Parentheses following this name are treated as array references and not function
      references.
484 array out (6) ;   /*length of the array has to be adjusted to the maximum number of records per ID in the dataset*/
485     if first.id then i=0;
486     if flag='IN' then do;
487        i=i+1;
488        in(i)=recorded_time;
489         end;
490 if flag='OUT' then do;
491 if out(i)^=. then i=i+1;
492           out(i)=recorded_time;
493            end;
494              end;
495 keep id in: out: ;
496 format in: out: datetime.;
497 run;

ERROR: Array subscript out of range at line 491 column 4.

last.id=0 flag=OUT RECORDED_TIME=02JUL2013:10:15:00.000 ID=110 FIRST.ID=1 in1=. in2=. in3=. in4=. in5=. in6=. out1=.
out2=. out3=. out4=. out5=. out6=. i=0 _ERROR_=1 _N_=46

NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 90 observations read from the data set WORK.JUL.
WARNING: The data set FINAL may be incomplete. When this step was stopped there were 45 observations and 13 variables.

WARNING: Data set WORK.FINAL was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      Memory                            242k
      OS Memory                         18292k
      Timestamp            8/27/2013 1:29:57 PM

110 ID has the following order:

flag Rec_Time

OUT 02JUL

IN 08JUL

OUT 12JUL

Vince28_Statcan · Posted 08-27-2013 02:51 PM

This is because the array iterator begins at i=0 and only iterates to 1 before subscripting the array when the flag for the first record of a by-group (ID) is "IN". It appears as though you have cases in your data where in1=. out1=some date. So basically, SAS attempts to retrive out{0} but array subscripts in SAS go from 1 to dim(arrayname) rather than from 0.

Real quick without any testing, I believe you could fix your issue by changing the following segment

484 array out (6) ; /*length of the array has to be adjusted to the maximum number of records per ID in the dataset*/

485 if first.id then i=0;

486 if flag='IN' then do;

487 i=i+1;

488 in(i)=recorded_time;

489 end;

to

array out (6);

if first.id then i=1;

if flag='IN' then do;

in(i)=recorded_time;

i=i+1;

end;

Notice the reordering as well as the index starting point changes.

At first glance, this logic change on the iterator start shouldn't impact on the segment for flag='out'

You may still get subscript out of range if you ever have more than 6 set of data for a given ID. You can use a proc freq by ID and take the max value as a safe bet for your array dimensions if that ever occurs.

Vincent

robertrao · Posted 08-27-2013 03:03 PM

Hi Vince,

That was a fast reply. Thanks so much

Do you think the code below will work for all the scenarios i have wanted(previous cases and the one i faced now)???

I need not change anything excepts for the bolded part???/

Please let me know if anything more is required

data want ;

do until (last.id);

set have ;

by id;

array in (6);

array out (6) ;

if first.id then i=1;
if flag='IN' then do;
in(i)=recorded_time;
i=i+1;
end;

if flag='OUT' then do;
if out(i)^=. then i=i+1;
out(i)=recorded_time;
end;
end;

keep id in: out: ;

format in: out: datetime.;

run;

robertrao · Posted 08-27-2013 03:09 PM

But Vince,

The code u suggested wont give me the result i am looking!!!!

Example:

if you try the code on the below data I wanted

IN1 OUT1 IN2 OUT2

. 02JUL2013:10:15 08JUL2013:12:20 12JUL2013:13:00

BUT I GET:

IN1 OUT1 IN2 OUT2

08JUL2013:12:20 02JUL2013:10:15 . 12JUL2013:13:00

data have;

input ID $ flag $ time :datetime. cnt;

format time datetime.;

datalines;

101 OUT 02JUL2013:10:15:00.000 1

101 IN 08JUL2013:12:20:00.000 3

101 OUT 12JUL2013:13:00:00.000 2

;

run;

Vince28_Statcan · Posted 08-27-2013 03:42 PM

Hi Robert,

I definitely overlooked that case. It should be easy to fix by adding a condition to increment the 'IN' case when 'OUT' already has a value.

data want ;

do until (last.id);

set have ;

by id;

array in (6);

array out (6) ;

if first.id then i=0;
if flag='IN' then do;

i=i+1;
in(i)=recorded_time;

end;

if flag='OUT' then do;

if i=0 then i=i+1;
if out(i)^=. then i=i+1;
out(i)=recorded_time;
end;
end;

keep id in: out: ;

format in: out: datetime.;

run;

I'm sorry I didn't think of it in the first reply.

Basically, the idea was that Tom's code expected your data to always have, at the very least, one 'IN' before and 'OUT' for each ID. I fixed the array subscript issue but had forget to account for what happens when you read a 'IN' immediately after a 'OUT' as first value. I'm going to think about it some more to make sure I didn't overlook anything with this additionnal fix but it will probably be faster for you to just test the code and come back if I overlooked something.

As I put it up, I realized I was still missing out on a case. I reverted back a few things towards Tom's code and changed the way to handle the special case where the first record for an ID is a 'OUT' flag. Reverts are bolded in the code. The new net change from Tom's original code should be the if i=0 then i=i+1; line.

Vincent

robertrao · Posted 08-27-2013 03:58 PM

Awesome Vince....

It works true for this below data and i guess it works for all the conditions i wanted!!!!

Thanks so much

data have;
input ID $ flag $ time :datetime. cnt;
format time datetime.;
datalines;

101    IN        04Sep1989:7:30        1
101    IN        04Sep1989:13:45       2
101    IN        21SEP1989:17:55       3
101   OUT        05SEP1989:7:15        1
101   OUT        22SEP1989:06:00       2
101   OUT        23SEP1989:06:00      12
101    IN        24SEP1989:06:00      15

102   OUT        02JUL2013:10:15       1
102    IN        08JUL2013:12:20       3
102   OUT        12JUL2013:13:00       2
;
run;

robertrao · Posted 08-27-2013 04:57 PM

Hi Vince,

I went back and saw that Tom had considered that case of an ID having an OUT without the IN

I was surprised how he missed it in the array code????

In His explanation.........

To prevent IN > OUT you always want to start a new pair (group) when see an IN record. But when you see and OUT record it could be the OUT for the current pair or an unmatched OUT that needs a new pair. One way to tell this is if the current pair already has an OUT time value. So we test if OUT(I) has a value when processing an OUT record, but we don't care for IN record as we know that it marks a new pair.

So a new group starts when
•the first record for an ID group (FIRST.ID)
•any IN record (FLAG='IN')
•an OUT record when it is not preceded by an IN record. (FLAG='OUT' and LAG(FLAG)='OUT')

To avoid incrementing the group counter twice when more than one of the conditions apply to the same record the program first sets it to 0 at the start of a ID group. Then it can use the same logic test about whether to increment the group counter for every record it processes.

Vince28_Statcan · Posted 08-28-2013 07:43 AM

Hi Robert,

The reference to Tom in post #34 is how he handled cases where there would be 2 'OUT' in a sequence. However, it does not discuss handling the special case where the very first event has no 'IN' data. He has actually come up with the same alternative as I did in post #35.

As for the proc transpose. On a large dataset where "at least one occurence" of, say, IN4 has data, the variable will be created. In the unlikely event that for all ID, an IN# or OUT# has all occurences missing, you would need to add an additionnal data step to create a "dummy row". Basically, the variables are only created by the transpose procedure if the ID is is encountered at least once in the data. Here is one of many ways that you could use to create such dummy rows

data middle2;
set middle;
by id group;
if first.group and last.group and flag='IN' then do;
output;
flag='OUT'; time=.; cnt=cnt+.1; output;
end;
else if first.group and flag='OUT' then do;
output;
flag='IN'; time=.; cnt=cnt-.1; output;
end;
run;

proc transpose data=middle2 out=want (drop=_name_);

by id;

id flag group ;

var time;
run;

This will add a whole bunch of records with a missing time variable for given groups. Don't mind the +.1 and -.1 for cnt=cnt±.1; that was only for continuity of middle2 serves for anything besides the transpose. It allows you to redo a sort by id cnt and achieve the logical/sequential results. It could very well not have been modified or set to missing instead and the proc transpose would've achieved the same result.

Vincent

robertrao · Posted 08-28-2013 08:04 AM

Hi Vince,

I have confusion . Basically I can use either of the below two methods?right?????

/*METHOD1*/

data want ;

do until (last.id);

set have ;

by id;

array in (6);

array out (6) ;

if first.id then i=0;
if flag='IN' then do;

i=i+1;
in(i)=recorded_time;

end;

if flag='OUT' then do;

if i=0 then i=i+1;
if out(i)^=. then i=i+1;
out(i)=recorded_time;
end;
end;

keep id in: out: ;

format in: out: datetime.;

run;

/*METHOD2*/

data middle2;
set middle;
by id group;
if first.group and last.group and flag='IN' then do;
output;
flag='OUT'; time=.; cnt=cnt+.1; output;
end;
else if first.group and flag='OUT' then do;
output;
flag='IN'; time=.; cnt=cnt-.1; output;
end;
run;

proc transpose data=middle2 out=want (drop=_name_);

by id;

id flag group ;

var time;
run;

If what i understood is correct could you tell me about cnt=cnt-.1 and cnt=cnt+.1???????

Thanks

Vince28_Statcan · Posted 08-28-2013 08:36 AM

Hi Robert,

Yes, both method should achieve your desired results. The main reason why Tom has suggested the transpose method is that the array method requires you to know, a priori, the maximum of groups for any given day (so as to set the array dimensions). You could, for instance, not care to add additionnal missing columns to your dataset and simply set a value "high enough" that it will never be reached but that depends on your requirements and what further processing you may want to do on this data.

The transpose method has the advantage of creating variables only for each distinct ID (as defined by the variables in the ID statement of the proc transpose, not the ID variable in your dataset). So for instance, if you used this for 2 distinct datasets (lets say year 2011 and year 2012) and in year 2011, the maximum number of in/out pair for an ID is 30 and in 2012, it is instead 42, the transpose method on the 2011 dataset will provide you with a dataset with 61 columns and 2012 with 85 columns.

On the other hand, if you used the array method and set

array in (50);

array out (50) ;

"just to be safe" as you didn't know, a priori, how many distinct events occured on any given ID, then both 2011 and 2012 data would yield a dataset with 101 columns with the array method. You would effectively have the last 40 (resp. 16) columns missing for all records in the 2011 (resp. 2012) resulting datasets.

So really, the approach to use depends on your task requirements and your a priori knowledge of your data.

As to go back to the cnt=cnt-.1 and cnt=cnt+.1, looking back at the data you have provided, I realize it does not achieve what I wanted as your cnt variable is not a counter of sequential events for an ID but rather a counter of sequential events for an ID FLAG pair. The underlying logic would've been something like this:

101 IN 04Sep1989:7:30 1

101 IN 04Sep1989:13:45 2

101 IN 21SEP1989:17:55 3

101 OUT 05SEP1989:7:15 1

101 OUT 22SEP1989:06:00 2

101 OUT 23SEP1989:06:00 12

101 IN 24SEP1989:06:00 15

If you sort this by id time, you get

101 IN 04Sep1989:7:30 1

101 IN 04Sep1989:13:45 2

101 OUT 05SEP1989:7:15 1

101 IN 21SEP1989:17:55 3

101 OUT 22SEP1989:06:00 2

101 OUT 23SEP1989:06:00 12

101 IN 24SEP1989:06:00 15

Now imagine that you had a different "counter" that represents the sequence of events within a given ID as follow

NEWCNT

101 IN 04Sep1989:7:30 1 1

101 IN 04Sep1989:13:45 2 2

101 OUT 05SEP1989:7:15 1 3

101 IN 21SEP1989:17:55 3 4

101 OUT 22SEP1989:06:00 2 5

101 OUT 23SEP1989:06:00 12 6

101 IN 24SEP1989:06:00 15 7

The logic of the +.1 -.1 was that the created records with a missing date would allow you to sort by id newcnt to achieve the appropriate sorting by time within an ID even though you had a bunch of missing time.

101 IN 04Sep1989:7:30 1 1

101 OUT . . 1.1

101 IN 04Sep1989:13:45 2 2

101 OUT 05SEP1989:7:15 1 3

101 IN 21SEP1989:17:55 3 4

101 OUT 22SEP1989:06:00 2 5

101 IN . . 5.9

101 OUT 23SEP1989:06:00 12 6

101 IN 24SEP1989:06:00 15 7

101 OUT . . 7.1

Due to the way my data step is built for middle2, the IN records with missing time are output after their respective OUT records so the way to rebuild the sequence of events with missing values would be through something similar. Again though, it will not work as intended after review since your CNT variable is not built the way I had in mind when I wrote the code. Nonetheless, it is irrelevant to the proc transpose. It is merely a conceptual tool if you wanted to have a full sequence of events vertically rather than horizontally.

Vincent

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Re: Proc Transpose

Classroom Training Available!