Solved: data step input several observations %2F many variables in one record

rjandl · Posted 12-01-2015 11:38 AM

Dear colleagues

I have a data file with several variables each having multiple observations in one record.

The dataline statement has 'car_type' 'car_color' afterwards three consecutive informations on 'mile'

afterwards 'car_dirt' and again three consecutive informations on 'oil'

my strategy was using '@' in order to hold the line and read the multiple observations in do-loops.

The output example shows that only the first variable with multiple observations is read properly.

what am I doing wrong?

all the best

Robert

data new;
input car $ color $ @;

do i = 1 to 3;
input miles @;
output;
end;
input dirty $ @;

do j = 1 to 3;
   input oil ;
   output;
end;

datalines;
VW green   25 35 28 yes 0.25 0.35 0.28
bmw yellow 15 25 18 no 0.15 0.25 0.18
;
run;
proc print;
run;

OUTPUT

              Beob.    car    color    i    miles    dirty    j     oil

                     1      VW     green    1      25              .     .
                     2      VW     green    2      35              .     .
                     3      VW     green    3      28              .     .
                     4      VW     green    4      28      yes     1    0.25
                     5      VW     green    4      28      yes     2     .

PGStats · Posted 12-01-2015 12:34 PM

"I'd consider reading it all as one line and then parsing. Curious to see the methods to read it and separate it at once though "

data want;
array m{3} miles1-miles3;
array o{3} oil1-oil3;
input car $ color $ miles1-miles3 dirty $ oil1-oil3;
do i = 1 to 3;
    miles = m{i};
    oil = o{i};
    output;
    end;
drop i miles1-miles3 oil1-oil3;
datalines;
VW green   25 35 28 yes 0.25 0.35 0.28
bmw yellow 15 25 18 no  0.15 0.25 0.18
;

proc print data=want noobs; run;

PG

View solution in original post

Reeza · Posted 12-01-2015 11:49 AM

I'd consider reading it all as one line and then parsing. Curious to see the methods to read it and separate it at once though 🙂

data have;
input car $ color $ miles1-miles3 dirty $ oil1-oil3;

datalines;
VW green   25 35 28 yes 0.25 0.35 0.28
bmw yellow 15 25 18 no  0.15 0.25 0.18
;
run;

proc print data=have;
run;

data want;
set have;
array o(3) oil1-oil3;
array m(3) miles1-miles3;

do i=1 to 3;
oil=o(i);
miles=m(i);
output;
end;

drop miles1-miles3 oil1-oil3 i;
run;

proc print data=want;
run;

PGStats · Posted 12-01-2015 12:34 PM

"I'd consider reading it all as one line and then parsing. Curious to see the methods to read it and separate it at once though "

data want;
array m{3} miles1-miles3;
array o{3} oil1-oil3;
input car $ color $ miles1-miles3 dirty $ oil1-oil3;
do i = 1 to 3;
    miles = m{i};
    oil = o{i};
    output;
    end;
drop i miles1-miles3 oil1-oil3;
datalines;
VW green   25 35 28 yes 0.25 0.35 0.28
bmw yellow 15 25 18 no  0.15 0.25 0.18
;

proc print data=want noobs; run;

PG

Reeza · Posted 12-01-2015 01:15 PM

I don't know why I always forgot that you can process data in the same data step that you import data...

PGStats · Posted 12-01-2015 02:32 PM

I notice that we don't see many examples where processing tools like SET, MERGE, INPUT and UPDATE are combined in the same data step. That's one thing I'd like to explore though.

PG

rjandl · Posted 12-02-2015 03:31 AM

thanks

rjandl · Posted 12-02-2015 03:31 AM

thanks a lot!

FreelanceReinh · Posted 12-01-2015 04:10 PM

Hello Robert,

As to your question ("what am I doing wrong?"): After reading CAR, COLOR and the first MILES value (all with the so called "list input") you force SAS to write the current content of the program data vector to dataset NEW (by using an OUTPUT statement). Therefore, DIRTY and OIL (and J, for that matter) have missing values in the first observation.

Then, similarly, the second and third MILES value are read and written to the second and third observation, respectively. The values of CAR and COLOR are repeated (without requiring a RETAIN), because we are still in the same, first iteration of the data step.

Now, DIRTY is read (from the next column). Thanks to the trailing @ line-hold specifiers that have been used so far, the pointer is still located in the first data line, at the appropriate position to read the first OIL value, which is done with the INPUT statement in the first round of the DO J loop. The subsequent OUTPUT statement writes the fourth observation. Again, CAR and COLOR and now also I and MILES "retain" their values. (Please note that variable I had been incremented to 4 at the end of the DO I loop after the last OUTPUT statement in that loop.) Since you forgot (?) the trailing @ after "oil", SAS moves the pointer to the beginning of the next data line.

However, the task of the next INPUT statement (in the second round of the DO J loop) is to read another value of the numeric variable OIL. The attempt to store the character value "bmw" in OIL must fail, obviously, and causes

1. a note in the log:

NOTE: Invalid data for oil in line 17 1-3.

(The line number depends on the session history, but the column range 1-3 indicates where "bmw" is located.)

2. a missing value in variable OIL, which is written (by the OUTPUT statement) to the 5th observation of NEW, together with all the other "retained" values, as we are still in the first iteration of the data step.

Again, due to the missing trailing @ after "oil", the pointer is moved to the next data line, but there is none, so it is kind of nowhere! This would be no problem if the first iteration of the data step was finished. But, in fact, we haven't executed the third round of the DO J loop yet. This is attempted now, but the INPUT statement (input oil;) is bound to fail (the pointer being "nowhere"). SAS documents this situation accurately in the log:

NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+--- ...
18 ;
car=VW color=green i=4 miles=28 dirty=yes j=3 oil=. _ERROR_=1 _N_=1

Due to the error condition encountered with the INPUT statement, the data step terminates and the OUTPUT statement (in the third round of the DO J loop) is not executed. That's why we end up with 5 observations and the variable values as explained.

I agree with @Reeza and @PGStats that it is favorable to read the values first into different variables (using pure list input) and then populate variables MILES and OIL.

But, of course, SAS is capable of populating these two variables directly:

/* Using trailing @ line-hold specifiers, OUTPUT statements and @n pointer controls */

data new;
input     car   $
          color $
          miles
      @21 dirty $
          oil @;
output;
input @15 miles
      @30 oil @;
output;
input @18 miles
      @35 oil;
output;
datalines;
VW green   25 35 28 yes 0.25 0.35 0.28
bmw yellow 15 25 18 no  0.15 0.25 0.18
;

The line-hold specifiers keep the pointer in the same data line, until the last value has been read. OUTPUT statements are necessary to create multiple observations of dataset NEW within the same iteration of the data step. A significant drawback of this solution is that we need @n pointer controls to jump to the appropriate positions in the input lines for certain values. This requires that the corresponding values are located in separate columns, which luckily is the case here.

To avoid repeating the names of variables MILES and OIL, we could read them in a DO loop, utilizing the fact that their respective column widths are constant:

/* Using @(expression) pointer controls in a DO loop */

data new;
do _n_=0 to 2;
  input @1                 car   $
                           color $
        @(12+3*mod(_n_,3)) miles
        @21                dirty $
        @(25+5*mod(_n_,3)) oil @;
  output;
end;
datalines;
VW green   25 35 28 yes 0.25 0.35 0.28
bmw yellow 15 25 18 no  0.15 0.25 0.18
;

The above data step reads the same values of CAR, COLOR and DIRTY three times, but this doesn't matter. The @1 pointer control is required. After the loop has been executed three times for the first data line, the iteration of the data step ends and the pointer moves automatically to the next line.

Alternatively, we could use the double trailing @ line-hold specifier and omit the OUTPUT statement:

/* Using a double trailing @ line-hold specifier and no OUTPUT statement */

data new;
retain i 0;
input @1        car   $
                color $
      @(12+3*i) miles
      @21       dirty $
      @(25+5*i) oil @@;
i=mod(i+1,3);
if i=0 then input; /* empty INPUT statement to release the @@ */
drop i;
datalines;
VW green   25 35 28 yes 0.25 0.35 0.28
bmw yellow 15 25 18 no  0.15 0.25 0.18
;

Here, the observations are written to the output dataset because this happens automatically at the end of each iteration of the data step. The @@ keeps the pointer on the same data line across data step iterations. Therefore, we have to release it "manually" by executing an empty INPUT statement after the last value of a data line has been read.

(Edit: Only minor changes to the wording.)

rjandl · Posted 12-02-2015 03:32 AM

thanks for the detailed response with many insights in the DATA step.

data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Re: data step input several observations %2F many variables in one record

Classroom Training Available!