Hello DingDing,

First of all, you should be aware that there are four different styles of reading raw data with the INPUT statement: column, list, formatted and named input. These are described in the online documentation on the INPUT statement and in the more detailed pages linked from there. (It is even possible to mix two or more of these styles within a single INPUT statement.) Working in different industries during the past 18 years, I have mostly used list or formatted input, less frequently column input and only rarely named input.

What you are using in your example is called formatted input. To understand when and where so-called column pointer controls such as the "+1" are required, it is important to know how the column pointer (a kind of invisible "cursor") moves across the raw data while the INPUT statement is executed. To quote the documentation on formatted input: "The pointer moves the length that the informat specifies and stops at the next column."

Let's see what this means in practice by looking at your example (actually, RW9 did that already while I was typing this lengthy post).

Detailed explanations for your INPUT statement "with +1":

At the start, the column pointer is located at the beginning of line 1. As you specified formatted input using informat $16. for reading variable NAME, the informat is applied to the first 16 characters ("columns") of line 1, because 16 is the length of that informat. The first 16 characters contain the name "Alicia Grossman" (length=15) followed by a single blank, so this is what is stored in variable NAME. (Apparently, the "$16." has been thoughtfully chosen in view of the longest name in the data, "Elizabeth Garcia".)

Next is AGE, read with informat 3., which looks at the next 3 columns (no. 17 - 19). Thanks to the alignment in pumpkin.txt, all of these numbers are read correctly, in particular the "13" in line 1.
Now, the column pointer is located at column 20 ("between" columns 19 and 20, if you like), but there is nothing of interest in column 20, only a blank, in all lines of the .txt file. In order to read the next relevant portion of the line (the single character "c" in column 21) into variable TYPE with informat $1., the column pointer must be moved forward by 1 column. This is what the "+1" pointer control does.

Same situation after reading TYPE: The column pointer is located between columns 21 and 22, ready to continue reading at column 22. But the length-10 date value "10-28-2012" starts only in column 23! Therefore, the informat MMDDYY10. used for reading variable DATE would look at the wrong 10 columns (namely columns 22 - 31) if we didn't move the pointer one more position to the right (the second "+1")! Having read the date, the pointer rests between columns 32 and 33.

Finally, 5 variables are to be read, all with informat 4.1, which has length 4.

Side note: Using a w.d informat such as 4.1 is risky if some values might not contain a decimal point. For example, if the value 8.0 were written simply as 8 in your data, it would be read as 0.8 without any warning! This is because, in the absence of a decimal point, SAS regards the rightmost d digits (here: d=1) as decimals. I strongly recommend using informat 4. instead (and, in general, informat w. with an appropriate width w), because it recognizes decimal points and cannot cause this error. [End of side note]

The content of columns 33 - 36, i.e. the number 7.8 preceded by a blank, is read next; this is suitable content for the numeric variable SCORE1. Similarly, the remaining four blocks of four columns each (37 - 40, 41 - 44, 45 - 48 and 49 - 52) are read into variables SCORE2-SCORE5. The latter being the last variable in the INPUT statement, the pointer now moves to column 1 of line 2 and is ready for reading the next record in the same way.
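To try out the walkthrough and the side note above, here is a minimal sketch. It is my reconstruction from the description in this post, not a copy of your program: the data set names are arbitrary, the two data lines only stand in for pumpkin.txt, and the values that are not quoted in this thread are placeholders I chose so that the columns line up as described.

```sas
/* Sketch: formatted input with "+1" pointer controls.               */
/* The two data lines imitate the layout of pumpkin.txt; values not  */
/* quoted in this thread are placeholders.                           */
data contest;
   infile datalines truncover;
   input Name $16.               /* columns  1-16                    */
         Age  3.                 /* columns 17-19                    */
         +1                      /* skip blank column 20             */
         Type $1.                /* column  21                       */
         +1                      /* skip blank column 22             */
         Date mmddyy10.          /* columns 23-32                    */
         (Score1-Score5) (4.1);  /* columns 33-52, 4 columns each    */
   format Date mmddyy10.;
   datalines;
Alicia Grossman  13 c 10-28-2012 7.8 6.5 7.2 8.0 7.9
Jose Martinez     7 d 10-31-2012 8.9 9.510.0 9.7 9.0
;
run;

/* Side note: w.d versus w. when the data have no decimal point.     */
data _null_;
   with_d    = input('8',   4.1);  /* no decimal point: read as 0.8  */
   without_d = input('8',   4.);   /* read as 8                      */
   explicit  = input('8.0', 4.1);  /* decimal point present: 8       */
   put with_d= without_d= explicit=;
run;
```

Printing CONTEST (e.g. with PROC PRINT) and looking at the PUT output in the log should confirm both points; removing the two "+1" from the first step reproduces the shifted columns discussed in this thread.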
Unlike the human eye, SAS is not at all confused by what looks like a "missing gap" between values 9.5 and 10.0 of "Jose Martinez". The characters " 9.5" and "10.0" belong to distinct blocks of columns: 37 - 40 and 41 - 44, respectively (see above). With formatted input there is no need to separate them.

Sometimes the use of pointer controls such as "+1" can be reduced by using correspondingly longer informats. You can see an example of this by comparing your first INPUT statement to that in RW9's first data step: He inserted a "+1" between NAME and AGE, because he reads AGE only with informat 2. rather than 3., so that the blank in column 17 must be skipped. You read this blank column into variable AGE, which does no harm and makes no difference to the stored numeric value.
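The trade-off between a "+1" and a wider informat can be sketched as follows. Please note that this is not RW9's actual code, only my illustration of the variant with "+1" and informat 2. for AGE; the data set name is arbitrary and the data lines are the same kind of placeholders as before (only the values quoted in this thread are taken from it).

```sas
/* Sketch: AGE read with "+1" and informat 2. instead of informat 3. */
data contest_v2;
   infile datalines truncover;
   input Name $16.               /* columns  1-16                    */
         +1                      /* skip column 17 (always blank)    */
         Age  2.                 /* columns 18-19                    */
         +1                      /* skip blank column 20             */
         Type $1.                /* column  21                       */
         +1                      /* skip blank column 22             */
         Date mmddyy10.          /* columns 23-32                    */
         (Score1-Score5) (4.1);  /* columns 33-52, 4 columns each    */
   format Date mmddyy10.;
   datalines;
Alicia Grossman  13 c 10-28-2012 7.8 6.5 7.2 8.0 7.9
Jose Martinez     7 d 10-31-2012 8.9 9.510.0 9.7 9.0
;
run;
```

Both ways of writing it cover columns 17 - 19 before TYPE is read, so the resulting values of AGE should be identical; which one you prefer is largely a matter of taste.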
I think that, given the above explanations, you can see not only why your INPUT statement "without +1" fails, but also exactly in which way it does so. Each of the missing and non-missing values in your erroneous result table "TEST" can be explained (cf. RW9's pertinent explanations) by considering which columns are read into a certain variable and whether the content of these columns is valid data for that variable (check the SAS log for notes on "Invalid data").

However, I'm wondering if the SCORE3=510 for "Jose Martinez" really results from the INPUT statement you quote. I obtain 0.51 instead (which is plausible, because at this point columns 39 - 42 are read into SCORE3 and these contain the characters ".510").

Please note that RW9 uses list input (more precisely: modified list input, namely modified by the informats he assigned by means of an INFORMAT statement) for the SCOREn variables in his second data step. So, this is an example of what I referred to as mixing input styles within a single INPUT statement (here: formatted input and list input). For the list input he needs the blanks between the score columns, or other delimiters like the comma he uses in his third data step. The latter uses (partially modified) list input only, not formatted input.
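Mixing formatted and list input in one INPUT statement could look like the sketch below. Again, this is my own illustration and not RW9's code: I leave out his INFORMAT statement, so the scores are read with plain list input, the data set name is arbitrary, and in the placeholder data lines I have put a blank between 9.5 and 10.0, because list input depends on delimiters.

```sas
/* Sketch: formatted input for the first four variables,             */
/* list input (blank-delimited) for the five scores.                 */
data contest_mixed;
   infile datalines truncover;
   input Name $16.               /* formatted: columns  1-16         */
         Age  3.                 /* formatted: columns 17-19         */
         +1
         Type $1.                /* formatted: column  21            */
         +1
         Date mmddyy10.          /* formatted: columns 23-32         */
         Score1-Score5;          /* list input: scan for next values */
   format Date mmddyy10.;
   datalines;
Alicia Grossman  13 c 10-28-2012 7.8 6.5 7.2 8.0 7.9
Jose Martinez     7 d 10-31-2012 8.9 9.5 10.0 9.7 9.0
;
run;
```

With the original layout of the "Jose Martinez" line (no blank between 9.5 and 10.0), list input would see "9.510.0" as a single value, which is not a valid number; that is why the delimiters matter here but not for formatted input.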
[Minor edits of wording and formatting done.]