Hello DingDing,
First of all, you should be aware that there are four different styles of reading raw data with the INPUT statement: column, list, formatted and named input. These are described in the online documentation on the INPUT statement and in more detailed pages which are linked there. (It is even possible to mix two or more of these styles within a single INPUT statement.)
Working in different industries during the past 18 years, I have mostly used list or formatted input, less frequently column input and only rarely named input.
What you are using in your example is called formatted input. For understanding when and where (so called) column pointer controls such as the "+1" are required, it is important to know how the column pointer (a kind of invisible "cursor") moves across the raw data while the INPUT statement is executed. To quote the documentation on formatted input: "The pointer moves the length that the informat specifies and stops at the next column." Let's see what this means in practice by looking at your example (actually, RW9 did that already, while I was typing this lengthy post):
Detailed explanations for your INPUT statement "with +1":
At the start, the column pointer is located at the beginning of line 1. As you specified formatted input using informat $16. for reading variable NAME, the informat is applied to the first 16 characters ("columns") of line 1. This is because 16 is the length of that informat. Well, the first 16 characters contain the name "Alicia Grossman" (length=15) followed by a single blank. So, this is stored in variable NAME. (Apparently, the "$16." has been thoughtfully chosen in view of the longest name in the data, "Elizabeth Garcia".)
Next is AGE, read with informat 3., hence looking at the next 3 columns (no. 17 - 19). Thanks to the alignment in pumpkin.txt, all of these numbers are read correctly, in particular the "13" in line 1. Now, the column pointer is located at column 20 ("between" columns 19 and 20 if you like), but there is nothing of interest in column 20, only a blank, in all lines of the .txt file. In order to read the next relevant portion of the line (the single character "c" in column 21) into variable TYPE with informat $1., the column pointer must be moved forward by 1 column. This is what the "+1" pointer control does. Same situation after reading TYPE: The column pointer is located between columns 21 and 22, ready to continue reading at column 22. But the length-10 date value "10-28-2012" starts only in column 23! Therefore, the informat MMDDYY10. used for reading variable DATE would look at the wrong 10 columns (namely columns 22 - 31) if we didn't move the pointer again one position to the right (the second "+1")! Having read the date, the pointer rests between columns 32 and 33. Finally, 5 variables are to be read -- all with informat 4.1, which has length 4.
Side note:
Please note that using a w.d informat such as 4.1 is risky in case that possibly some values do not contain a decimal point. For example, if in your data the value 8.0 was written simply as 8, it would be read as 0.8 without further notice! This is because SAS would regard the rightmost d digits (here: d=1) as decimals. I strongly recommend to use informat 4. instead (and informat w. with appropriate width w in general), because it recognizes decimal points and will not cause this potential error.
[End of side note]
The content of columns 33 - 36, i.e. the number 7.8 preceded by a blank, is now available for reading, which is suitable content for the numeric variable SCORE1. Similarly, the remaining four blocks of four columns each (37 - 40, 41 - 44, 45 - 48 and 49 - 52) are read into variables SCORE2-SCORE5. The latter being the last variable in the INPUT statement, the pointer now moves to column 1 of line 2 and is ready for reading the next record in the same way.
Unlike the human eye, SAS is not at all confused by what looks like a "missing gap" between values 9.5 and 10.0 of "Jose Martinez". The characters " 9.5" and "10.0" belong to distinct blocks of columns: 37 - 40 and 41 - 44, respectively (see above). With formatted input there is no need to separate them.
Sometimes the use of pointer controls such as "+1" can be reduced by using correspondingly longer informats. You can see an example of this by comparing your first INPUT statement to that in RW9's first data step: He inserted a "+1" between NAME and AGE, because he reads AGE only with informat 2. rather than 3., so that the blank in column 17 must be skipped. You read this blank column into variable AGE, which does no harm and makes no difference to the stored numeric value.
I think, given the above explanations you can see not only why your INPUT statement "without +1" fails, but also exactly in which way it does so. You will see each of the missing and non-missing values in your erroneous result table "TEST" explained (cf. RW9's pertinent explanations), when you consider which columns are read into a certain variable and whether the content of these columns is valid data for that variable (check the SAS log for notes on "Invalid data"). However, I'm wondering if the SCORE3=510 for "Jose Martinez" results from the INPUT statement you quote. I obtain 0.51 instead (which is plausible, because at this point columns 39 - 42 are read into SCORE3 and these contain the characters ".510").
Please note that RW9 uses list input (more precisely: modified list input, namely modified by the informats he assigned by means of an INFORMAT statement) for the SCOREn variables in his second data step. So, this is an example of what I referred to as mixing input styles within a single INPUT statement (here: formatted input and list input). For the list input he needs the blanks between the score columns -- or other delimiters like the comma he uses in his third data step. The latter uses (partially modified) list input only, not formatted input.
[Minor edits of wording and formatting done.]
The problem is arising where you have a value with no gap in it. You are looking at a fixed width file here (personally I avoid these, delimited files are far easier to work with). Your data looks like:
data want; infile datalines; informat score1-score5 4.1; input name $16. +1 age 2. +1 type $1. +1 date mmddyy10. +1 score1-score5; datalines; Alicia Grossman 13 c 10-28-2012 7.8 6.5 7.2 8.0 7.9 Matthew Lee 9 D 10-30-2012 6.5 5.9 6.8 6.0 8.1 Elizabeth Garcia 10 C 10-29-2012 8.9 7.9 8.5 9.0 8.8 Lori Newcombe 6 D 10-30-2012 6.7 5.6 4.9 5.2 6.1 Jose Martinez 7 d 10-31-2012 8.9 9.510.0 9.7 9.0 Brian Williams 11 C 10-29-2012 7.8 8.4 8.5 7.9 8.0 ; run;
With a 1 where there should normally be a gap for Jose row. If you modify that slightly to be:
data want; infile datalines; informat score1-score5 4.1; input name $16. +1 age 2. +1 type $1. +1 date mmddyy10. +1 score1-score5; datalines; Alicia Grossman 13 c 10-28-2012 7.8 6.5 7.2 8.0 7.9 Matthew Lee 9 D 10-30-2012 6.5 5.9 6.8 6.0 8.1 Elizabeth Garcia 10 C 10-29-2012 8.9 7.9 8.5 9.0 8.8 Lori Newcombe 6 D 10-30-2012 6.7 5.6 4.9 5.2 6.1 Jose Martinez 7 d 10-31-2012 8.9 9.5 10.0 9.7 9.0 Brian Williams 11 C 10-29-2012 7.8 8.4 8.5 7.9 8.0 ; run;
It works fine.
As a note, the delimited file would like:
data want; infile datalines; input name $ age type $ date score1-score5; informat date mmddyy10.; format date date9.;
length name $16; datalines; Alicia Grossman,13,c,10-28-2012,7.8,6.5,7.2,8.0,7.9 Matthew Lee,9,D,10-30-2012,6.5,5.9,6.8,6.0,8.1 Elizabeth Garcia,10,C,10-29-2012,8.9,7.9,8.5,9.0,8.8 Lori Newcombe,6,D,10-30-2012,6.7,5.6,4.9,5.2,6.1 Jose Martinez,7,d,10-31-2012,8.9,9.5,10.0,9.7,9.0 Brian Williams,11,C,10-29-2012,7.8,8.4,8.5,7.9,8.0 ; run;
Much easier to code, and more flexible.
Thank you for your answer, but i still have some questions upon your answer
Q1, as you said " With a 1 where there should normally be a gap for Jose row. If you modify that slightly to be:",
where is the slight modification? I don't find any change on your second code example, and I run your former two code examples,
both return the same as below:
Q2, what does the dlm= "¬" mean?
Q3, as for your third code example, it does work if the columns are separated by a "," , but in my datafile (see attached), they are separated by a space, so I still don't know why it is correct only when I add +1 after the variables of age and type (even add +1 for all the variables, it doesn't work)?
Q1 - In your text file, the line for Jose has:
Jose Martinez 7 d 10-31-2012 8.9 9.510.0 9.7 9.0
You will see no space between 9.510.0. In my second example I put a space:
Jose Martinez 7 d 10-31-2012 8.9 9.5 10.0 9.7 9.0
And then it works.
Q2 - dlm= is an infile statement which allows you to specify the delimiter of a file, can be any character. You can probably omit that from this datastep, its just a habit that I always deal with delimited files, so always have it in.
Q3 - Yes, my suggestion is to get the file delimited from your source.
As for the +1, well its all about the length of the string coming in:
12345678901234567890123456789012345678901234567890123
Jose Martinez 7 d 10-31-2012 8.9 9.5 10.0 9.7 9.0
^ ^
So in your code, you are reading 16 characters for name, which takes the pointer up to the first ^. Then you read age which is 3 in your code. Then you read one character for type, but that is space, unless you put the +1 to move the pointer 1 forward, the D appears in the date read if not moved +1 forward. This is the concept of reading files in by using fixed length fields, you are moving a pointer across the string you are reading in, so input +5 name $, would move the pointer from the start of the string 5 characters in and then read into name. The problem is with that line of data, it doesn't like the fact there is no space between 9.5 and 10.0
Apologies, RW9,
I overlooked that you certainly meant to read score1-score5 in your data steps. (And thanks for the "Like"! :-))
Moreover, I share DingDing's confusion about the uncommon delimiter "¬". This equals 'AC'x, but I think it should be just a blank ('20'x) to make your second data step work (using modifed list input for the score variables). As this is the default delimiter, dlm=" " would actually be redundant.
Well, all dlm="¬" means is that it reads the whole line as that dlm is not found. In this instance it makes no difference if you have it in or not. Will update the code.
Hmm, when I test it (in a Windows SAS 9.4 session), the step with dlm="¬" fails with "Invalid data" notes (as it should, since the delimiter in the data is the blank), missing values and observations, but works fine with dlm=" " (and without using the dlm= option). I had assumed the "¬" character was just the result of some strange display issue or should indicate the blank between the quotation marks.
Sorry for being a pain, RW9,
I think you have inadvertently dropped the DLM="," option from your third data step, where it is required since comma is not the default delimiter. Also, the LENGTH statement does not work if you put it after the INPUT statement: at this point the length "has already been set" (to the insufficient default 8), as is pointed out by a warning in the log.
You need the +1 position modifier because your formats don't account for all character columns in the input lines. Where additional blanks are inserted, you need to "jump over", which you do with the +1.
Hello DingDing,
First of all, you should be aware that there are four different styles of reading raw data with the INPUT statement: column, list, formatted and named input. These are described in the online documentation on the INPUT statement and in more detailed pages which are linked there. (It is even possible to mix two or more of these styles within a single INPUT statement.)
Working in different industries during the past 18 years, I have mostly used list or formatted input, less frequently column input and only rarely named input.
What you are using in your example is called formatted input. For understanding when and where (so called) column pointer controls such as the "+1" are required, it is important to know how the column pointer (a kind of invisible "cursor") moves across the raw data while the INPUT statement is executed. To quote the documentation on formatted input: "The pointer moves the length that the informat specifies and stops at the next column." Let's see what this means in practice by looking at your example (actually, RW9 did that already, while I was typing this lengthy post):
Detailed explanations for your INPUT statement "with +1":
At the start, the column pointer is located at the beginning of line 1. As you specified formatted input using informat $16. for reading variable NAME, the informat is applied to the first 16 characters ("columns") of line 1. This is because 16 is the length of that informat. Well, the first 16 characters contain the name "Alicia Grossman" (length=15) followed by a single blank. So, this is stored in variable NAME. (Apparently, the "$16." has been thoughtfully chosen in view of the longest name in the data, "Elizabeth Garcia".)
Next is AGE, read with informat 3., hence looking at the next 3 columns (no. 17 - 19). Thanks to the alignment in pumpkin.txt, all of these numbers are read correctly, in particular the "13" in line 1. Now, the column pointer is located at column 20 ("between" columns 19 and 20 if you like), but there is nothing of interest in column 20, only a blank, in all lines of the .txt file. In order to read the next relevant portion of the line (the single character "c" in column 21) into variable TYPE with informat $1., the column pointer must be moved forward by 1 column. This is what the "+1" pointer control does. Same situation after reading TYPE: The column pointer is located between columns 21 and 22, ready to continue reading at column 22. But the length-10 date value "10-28-2012" starts only in column 23! Therefore, the informat MMDDYY10. used for reading variable DATE would look at the wrong 10 columns (namely columns 22 - 31) if we didn't move the pointer again one position to the right (the second "+1")! Having read the date, the pointer rests between columns 32 and 33. Finally, 5 variables are to be read -- all with informat 4.1, which has length 4.
Side note:
Please note that using a w.d informat such as 4.1 is risky in case that possibly some values do not contain a decimal point. For example, if in your data the value 8.0 was written simply as 8, it would be read as 0.8 without further notice! This is because SAS would regard the rightmost d digits (here: d=1) as decimals. I strongly recommend to use informat 4. instead (and informat w. with appropriate width w in general), because it recognizes decimal points and will not cause this potential error.
[End of side note]
The content of columns 33 - 36, i.e. the number 7.8 preceded by a blank, is now available for reading, which is suitable content for the numeric variable SCORE1. Similarly, the remaining four blocks of four columns each (37 - 40, 41 - 44, 45 - 48 and 49 - 52) are read into variables SCORE2-SCORE5. The latter being the last variable in the INPUT statement, the pointer now moves to column 1 of line 2 and is ready for reading the next record in the same way.
Unlike the human eye, SAS is not at all confused by what looks like a "missing gap" between values 9.5 and 10.0 of "Jose Martinez". The characters " 9.5" and "10.0" belong to distinct blocks of columns: 37 - 40 and 41 - 44, respectively (see above). With formatted input there is no need to separate them.
Sometimes the use of pointer controls such as "+1" can be reduced by using correspondingly longer informats. You can see an example of this by comparing your first INPUT statement to that in RW9's first data step: He inserted a "+1" between NAME and AGE, because he reads AGE only with informat 2. rather than 3., so that the blank in column 17 must be skipped. You read this blank column into variable AGE, which does no harm and makes no difference to the stored numeric value.
I think, given the above explanations you can see not only why your INPUT statement "without +1" fails, but also exactly in which way it does so. You will see each of the missing and non-missing values in your erroneous result table "TEST" explained (cf. RW9's pertinent explanations), when you consider which columns are read into a certain variable and whether the content of these columns is valid data for that variable (check the SAS log for notes on "Invalid data"). However, I'm wondering if the SCORE3=510 for "Jose Martinez" results from the INPUT statement you quote. I obtain 0.51 instead (which is plausible, because at this point columns 39 - 42 are read into SCORE3 and these contain the characters ".510").
Please note that RW9 uses list input (more precisely: modified list input, namely modified by the informats he assigned by means of an INFORMAT statement) for the SCOREn variables in his second data step. So, this is an example of what I referred to as mixing input styles within a single INPUT statement (here: formatted input and list input). For the list input he needs the blanks between the score columns -- or other delimiters like the comma he uses in his third data step. The latter uses (partially modified) list input only, not formatted input.
[Minor edits of wording and formatting done.]
example from little sas book, you are an excenllent tutor FreelanceReinhard 🙂
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.