BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
DingDing
Quartz | Level 8
 
I try to use INPUT to read raw data, but I found that it should use the +1 correctly then SAS could read the data correctly, for the below example, why it is correct only when I add +1 after the variables of age and type? Could anyone explain for me that how do I know when to add +1 to a skip a column from the dateset?
 
with+1 after columns age and type
 
INPUT name $16. age 3. + 1 type $1. +1 date mmddyy10. (score1 score2 score3 score4 score5) (4.1 );
 
ScreenClip.png
without +1
INPUT name $16. age 3.  type $1.  date mmddyy10. (score1 score2 score3 score4 score5) (4.1 );
So when to use +1
ScreenClip.png

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello DingDing,

First of all, you should be aware that there are four different styles of reading raw data with the INPUT statement: column, list, formatted and named input. These are described in the online documentation on the INPUT statement and in more detailed pages which are linked there. (It is even possible to mix two or more of these styles within a single INPUT statement.)

Working in different industries during the past 18 years, I have mostly used list or formatted input, less frequently column input and only rarely named input.

What you are using in your example is called formatted input. For understanding when and where (so called) column pointer controls such as the "+1" are required, it is important to know how the column pointer (a kind of invisible "cursor") moves across the raw data while the INPUT statement is executed. To quote the documentation on formatted input: "The pointer moves the length that the informat specifies and stops at the next column." Let's see what this means in practice by looking at your example (actually, RW9 did that already, while I was typing this lengthy post):

Detailed explanations for your INPUT statement "with +1":

At the start, the column pointer is located at the beginning of line 1. As you specified formatted input using informat $16. for reading variable NAME, the informat is applied to the first 16 characters ("columns") of line 1. This is because 16 is the length of that informat. Well, the first 16 characters contain the name "Alicia Grossman" (length=15) followed by a single blank. So, this is stored in variable NAME. (Apparently, the "$16." has been thoughtfully chosen in view of the longest name in the data, "Elizabeth Garcia".)

Next is AGE, read with informat 3., hence looking at the next 3 columns (no. 17 - 19). Thanks to the alignment in pumpkin.txt, all of these numbers are read correctly, in particular the "13" in line 1. Now, the column pointer is located at column 20 ("between" columns 19 and 20 if you like), but there is nothing of interest in column 20, only a blank, in all lines of the .txt file. In order to read the next relevant portion of the line (the single character "c" in column 21) into variable TYPE with informat $1., the column pointer must be moved forward by 1 column. This is what the "+1" pointer control does. Same situation after reading TYPE: The column pointer is located between columns 21 and 22, ready to continue reading at column 22. But the length-10 date value "10-28-2012" starts only in column 23! Therefore, the informat MMDDYY10. used for reading variable DATE would look at the wrong 10 columns (namely columns 22 - 31) if we didn't move the pointer again one position to the right (the second "+1")! Having read the date, the pointer rests between columns 32 and 33. Finally, 5 variables are to be read -- all with informat 4.1, which has length 4.

Side note:
Please note that using a w.d informat such as 4.1 is risky in case that possibly some values do not contain a decimal point. For example, if in your data the value 8.0 was written simply as 8, it would be read as 0.8 without further notice! This is because SAS would regard the rightmost d digits (here: d=1) as decimals. I strongly recommend to use informat 4. instead (and informat w. with appropriate width w in general), because it recognizes decimal points and will not cause this potential error.
[End of side note]

The content of columns 33 - 36, i.e. the number 7.8 preceded by a blank, is now available for reading, which is suitable content for the numeric variable SCORE1. Similarly, the remaining four blocks of four columns each (37 - 40, 41 - 44, 45 - 48 and 49 - 52) are read into variables SCORE2-SCORE5. The latter being the last variable in the INPUT statement, the pointer now moves to column 1 of line 2 and is ready for reading the next record in the same way.

Unlike the human eye, SAS is not at all confused by what looks like a "missing gap" between values 9.5 and 10.0 of "Jose Martinez". The characters " 9.5" and "10.0" belong to distinct blocks of columns: 37 - 40 and 41 - 44, respectively (see above). With formatted input there is no need to separate them.

Sometimes the use of pointer controls such as "+1" can be reduced by using correspondingly longer informats. You can see an example of this by comparing your first INPUT statement to that in RW9's first data step: He inserted a "+1" between NAME and AGE, because he reads AGE only with informat 2. rather than 3., so that the blank in column 17 must be skipped. You read this blank column into variable AGE, which does no harm and makes no difference to the stored numeric value.

 

I think, given the above explanations you can see not only why your INPUT statement "without +1" fails, but also exactly in which way it does so. You will see each of the missing and non-missing values in your erroneous result table "TEST" explained (cf. RW9's pertinent explanations), when you consider which columns are read into a certain variable and whether the content of these columns is valid data for that variable (check the SAS log for notes on "Invalid data"). However, I'm wondering if the SCORE3=510 for "Jose Martinez" results from the INPUT statement you quote. I obtain 0.51 instead (which is plausible, because at this point columns 39 - 42 are read into SCORE3 and these contain the characters ".510").

Please note that RW9 uses list input (more precisely: modified list input, namely modified by the informats he assigned by means of an INFORMAT statement) for the SCOREn variables in his second data step. So, this is an example of what I referred to as mixing input styles within a single INPUT statement (here: formatted input and list input). For the list input he needs the blanks between the score columns -- or other delimiters like the comma he uses in his third data step. The latter uses (partially modified) list input only, not formatted input.

 

[Minor edits of wording and formatting done.]

View solution in original post

13 REPLIES 13
RW9
Diamond | Level 26 RW9
Diamond | Level 26

The problem is arising where you have a value with no gap in it.  You are looking at a fixed width file here (personally I avoid these, delimited files are far easier to work with).  Your data looks like:

data want;
  infile datalines;
  informat score1-score5 4.1;
  input name $16. +1 age 2. +1 type $1. +1 date mmddyy10. +1 score1-score5;
datalines;
Alicia Grossman  13 c 10-28-2012 7.8 6.5 7.2 8.0 7.9
Matthew Lee       9 D 10-30-2012 6.5 5.9 6.8 6.0 8.1
Elizabeth Garcia 10 C 10-29-2012 8.9 7.9 8.5 9.0 8.8
Lori Newcombe     6 D 10-30-2012 6.7 5.6 4.9 5.2 6.1
Jose Martinez     7 d 10-31-2012 8.9 9.510.0 9.7 9.0
Brian Williams   11 C 10-29-2012 7.8 8.4 8.5 7.9 8.0
;
run;

With a 1 where there should normally be a gap for Jose row.  If you modify that slightly to be:

data want;
  infile datalines;
  informat score1-score5 4.1;
  input name $16. +1 age 2. +1 type $1. +1 date mmddyy10. +1 score1-score5;
datalines;
Alicia Grossman  13 c 10-28-2012 7.8 6.5 7.2 8.0 7.9
Matthew Lee       9 D 10-30-2012 6.5 5.9 6.8 6.0 8.1
Elizabeth Garcia 10 C 10-29-2012 8.9 7.9 8.5 9.0 8.8
Lori Newcombe     6 D 10-30-2012 6.7 5.6 4.9 5.2 6.1
Jose Martinez     7 d 10-31-2012 8.9 9.5 10.0 9.7 9.0
Brian Williams   11 C 10-29-2012 7.8 8.4 8.5 7.9 8.0
;
run;

It works fine. 

As a note, the delimited file would like:

data want;
  infile datalines;
  input name $ age type $ date score1-score5;
  informat date mmddyy10.;
  format date date9.;
length name $16; datalines; Alicia Grossman,13,c,10-28-2012,7.8,6.5,7.2,8.0,7.9 Matthew Lee,9,D,10-30-2012,6.5,5.9,6.8,6.0,8.1 Elizabeth Garcia,10,C,10-29-2012,8.9,7.9,8.5,9.0,8.8 Lori Newcombe,6,D,10-30-2012,6.7,5.6,4.9,5.2,6.1 Jose Martinez,7,d,10-31-2012,8.9,9.5,10.0,9.7,9.0 Brian Williams,11,C,10-29-2012,7.8,8.4,8.5,7.9,8.0 ; run;

Much easier to code, and more flexible.

DingDing
Quartz | Level 8

Thank you for your answer, but i still have some questions upon your answer

 

Q1, as you said " With a 1 where there should normally be a gap for Jose row.  If you modify that slightly to be:",

where is the slight modification? I don't find any change on your second code example, and I run your former two code examples,

both return the same as below:

ScreenClip.png

 

Q2, what does the dlm= "¬" mean?

 

Q3, as for your third code example, it does work if the columns are separated by a "," , but in my datafile (see attached), they are separated by a space, so I still don't know why it is correct only when I add +1 after the variables of age and type (even add +1 for all the variables, it doesn't work)?

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Q1 - In your text file, the line for Jose has:

Jose Martinez     7 d 10-31-2012 8.9 9.510.0 9.7 9.0

You will see no space between 9.510.0.  In my second example I put a space:

 

Jose Martinez     7 d 10-31-2012 8.9 9.5 10.0 9.7 9.0

And then it works. 

Q2 - dlm= is an infile statement which allows you to specify the delimiter of a file, can be any character.  You can probably omit that from this datastep, its just a habit that I always deal with delimited files, so always have it in.

 

Q3 - Yes, my suggestion is to get the file delimited from your source.

As for the +1, well its all about the length of the string coming in:

12345678901234567890123456789012345678901234567890123
Jose Martinez 7 d 10-31-2012 8.9 9.5 10.0 9.7 9.0
^ ^

So in your code, you are reading 16 characters for name, which takes the pointer up to the first ^.  Then you read age which is 3 in your code.  Then you read one character for type, but that is space, unless you put the +1 to move the pointer 1 forward, the D appears in the date read if not moved +1 forward.  This is the concept of reading files in by using fixed length fields, you are moving a pointer across the string you are reading in, so input +5 name $, would move the pointer from the start of the string 5 characters in and then read into name.  The problem is with that line of data, it doesn't like the fact there is no space between 9.5 and 10.0

 

FreelanceReinh
Jade | Level 19
Hello RW9,
you may want to insert a LENGTH statement for variable NAME in your third data step. Otherwise, NAME will be truncated to the default 8 characters.
FreelanceReinh
Jade | Level 19

Apologies, RW9,


I overlooked that you certainly meant to read score1-score5 in your data steps. (And thanks for the "Like"! :-))

 

Moreover, I share DingDing's confusion about the uncommon delimiter "¬". This equals 'AC'x, but I think it should be just a blank ('20'x) to make your second data step work (using modifed list input for the score variables). As this is the default delimiter, dlm=" " would actually be redundant.

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, all dlm="¬" means is that it reads the whole line as that dlm is not found. In this instance it makes no difference if you have it in or not.  Will update the code.

FreelanceReinh
Jade | Level 19

Hmm, when I test it (in a Windows SAS 9.4 session), the step with dlm="¬" fails with "Invalid data" notes (as it should, since the delimiter in the data is the blank), missing values and observations, but works fine with dlm=" " (and without using the dlm= option). I had assumed the "¬" character was just the result of some strange display issue or should indicate the blank between the quotation marks.

FreelanceReinh
Jade | Level 19

Sorry for being a pain, RW9,

 

I think you have inadvertently dropped the DLM="," option from your third data step, where it is required since comma is not the default delimiter. Also, the LENGTH statement does not work if you put it after the INPUT statement: at this point the length "has already been set" (to the insufficient default 8), as is pointed out by a warning in the log.

 

Kurt_Bremser
Super User

You need the +1 position modifier because your formats don't account for all character columns in the input lines. Where additional blanks are inserted, you need to "jump over", which you do with the +1.

FreelanceReinh
Jade | Level 19

Hello DingDing,

First of all, you should be aware that there are four different styles of reading raw data with the INPUT statement: column, list, formatted and named input. These are described in the online documentation on the INPUT statement and in more detailed pages which are linked there. (It is even possible to mix two or more of these styles within a single INPUT statement.)

Working in different industries during the past 18 years, I have mostly used list or formatted input, less frequently column input and only rarely named input.

What you are using in your example is called formatted input. For understanding when and where (so called) column pointer controls such as the "+1" are required, it is important to know how the column pointer (a kind of invisible "cursor") moves across the raw data while the INPUT statement is executed. To quote the documentation on formatted input: "The pointer moves the length that the informat specifies and stops at the next column." Let's see what this means in practice by looking at your example (actually, RW9 did that already, while I was typing this lengthy post):

Detailed explanations for your INPUT statement "with +1":

At the start, the column pointer is located at the beginning of line 1. As you specified formatted input using informat $16. for reading variable NAME, the informat is applied to the first 16 characters ("columns") of line 1. This is because 16 is the length of that informat. Well, the first 16 characters contain the name "Alicia Grossman" (length=15) followed by a single blank. So, this is stored in variable NAME. (Apparently, the "$16." has been thoughtfully chosen in view of the longest name in the data, "Elizabeth Garcia".)

Next is AGE, read with informat 3., hence looking at the next 3 columns (no. 17 - 19). Thanks to the alignment in pumpkin.txt, all of these numbers are read correctly, in particular the "13" in line 1. Now, the column pointer is located at column 20 ("between" columns 19 and 20 if you like), but there is nothing of interest in column 20, only a blank, in all lines of the .txt file. In order to read the next relevant portion of the line (the single character "c" in column 21) into variable TYPE with informat $1., the column pointer must be moved forward by 1 column. This is what the "+1" pointer control does. Same situation after reading TYPE: The column pointer is located between columns 21 and 22, ready to continue reading at column 22. But the length-10 date value "10-28-2012" starts only in column 23! Therefore, the informat MMDDYY10. used for reading variable DATE would look at the wrong 10 columns (namely columns 22 - 31) if we didn't move the pointer again one position to the right (the second "+1")! Having read the date, the pointer rests between columns 32 and 33. Finally, 5 variables are to be read -- all with informat 4.1, which has length 4.

Side note:
Please note that using a w.d informat such as 4.1 is risky in case that possibly some values do not contain a decimal point. For example, if in your data the value 8.0 was written simply as 8, it would be read as 0.8 without further notice! This is because SAS would regard the rightmost d digits (here: d=1) as decimals. I strongly recommend to use informat 4. instead (and informat w. with appropriate width w in general), because it recognizes decimal points and will not cause this potential error.
[End of side note]

The content of columns 33 - 36, i.e. the number 7.8 preceded by a blank, is now available for reading, which is suitable content for the numeric variable SCORE1. Similarly, the remaining four blocks of four columns each (37 - 40, 41 - 44, 45 - 48 and 49 - 52) are read into variables SCORE2-SCORE5. The latter being the last variable in the INPUT statement, the pointer now moves to column 1 of line 2 and is ready for reading the next record in the same way.

Unlike the human eye, SAS is not at all confused by what looks like a "missing gap" between values 9.5 and 10.0 of "Jose Martinez". The characters " 9.5" and "10.0" belong to distinct blocks of columns: 37 - 40 and 41 - 44, respectively (see above). With formatted input there is no need to separate them.

Sometimes the use of pointer controls such as "+1" can be reduced by using correspondingly longer informats. You can see an example of this by comparing your first INPUT statement to that in RW9's first data step: He inserted a "+1" between NAME and AGE, because he reads AGE only with informat 2. rather than 3., so that the blank in column 17 must be skipped. You read this blank column into variable AGE, which does no harm and makes no difference to the stored numeric value.

 

I think, given the above explanations you can see not only why your INPUT statement "without +1" fails, but also exactly in which way it does so. You will see each of the missing and non-missing values in your erroneous result table "TEST" explained (cf. RW9's pertinent explanations), when you consider which columns are read into a certain variable and whether the content of these columns is valid data for that variable (check the SAS log for notes on "Invalid data"). However, I'm wondering if the SCORE3=510 for "Jose Martinez" results from the INPUT statement you quote. I obtain 0.51 instead (which is plausible, because at this point columns 39 - 42 are read into SCORE3 and these contain the characters ".510").

Please note that RW9 uses list input (more precisely: modified list input, namely modified by the informats he assigned by means of an INFORMAT statement) for the SCOREn variables in his second data step. So, this is an example of what I referred to as mixing input styles within a single INPUT statement (here: formatted input and list input). For the list input he needs the blanks between the score columns -- or other delimiters like the comma he uses in his third data step. The latter uses (partially modified) list input only, not formatted input.

 

[Minor edits of wording and formatting done.]

DingDing
Quartz | Level 8
Thank you so much. Dear FreelanceReinhard, you just like my private tutor. Appreciated.
sz79
Calcite | Level 5

example from little sas book, you are an excenllent tutor FreelanceReinhard 🙂

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 4864 views
  • 10 likes
  • 5 in conversation