Solved: importing a txt file with consecutive delimiters (**)

morenayan · Posted 04-14-2024 12:13 PM

Basically my question is how to understand "DSD" "COLON MODIFIER", because I'm facing with some problems when importing a txt file

here are my raw data, experiment codes 1-4 and their results.

1. My raw data is like (in a txt file names t24.txt as an attachment)

Susan*12/29/1970*10
Michael**6

2. The following output is desired:

Obs employee bdate years

1 Susan 4015 10
2 Michael . 6

3. experiment codes 1-4 and their results

I tried different codes to solve this problem (from CODE1-4 as follows), and I feel confused about the results.

data t24ex1; /* CODE1 */
infile "&path/t24.txt" dsd dlm='*';
input employee $ bdate mmddyy10. years $;
run;

data t24ex2; /* CODE2 */
infile "&path/t24.txt" dlm='*';
input employee $ bdate mmddyy10. years $;
run;

data t24ex3; /* CODE3 */
infile "&path/t24.txt" dlm='*';
input employee $ bdate :mmddyy10. years $;
run;

data t24ex4; /* CODE4 */
infile "&path/t24.txt" dsd dlm='*';
input employee $ bdate :mmddyy10. years $;
run;

results:

4. My confusion

as far as I know,

1) DSD can treat 2 consecutive separators as one missing value

e.g. a,b,,c => a,b,missing value,c

2) COLON can stop until meeting the next one separator

e.g.  My dog Sam Breed:Rottweiler Vet Bills: $478 (raw data)
INPUT @' Breed:' DogBreed $20.; ----->    Rottweiler Vet Bill
INPUT @'Breed:' DogBreed : $20.; ----->   Rottweiler

So when I used CODE1, I was like assuming the dataset would become

              "Susan", "12/29/1970"(=4015), and "10"
              "Michael", .(missing), "6"

but it didn't work in this way, actually the result of the first row was

               Susan 4015 .

And it didn't generate the result of the second row because of an error in log:

              NOTE: LOST CARD.
              employee=Michael bdate=. years=  _ERROR_=1 _N_=2

Even though the problem of the second row could be solved by adding a COLON modifier(as CODE4), I couldn't understand how it worked ...

My environment is sas studio.

Thank you for your time, I'm a beginner for SAS, it must be that I have a misunderstanding of SAS, I would be very grateful if you could point it out😘

Tom · Posted 04-14-2024 01:11 PM

Simplest thing is to understand the three basic modes of the INPUT statement.

The default is LIST MODE.

input employee bdate years ;

For each variable listed it reads the next "word" on the line.

There is also FORMATTED MODE were you follow each variable with the informat specification to use. In that case it reads the number of bytes specified (or implied) by the informat specification.

input employee $8. bdate mmddyy10. years 5. ;

And COLUMN MODE where you specify the range of columns to read for each variable.

input employee $ 1-8 bdate 9-18 years 19-23 ;

The problem with your first data step is that you are reading two of the variables in LIST MODE and one in FORMATTED MODE. So the current column pointer has to be exactly under the first character of the date value for the informat to work right. If it is one to the left it will include the * before the date. If it is one to the right it will miss the first digit of the month and include the * to the right. And if you date values ever have less than 10 bytes (say they have only one digit for month or day of month) then there is no way that reading 10 characters can avoid reading one or more of the * characters.

The COLON modifier on the INPUT statement allows you to include an informat specification in-line in the INPUT statement, but without switching from LIST MODE to FORMATTED MODE input when reading that variable.

Also note that when reading a value in LIST MODE the width attached to the informat is ignored. The whole next "word" on the line is read and passed through the informat. You could think of it as adjusting the width of the informat specification to match the width of the word.

So you just need:

input employee $ bdate :mmddyy. years ;

Note that the bare $ modifiers in your INPUT statement is only needed when you have not previously defined the variables as character. SAS will default variables to numeric if there is no information it can use to decide otherwise.

So to be clear you should define the variables before referencing them in the INPUT statement. That way the INPUT statement itself just needs to list the variables in the right order.

data want;
  infile "&path/t24.txt" dsd dlm='*' truncover;
  length employee $8 bdate years 8 ;
  input employee bdate years ;
  informat bdate mmddyy. ;
  format bdate yymmdd10.;
run;

And if you define the variables in the order they appear in the source lines then you can use a positional variable list.

input employee -- years ;

If you want to play with learning how INPUT moves the current column pointer you can add the COLUMN= option to your INFILE statement and then check the value between the reads.

data test;
  infile cards dsd dlm='*' truncover column=cc;
  put 'BEFORE employee ' cc= ;
  input employee $ @ ;
  put 'BEFORE bdate ' cc= ;
  input bdate mmddyy10. @ ;
  put 'BEFORE years ' cc= ;
  input years $ @;
  put 'AFTER years ' cc= ;
cards;
Susan*12/29/1970*10
Michael**6
;

View solution in original post

Tom · Posted 04-14-2024 01:11 PM

Simplest thing is to understand the three basic modes of the INPUT statement.

The default is LIST MODE.

input employee bdate years ;

For each variable listed it reads the next "word" on the line.

There is also FORMATTED MODE were you follow each variable with the informat specification to use. In that case it reads the number of bytes specified (or implied) by the informat specification.

input employee $8. bdate mmddyy10. years 5. ;

And COLUMN MODE where you specify the range of columns to read for each variable.

input employee $ 1-8 bdate 9-18 years 19-23 ;

The problem with your first data step is that you are reading two of the variables in LIST MODE and one in FORMATTED MODE. So the current column pointer has to be exactly under the first character of the date value for the informat to work right. If it is one to the left it will include the * before the date. If it is one to the right it will miss the first digit of the month and include the * to the right. And if you date values ever have less than 10 bytes (say they have only one digit for month or day of month) then there is no way that reading 10 characters can avoid reading one or more of the * characters.

The COLON modifier on the INPUT statement allows you to include an informat specification in-line in the INPUT statement, but without switching from LIST MODE to FORMATTED MODE input when reading that variable.

Also note that when reading a value in LIST MODE the width attached to the informat is ignored. The whole next "word" on the line is read and passed through the informat. You could think of it as adjusting the width of the informat specification to match the width of the word.

So you just need:

input employee $ bdate :mmddyy. years ;

Note that the bare $ modifiers in your INPUT statement is only needed when you have not previously defined the variables as character. SAS will default variables to numeric if there is no information it can use to decide otherwise.

So to be clear you should define the variables before referencing them in the INPUT statement. That way the INPUT statement itself just needs to list the variables in the right order.

data want;
  infile "&path/t24.txt" dsd dlm='*' truncover;
  length employee $8 bdate years 8 ;
  input employee bdate years ;
  informat bdate mmddyy. ;
  format bdate yymmdd10.;
run;

And if you define the variables in the order they appear in the source lines then you can use a positional variable list.

input employee -- years ;

If you want to play with learning how INPUT moves the current column pointer you can add the COLUMN= option to your INFILE statement and then check the value between the reads.

data test;
  infile cards dsd dlm='*' truncover column=cc;
  put 'BEFORE employee ' cc= ;
  input employee $ @ ;
  put 'BEFORE bdate ' cc= ;
  input bdate mmddyy10. @ ;
  put 'BEFORE years ' cc= ;
  input years $ @;
  put 'AFTER years ' cc= ;
cards;
Susan*12/29/1970*10
Michael**6
;

Tom · Posted 04-14-2024 01:26 PM

Note that when reading delimited files where every line represents a single observation you should include the TRUNCOVER option on the INFILE statement.

You should also learn the difference between the default FLOWOVER and the ancient MISSOVER (almost never the option you want) and the modern (probably only 30+ years old) TRUNCOVER options of the INFILE statement.

Try this example:

options parmcards=example;
filename example temp;
parmcards;
Susan*12/29/1970*10
Michael**6
;

data flowover;
  infile example dsd dlm='*' flowover;
  input employee $ bdate :mmddyy. years 10. ;
  format bdate yymmdd10.;
run;

data missover;
  infile example dsd dlm='*' missover;
  input employee $ bdate :mmddyy. years 10. ;
  format bdate yymmdd10.;
run;

data truncover;
  infile example dsd dlm='*' truncover;
  input employee $ bdate :mmddyy. years 10. ;
  format bdate yymmdd10.;
run;

data all;
  length type $10;
  set flowover missover truncover indsname=dsname;
  type=scan(dsname,-1,'.');
run;

proc print;
run;

morenayan · Posted 04-14-2024 11:52 PM

thank you sir, got it, very clearly! have a good day:)

importing a txt file with consecutive delimiters (**)

Re: importing a txt file with consecutive delimiters (**)

Re: importing a txt file with consecutive delimiters (**)

Re: importing a txt file with consecutive delimiters (**)

Re: importing a txt file with consecutive delimiters (**)