BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Zula
Obsidian | Level 7

Hi.

I wanted to import text file into SAS, but this file has 62 variables and tab and line spaced (see an example data with only a few variables). How can I read this file correctly into SAS if I do not know the variables' format and length?

Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Looks like some process has inserted random line breaks at around 256 characters.  Probably some type of misguided download process or strange file specification on the program that made the file.

 

But it looks like you can FIX the file to have one line per observation by just counting the number of tabs that you have read so far.

FILENAME rawinput "c:\downloads\test.txt" ;
filename fixed temp;

data _null_;
  infile rawinput termstr=crlf end=eof;
  file fixed termstr=crlf ;
  do until(eof or ntabs > 60);
    input ;
    put _infile_ @ ;
    ntabs = sum(ntabs,countc(_infile_,'09'x));
  end;
  put;
  putlog _n_= ntabs=;
run;

Now you can read the new file using normal code.  For example here is your program fixed up.

  1. Uses DSD option on INFILE statement to properly parse when there are missing values.
  2. Removes the old style MISSOVER option (which was being overridden by the TRUNCOVER option that followed it anyway).
  3. Uses LENGTH to define the variables instead of attaching unneeded informats. That way SAS does not have to guess that it how it should define the variables by assuming that the width of the informat that has been attached should match the length of the variable to be created.
  4. Simplifies the INPUT statement since the variables have already been defined in the order they appear in the file.
data new;
  infile fixed dsd dlm='09'x firstobs=2 termstr=crlf truncover ;
  length   
    DOD_YR $4 
    DSTATE $2 
    FILENO $6 
    GNAME $50 
    MNAME $1 
    LNAME $50 
    SEX $1 
    SSN $9 
    AGETYPE $1 
    AGE $3 
    DOB_YR $4 
    DOB_MO $2 
    DOB_DY $2 
    BPLACE_CNT $2 
    BPLACE_ST $2 
    STATEC $2 
    MARITAL $1 
    DPLACE $1 
    DOD_MO $2 
    DOD_DY $2 
    DEDUC $1 
    OCCUP $40 
    OCCUPC $3 
    INDUST $40 
    INDUSTC $3 
    ACME_UC $5 
    AUTOP $1 
    CERTL $30 
    DINSTI $30 
    STNUM_R $10 
    STNAME_R $50 
    STDESIG_R $10 
    POSTDIR_R $10 
    UNITNUM_R $7 
    CITYTEXT_R $28 
    ZIP9_R $9 
    STATETEXT_R $28 
    ADDRESS_R $50 
    DETHNICE $3 
    NCHSBRIDGE $2 
    HISPOLDC $1 
    RACEOLDC $1 
    DMIDDLE $50 
    COD1A $120 
    INTERVAL1A $20 
    COD1B $120 
    INTERVAL1B $20 
    COD1C $120 
    INTERVAL1C $20 
    COD1D $120 
    INTERVAL1D $20 
    OTHERCONDITION $240 
    CERTFIRST $50 
    CERTMIDDLE $50 
    CERTLAST $50 
    MEC_YR $4 
    MEC_MO $2 
    MEC_DAY $2 
    MEC_NUMBER $8 
    AGE_CALC $3 
    MAR_WARD $6 
    MAR_CENSUS_TRACT $9 
  ;
  input DOD_YR -- MAR_CENSUS_TRACT ;
run;
NOTE: 3 records were read from the infile FIXED.
      The minimum record length was 313.
      The maximum record length was 381.
NOTE: The data set WORK.NEW has 3 observations and 62 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.03 seconds

 

View solution in original post

13 REPLIES 13
tomrvincent
Rhodochrosite | Level 12
Is it always going to be a pair of lines defining a record?
Zula
Obsidian | Level 7

Nope, very random. Some variables with a long string and short depending on the specific ID.

tomrvincent
Rhodochrosite | Level 12
So no way to know when one record ends and a new one begins.

I'd bounce the file and tell the supplier to provide it in a standard format and structure.
ballardw
Super User

@tomrvincent wrote:
So no way to know when one record ends and a new one begins.

I'd bounce the file and tell the supplier to provide it in a standard format and structure.

Agree.

 

And provide details of how you, as a recipient are supposed to know; 1) where a record starts, 2) the contents and 3) expected formats and lengths values.

Zula
Obsidian | Level 7
I have informat for all variables, but it is still not working. I think there are many line breaks for many string variables. I used truncover missover, but those do not help.
Tom
Super User Tom
Super User

We probably need more examples to see if there is a pattern.

 

The example you showed seems to have two header lines and two data lines.

Do all of the data lines take exactly two lines?

 

Do all of the files have that structure?

Do other files have a different number of lines of headers?

Does the number of lines of data per observation always match the number of header lines?

If the number of variables changes between files is it constant per file?

Zula
Obsidian | Level 7


FILENAME rawinput "\\ZULA\test.txt" ;

data new;
retain DOD_YR;
infile rawinput dlm='09'X firstobs=2 missover truncover ;
informat DOD_YR $4.;
informat DSTATE $2.;
informat FILENO $6.;
informat GNAME $50.;
informat MNAME $1.;
informat LNAME $50.;
informat SEX $1.;
informat SSN $9.;
informat AGETYPE $1.;
informat AGE $3.;
informat DOB_YR $4.;
informat DOB_MO $2.;
informat DOB_DY $2.;
informat BPLACE_CNT $2.;
informat BPLACE_ST $2.;
informat STATEC $2.;
informat MARITAL $1.;
informat DPLACE $1.;
informat DOD_MO $2.;
informat DOD_DY $2.;
informat DEDUC $1.;
informat OCCUP $40.;
informat OCCUPC $3.;
informat INDUST $40.;
informat INDUSTC $3.;
informat ACME_UC $5.;
informat AUTOP $1.;
informat CERTL $30.;
informat DINSTI $30.;
informat STNUM_R $10.;
informat STNAME_R $50.;
informat STDESIG_R $10.;
informat POSTDIR_R $10.;
informat UNITNUM_R $7.;
informat CITYTEXT_R $28.;
informat ZIP9_R $9.;
informat STATETEXT_R $28.;
informat ADDRESS_R $50.;
informat DETHNICE $3.;
informat NCHSBRIDGE $2.;
informat HISPOLDC $1.;
informat RACEOLDC $1.;
informat DMIDDLE $50.;
informat COD1A $120.;
informat INTERVAL1A $20.;
informat COD1B $120.;
informat INTERVAL1B $20.;
informat COD1C $120.;
informat INTERVAL1C $20.;
informat COD1D $120.;
informat INTERVAL1D $20.;
informat OTHERCONDITION $240.;
informat CERTFIRST $50.;
informat CERTMIDDLE $50.;
informat CERTLAST $50.;
informat MEC_YR $4.;
informat MEC_MO $2.;
informat MEC_DAY $2.;
informat MEC_NUMBER $8.;
informat AGE_CALC $3.;
informat MAR_WARD $6.;
informat MAR_CENSUS_TRACT $9.;
;
input
DOD_YR $ DSTATE $ FILENO $
GNAME $ MNAME $ LNAME $
SEX $ SSN $ AGETYPE $ AGE $
DOB_YR $ DOB_MO $ DOB_DY $
BPLACE_CNT $ BPLACE_ST $
STATEC $ MARITAL $
DPLACE $ DOD_MO $ DOD_DY $
DEDUC $ OCCUP $ OCCUPC $ INDUST $
INDUSTC $
ACME_UC $
AUTOP $
CERTL $
DINSTI $
STNUM_R $
STNAME_R $
STDESIG_R $
POSTDIR_R $
UNITNUM_R $
CITYTEXT_R $
ZIP9_R $
STATETEXT_R $
ADDRESS_R $
DETHNICE $
NCHSBRIDGE $
HISPOLDC $
RACEOLDC $
DMIDDLE $
COD1A $
INTERVAL1A $
COD1B $
INTERVAL1B $
COD1C $
INTERVAL1C $
COD1D $
INTERVAL1D $
OTHERCONDITION $
CERTFIRST $
CERTMIDDLE $
CERTLAST $
MEC_YR $
MEC_MO $
MEC_DAY $
MEC_NUMBER $
AGE_CALC $
MAR_WARD $
MAR_CENSUS_TRACT $ ;
run;

 

Tom
Super User Tom
Super User

Looks like some process has inserted random line breaks at around 256 characters.  Probably some type of misguided download process or strange file specification on the program that made the file.

 

But it looks like you can FIX the file to have one line per observation by just counting the number of tabs that you have read so far.

FILENAME rawinput "c:\downloads\test.txt" ;
filename fixed temp;

data _null_;
  infile rawinput termstr=crlf end=eof;
  file fixed termstr=crlf ;
  do until(eof or ntabs > 60);
    input ;
    put _infile_ @ ;
    ntabs = sum(ntabs,countc(_infile_,'09'x));
  end;
  put;
  putlog _n_= ntabs=;
run;

Now you can read the new file using normal code.  For example here is your program fixed up.

  1. Uses DSD option on INFILE statement to properly parse when there are missing values.
  2. Removes the old style MISSOVER option (which was being overridden by the TRUNCOVER option that followed it anyway).
  3. Uses LENGTH to define the variables instead of attaching unneeded informats. That way SAS does not have to guess that it how it should define the variables by assuming that the width of the informat that has been attached should match the length of the variable to be created.
  4. Simplifies the INPUT statement since the variables have already been defined in the order they appear in the file.
data new;
  infile fixed dsd dlm='09'x firstobs=2 termstr=crlf truncover ;
  length   
    DOD_YR $4 
    DSTATE $2 
    FILENO $6 
    GNAME $50 
    MNAME $1 
    LNAME $50 
    SEX $1 
    SSN $9 
    AGETYPE $1 
    AGE $3 
    DOB_YR $4 
    DOB_MO $2 
    DOB_DY $2 
    BPLACE_CNT $2 
    BPLACE_ST $2 
    STATEC $2 
    MARITAL $1 
    DPLACE $1 
    DOD_MO $2 
    DOD_DY $2 
    DEDUC $1 
    OCCUP $40 
    OCCUPC $3 
    INDUST $40 
    INDUSTC $3 
    ACME_UC $5 
    AUTOP $1 
    CERTL $30 
    DINSTI $30 
    STNUM_R $10 
    STNAME_R $50 
    STDESIG_R $10 
    POSTDIR_R $10 
    UNITNUM_R $7 
    CITYTEXT_R $28 
    ZIP9_R $9 
    STATETEXT_R $28 
    ADDRESS_R $50 
    DETHNICE $3 
    NCHSBRIDGE $2 
    HISPOLDC $1 
    RACEOLDC $1 
    DMIDDLE $50 
    COD1A $120 
    INTERVAL1A $20 
    COD1B $120 
    INTERVAL1B $20 
    COD1C $120 
    INTERVAL1C $20 
    COD1D $120 
    INTERVAL1D $20 
    OTHERCONDITION $240 
    CERTFIRST $50 
    CERTMIDDLE $50 
    CERTLAST $50 
    MEC_YR $4 
    MEC_MO $2 
    MEC_DAY $2 
    MEC_NUMBER $8 
    AGE_CALC $3 
    MAR_WARD $6 
    MAR_CENSUS_TRACT $9 
  ;
  input DOD_YR -- MAR_CENSUS_TRACT ;
run;
NOTE: 3 records were read from the infile FIXED.
      The minimum record length was 313.
      The maximum record length was 381.
NOTE: The data set WORK.NEW has 3 observations and 62 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.03 seconds

 

Zula
Obsidian | Level 7

So cool--it worked:)

Thank you very much.

 

Zula
Obsidian | Level 7

Do all of the data lines take exactly two lines?

 

RE: No

 

Do all of the files have that structure?

RE: No, there are missing data (see attached a sample test data).

Do other files have a different number of lines of headers?

RE: I think I have tab and line breaks (it was transferred through ftp).

Does the number of lines of data per observation always match the number of header lines?

Re: I do not think so. 

If the number of variables changes between files is it constant per file?

Re: Yes, I guess you are asking about if I have missing data?

Zula
Obsidian | Level 7

When I read the infile with informat option, I still have the wrong data. All line breaks become a new observation. 

Thank you.

Tom
Super User Tom
Super User

Missing data should not matter if the TABs are always there between the values.

You mentioned 62 variables before.  Do you just have one file that has 62 variables and some number of observations?

Are there always 61 tabs that appear between the the 62 values (no matter how many lines they take)?

 

Is it possible it just looks confusing to you because some of the character variables contain end-of-line characters inside them so it makes the values for that observation look like it takes more than one line?  That does not seem likely since you posted a file where what looked like the header line with the column names also took two lines.

 

Is it possible what you have is the output of a process that was writing to a file with a maximum line length and so it wrapped some values to start on new lines?  Like what would happen if you used the PUT statement in a SAS program.

Zula
Obsidian | Level 7
Please see the text attached file (I have first rows with all the variables and sample 2-3 observations), there are not always 62 tabs because of missing data and random line breaks (new line).

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 1721 views
  • 6 likes
  • 4 in conversation