Solved: Re: How to import tab and line spaced text data

Zula · Posted 08-26-2019 10:56 AM

Hi.

I wanted to import text file into SAS, but this file has 62 variables and tab and line spaced (see an example data with only a few variables). How can I read this file correctly into SAS if I do not know the variables' format and length?

Thank you.

Tom · Posted 08-26-2019 10:06 PM

Looks like some process has inserted random line breaks at around 256 characters. Probably some type of misguided download process or strange file specification on the program that made the file.

But it looks like you can FIX the file to have one line per observation by just counting the number of tabs that you have read so far.

FILENAME rawinput "c:\downloads\test.txt" ;
filename fixed temp;

data _null_;
  infile rawinput termstr=crlf end=eof;
  file fixed termstr=crlf ;
  do until(eof or ntabs > 60);
    input ;
    put _infile_ @ ;
    ntabs = sum(ntabs,countc(_infile_,'09'x));
  end;
  put;
  putlog _n_= ntabs=;
run;

Now you can read the new file using normal code. For example here is your program fixed up.

Uses DSD option on INFILE statement to properly parse when there are missing values.
Removes the old style MISSOVER option (which was being overridden by the TRUNCOVER option that followed it anyway).
Uses LENGTH to define the variables instead of attaching unneeded informats. That way SAS does not have to guess that it how it should define the variables by assuming that the width of the informat that has been attached should match the length of the variable to be created.
Simplifies the INPUT statement since the variables have already been defined in the order they appear in the file.

data new;
  infile fixed dsd dlm='09'x firstobs=2 termstr=crlf truncover ;
  length   
    DOD_YR $4 
    DSTATE $2 
    FILENO $6 
    GNAME $50 
    MNAME $1 
    LNAME $50 
    SEX $1 
    SSN $9 
    AGETYPE $1 
    AGE $3 
    DOB_YR $4 
    DOB_MO $2 
    DOB_DY $2 
    BPLACE_CNT $2 
    BPLACE_ST $2 
    STATEC $2 
    MARITAL $1 
    DPLACE $1 
    DOD_MO $2 
    DOD_DY $2 
    DEDUC $1 
    OCCUP $40 
    OCCUPC $3 
    INDUST $40 
    INDUSTC $3 
    ACME_UC $5 
    AUTOP $1 
    CERTL $30 
    DINSTI $30 
    STNUM_R $10 
    STNAME_R $50 
    STDESIG_R $10 
    POSTDIR_R $10 
    UNITNUM_R $7 
    CITYTEXT_R $28 
    ZIP9_R $9 
    STATETEXT_R $28 
    ADDRESS_R $50 
    DETHNICE $3 
    NCHSBRIDGE $2 
    HISPOLDC $1 
    RACEOLDC $1 
    DMIDDLE $50 
    COD1A $120 
    INTERVAL1A $20 
    COD1B $120 
    INTERVAL1B $20 
    COD1C $120 
    INTERVAL1C $20 
    COD1D $120 
    INTERVAL1D $20 
    OTHERCONDITION $240 
    CERTFIRST $50 
    CERTMIDDLE $50 
    CERTLAST $50 
    MEC_YR $4 
    MEC_MO $2 
    MEC_DAY $2 
    MEC_NUMBER $8 
    AGE_CALC $3 
    MAR_WARD $6 
    MAR_CENSUS_TRACT $9 
  ;
  input DOD_YR -- MAR_CENSUS_TRACT ;
run;

NOTE: 3 records were read from the infile FIXED.
      The minimum record length was 313.
      The maximum record length was 381.
NOTE: The data set WORK.NEW has 3 observations and 62 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.03 seconds

View solution in original post

tomrvincent · Posted 08-26-2019 11:09 AM

Is it always going to be a pair of lines defining a record?

Zula · Posted 08-26-2019 11:19 AM

Nope, very random. Some variables with a long string and short depending on the specific ID.

tomrvincent · Posted 08-26-2019 11:40 AM

So no way to know when one record ends and a new one begins.

I'd bounce the file and tell the supplier to provide it in a standard format and structure.

ballardw · Posted 08-26-2019 11:55 AM

@tomrvincent wrote:
So no way to know when one record ends and a new one begins.

I'd bounce the file and tell the supplier to provide it in a standard format and structure.

Agree.

And provide details of how you, as a recipient are supposed to know; 1) where a record starts, 2) the contents and 3) expected formats and lengths values.

Zula · Posted 08-26-2019 07:56 PM

I have informat for all variables, but it is still not working. I think there are many line breaks for many string variables. I used truncover missover, but those do not help.

Tom · Posted 08-26-2019 09:09 PM

We probably need more examples to see if there is a pattern.

The example you showed seems to have two header lines and two data lines.

Do all of the data lines take exactly two lines?

Do all of the files have that structure?

Do other files have a different number of lines of headers?

Does the number of lines of data per observation always match the number of header lines?

If the number of variables changes between files is it constant per file?

Zula · Posted 08-26-2019 09:34 PM

FILENAME rawinput "\\ZULA\test.txt" ;

data new;
retain DOD_YR;
infile rawinput dlm='09'X firstobs=2 missover truncover ;
informat DOD_YR $4.;
informat DSTATE $2.;
informat FILENO $6.;
informat GNAME $50.;
informat MNAME $1.;
informat LNAME $50.;
informat SEX $1.;
informat SSN $9.;
informat AGETYPE $1.;
informat AGE $3.;
informat DOB_YR $4.;
informat DOB_MO $2.;
informat DOB_DY $2.;
informat BPLACE_CNT $2.;
informat BPLACE_ST $2.;
informat STATEC $2.;
informat MARITAL $1.;
informat DPLACE $1.;
informat DOD_MO $2.;
informat DOD_DY $2.;
informat DEDUC $1.;
informat OCCUP $40.;
informat OCCUPC $3.;
informat INDUST $40.;
informat INDUSTC $3.;
informat ACME_UC $5.;
informat AUTOP $1.;
informat CERTL $30.;
informat DINSTI $30.;
informat STNUM_R $10.;
informat STNAME_R $50.;
informat STDESIG_R $10.;
informat POSTDIR_R $10.;
informat UNITNUM_R $7.;
informat CITYTEXT_R $28.;
informat ZIP9_R $9.;
informat STATETEXT_R $28.;
informat ADDRESS_R $50.;
informat DETHNICE $3.;
informat NCHSBRIDGE $2.;
informat HISPOLDC $1.;
informat RACEOLDC $1.;
informat DMIDDLE $50.;
informat COD1A $120.;
informat INTERVAL1A $20.;
informat COD1B $120.;
informat INTERVAL1B $20.;
informat COD1C $120.;
informat INTERVAL1C $20.;
informat COD1D $120.;
informat INTERVAL1D $20.;
informat OTHERCONDITION $240.;
informat CERTFIRST $50.;
informat CERTMIDDLE $50.;
informat CERTLAST $50.;
informat MEC_YR $4.;
informat MEC_MO $2.;
informat MEC_DAY $2.;
informat MEC_NUMBER $8.;
informat AGE_CALC $3.;
informat MAR_WARD $6.;
informat MAR_CENSUS_TRACT $9.;
;
input
DOD_YR $ DSTATE $ FILENO $
GNAME $ MNAME $ LNAME $
SEX $ SSN $ AGETYPE $ AGE $
DOB_YR $ DOB_MO $ DOB_DY $
BPLACE_CNT $ BPLACE_ST $
STATEC $ MARITAL $
DPLACE $ DOD_MO $ DOD_DY $
DEDUC $ OCCUP $ OCCUPC $ INDUST $
INDUSTC $
ACME_UC $
AUTOP $
CERTL $
DINSTI $
STNUM_R $
STNAME_R $
STDESIG_R $
POSTDIR_R $
UNITNUM_R $
CITYTEXT_R $
ZIP9_R $
STATETEXT_R $
ADDRESS_R $
DETHNICE $
NCHSBRIDGE $
HISPOLDC $
RACEOLDC $
DMIDDLE $
COD1A $
INTERVAL1A $
COD1B $
INTERVAL1B $
COD1C $
INTERVAL1C $
COD1D $
INTERVAL1D $
OTHERCONDITION $
CERTFIRST $
CERTMIDDLE $
CERTLAST $
MEC_YR $
MEC_MO $
MEC_DAY $
MEC_NUMBER $
AGE_CALC $
MAR_WARD $
MAR_CENSUS_TRACT $ ;
run;

Tom · Posted 08-26-2019 10:06 PM

Looks like some process has inserted random line breaks at around 256 characters. Probably some type of misguided download process or strange file specification on the program that made the file.

But it looks like you can FIX the file to have one line per observation by just counting the number of tabs that you have read so far.

FILENAME rawinput "c:\downloads\test.txt" ;
filename fixed temp;

data _null_;
  infile rawinput termstr=crlf end=eof;
  file fixed termstr=crlf ;
  do until(eof or ntabs > 60);
    input ;
    put _infile_ @ ;
    ntabs = sum(ntabs,countc(_infile_,'09'x));
  end;
  put;
  putlog _n_= ntabs=;
run;

Now you can read the new file using normal code. For example here is your program fixed up.

Uses DSD option on INFILE statement to properly parse when there are missing values.
Removes the old style MISSOVER option (which was being overridden by the TRUNCOVER option that followed it anyway).
Uses LENGTH to define the variables instead of attaching unneeded informats. That way SAS does not have to guess that it how it should define the variables by assuming that the width of the informat that has been attached should match the length of the variable to be created.
Simplifies the INPUT statement since the variables have already been defined in the order they appear in the file.

data new;
  infile fixed dsd dlm='09'x firstobs=2 termstr=crlf truncover ;
  length   
    DOD_YR $4 
    DSTATE $2 
    FILENO $6 
    GNAME $50 
    MNAME $1 
    LNAME $50 
    SEX $1 
    SSN $9 
    AGETYPE $1 
    AGE $3 
    DOB_YR $4 
    DOB_MO $2 
    DOB_DY $2 
    BPLACE_CNT $2 
    BPLACE_ST $2 
    STATEC $2 
    MARITAL $1 
    DPLACE $1 
    DOD_MO $2 
    DOD_DY $2 
    DEDUC $1 
    OCCUP $40 
    OCCUPC $3 
    INDUST $40 
    INDUSTC $3 
    ACME_UC $5 
    AUTOP $1 
    CERTL $30 
    DINSTI $30 
    STNUM_R $10 
    STNAME_R $50 
    STDESIG_R $10 
    POSTDIR_R $10 
    UNITNUM_R $7 
    CITYTEXT_R $28 
    ZIP9_R $9 
    STATETEXT_R $28 
    ADDRESS_R $50 
    DETHNICE $3 
    NCHSBRIDGE $2 
    HISPOLDC $1 
    RACEOLDC $1 
    DMIDDLE $50 
    COD1A $120 
    INTERVAL1A $20 
    COD1B $120 
    INTERVAL1B $20 
    COD1C $120 
    INTERVAL1C $20 
    COD1D $120 
    INTERVAL1D $20 
    OTHERCONDITION $240 
    CERTFIRST $50 
    CERTMIDDLE $50 
    CERTLAST $50 
    MEC_YR $4 
    MEC_MO $2 
    MEC_DAY $2 
    MEC_NUMBER $8 
    AGE_CALC $3 
    MAR_WARD $6 
    MAR_CENSUS_TRACT $9 
  ;
  input DOD_YR -- MAR_CENSUS_TRACT ;
run;

NOTE: 3 records were read from the infile FIXED.
      The minimum record length was 313.
      The maximum record length was 381.
NOTE: The data set WORK.NEW has 3 observations and 62 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.03 seconds

Zula · Posted 08-26-2019 10:17 PM

So cool--it worked:)

Thank you very much.

Zula · Posted 08-26-2019 09:37 PM

Do all of the data lines take exactly two lines?

RE: No

Do all of the files have that structure?

RE: No, there are missing data (see attached a sample test data).

Do other files have a different number of lines of headers?

RE: I think I have tab and line breaks (it was transferred through ftp).

Does the number of lines of data per observation always match the number of header lines?

Re: I do not think so.

If the number of variables changes between files is it constant per file?

Re: Yes, I guess you are asking about if I have missing data?

Zula · Posted 08-26-2019 09:39 PM

When I read the infile with informat option, I still have the wrong data. All line breaks become a new observation.

Thank you.

Tom · Posted 08-26-2019 09:46 PM

Missing data should not matter if the TABs are always there between the values.

You mentioned 62 variables before. Do you just have one file that has 62 variables and some number of observations?

Are there always 61 tabs that appear between the the 62 values (no matter how many lines they take)?

Is it possible it just looks confusing to you because some of the character variables contain end-of-line characters inside them so it makes the values for that observation look like it takes more than one line? That does not seem likely since you posted a file where what looked like the header line with the column names also took two lines.

Is it possible what you have is the output of a process that was writing to a file with a maximum line length and so it wrapped some values to start on new lines? Like what would happen if you used the PUT statement in a SAS program.

Zula · Posted 08-26-2019 09:56 PM

Please see the text attached file (I have first rows with all the variables and sample 2-3 observations), there are not always 62 tabs because of missing data and random line breaks (new line).