Hi.
I wanted to import text file into SAS, but this file has 62 variables and tab and line spaced (see an example data with only a few variables). How can I read this file correctly into SAS if I do not know the variables' format and length?
Thank you.
Looks like some process has inserted random line breaks at around 256 characters. Probably some type of misguided download process or strange file specification on the program that made the file.
But it looks like you can FIX the file to have one line per observation by just counting the number of tabs that you have read so far.
FILENAME rawinput "c:\downloads\test.txt" ;
filename fixed temp;
data _null_;
infile rawinput termstr=crlf end=eof;
file fixed termstr=crlf ;
do until(eof or ntabs > 60);
input ;
put _infile_ @ ;
ntabs = sum(ntabs,countc(_infile_,'09'x));
end;
put;
putlog _n_= ntabs=;
run;
Now you can read the new file using normal code. For example here is your program fixed up.
data new;
infile fixed dsd dlm='09'x firstobs=2 termstr=crlf truncover ;
length
DOD_YR $4
DSTATE $2
FILENO $6
GNAME $50
MNAME $1
LNAME $50
SEX $1
SSN $9
AGETYPE $1
AGE $3
DOB_YR $4
DOB_MO $2
DOB_DY $2
BPLACE_CNT $2
BPLACE_ST $2
STATEC $2
MARITAL $1
DPLACE $1
DOD_MO $2
DOD_DY $2
DEDUC $1
OCCUP $40
OCCUPC $3
INDUST $40
INDUSTC $3
ACME_UC $5
AUTOP $1
CERTL $30
DINSTI $30
STNUM_R $10
STNAME_R $50
STDESIG_R $10
POSTDIR_R $10
UNITNUM_R $7
CITYTEXT_R $28
ZIP9_R $9
STATETEXT_R $28
ADDRESS_R $50
DETHNICE $3
NCHSBRIDGE $2
HISPOLDC $1
RACEOLDC $1
DMIDDLE $50
COD1A $120
INTERVAL1A $20
COD1B $120
INTERVAL1B $20
COD1C $120
INTERVAL1C $20
COD1D $120
INTERVAL1D $20
OTHERCONDITION $240
CERTFIRST $50
CERTMIDDLE $50
CERTLAST $50
MEC_YR $4
MEC_MO $2
MEC_DAY $2
MEC_NUMBER $8
AGE_CALC $3
MAR_WARD $6
MAR_CENSUS_TRACT $9
;
input DOD_YR -- MAR_CENSUS_TRACT ;
run;
NOTE: 3 records were read from the infile FIXED. The minimum record length was 313. The maximum record length was 381. NOTE: The data set WORK.NEW has 3 observations and 62 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.03 seconds
Nope, very random. Some variables with a long string and short depending on the specific ID.
@tomrvincent wrote:
So no way to know when one record ends and a new one begins.
I'd bounce the file and tell the supplier to provide it in a standard format and structure.
Agree.
And provide details of how you, as a recipient are supposed to know; 1) where a record starts, 2) the contents and 3) expected formats and lengths values.
We probably need more examples to see if there is a pattern.
The example you showed seems to have two header lines and two data lines.
Do all of the data lines take exactly two lines?
Do all of the files have that structure?
Do other files have a different number of lines of headers?
Does the number of lines of data per observation always match the number of header lines?
If the number of variables changes between files is it constant per file?
FILENAME rawinput "\\ZULA\test.txt" ;
data new;
retain DOD_YR;
infile rawinput dlm='09'X firstobs=2 missover truncover ;
informat DOD_YR $4.;
informat DSTATE $2.;
informat FILENO $6.;
informat GNAME $50.;
informat MNAME $1.;
informat LNAME $50.;
informat SEX $1.;
informat SSN $9.;
informat AGETYPE $1.;
informat AGE $3.;
informat DOB_YR $4.;
informat DOB_MO $2.;
informat DOB_DY $2.;
informat BPLACE_CNT $2.;
informat BPLACE_ST $2.;
informat STATEC $2.;
informat MARITAL $1.;
informat DPLACE $1.;
informat DOD_MO $2.;
informat DOD_DY $2.;
informat DEDUC $1.;
informat OCCUP $40.;
informat OCCUPC $3.;
informat INDUST $40.;
informat INDUSTC $3.;
informat ACME_UC $5.;
informat AUTOP $1.;
informat CERTL $30.;
informat DINSTI $30.;
informat STNUM_R $10.;
informat STNAME_R $50.;
informat STDESIG_R $10.;
informat POSTDIR_R $10.;
informat UNITNUM_R $7.;
informat CITYTEXT_R $28.;
informat ZIP9_R $9.;
informat STATETEXT_R $28.;
informat ADDRESS_R $50.;
informat DETHNICE $3.;
informat NCHSBRIDGE $2.;
informat HISPOLDC $1.;
informat RACEOLDC $1.;
informat DMIDDLE $50.;
informat COD1A $120.;
informat INTERVAL1A $20.;
informat COD1B $120.;
informat INTERVAL1B $20.;
informat COD1C $120.;
informat INTERVAL1C $20.;
informat COD1D $120.;
informat INTERVAL1D $20.;
informat OTHERCONDITION $240.;
informat CERTFIRST $50.;
informat CERTMIDDLE $50.;
informat CERTLAST $50.;
informat MEC_YR $4.;
informat MEC_MO $2.;
informat MEC_DAY $2.;
informat MEC_NUMBER $8.;
informat AGE_CALC $3.;
informat MAR_WARD $6.;
informat MAR_CENSUS_TRACT $9.;
;
input
DOD_YR $ DSTATE $ FILENO $
GNAME $ MNAME $ LNAME $
SEX $ SSN $ AGETYPE $ AGE $
DOB_YR $ DOB_MO $ DOB_DY $
BPLACE_CNT $ BPLACE_ST $
STATEC $ MARITAL $
DPLACE $ DOD_MO $ DOD_DY $
DEDUC $ OCCUP $ OCCUPC $ INDUST $
INDUSTC $
ACME_UC $
AUTOP $
CERTL $
DINSTI $
STNUM_R $
STNAME_R $
STDESIG_R $
POSTDIR_R $
UNITNUM_R $
CITYTEXT_R $
ZIP9_R $
STATETEXT_R $
ADDRESS_R $
DETHNICE $
NCHSBRIDGE $
HISPOLDC $
RACEOLDC $
DMIDDLE $
COD1A $
INTERVAL1A $
COD1B $
INTERVAL1B $
COD1C $
INTERVAL1C $
COD1D $
INTERVAL1D $
OTHERCONDITION $
CERTFIRST $
CERTMIDDLE $
CERTLAST $
MEC_YR $
MEC_MO $
MEC_DAY $
MEC_NUMBER $
AGE_CALC $
MAR_WARD $
MAR_CENSUS_TRACT $ ;
run;
Looks like some process has inserted random line breaks at around 256 characters. Probably some type of misguided download process or strange file specification on the program that made the file.
But it looks like you can FIX the file to have one line per observation by just counting the number of tabs that you have read so far.
FILENAME rawinput "c:\downloads\test.txt" ;
filename fixed temp;
data _null_;
infile rawinput termstr=crlf end=eof;
file fixed termstr=crlf ;
do until(eof or ntabs > 60);
input ;
put _infile_ @ ;
ntabs = sum(ntabs,countc(_infile_,'09'x));
end;
put;
putlog _n_= ntabs=;
run;
Now you can read the new file using normal code. For example here is your program fixed up.
data new;
infile fixed dsd dlm='09'x firstobs=2 termstr=crlf truncover ;
length
DOD_YR $4
DSTATE $2
FILENO $6
GNAME $50
MNAME $1
LNAME $50
SEX $1
SSN $9
AGETYPE $1
AGE $3
DOB_YR $4
DOB_MO $2
DOB_DY $2
BPLACE_CNT $2
BPLACE_ST $2
STATEC $2
MARITAL $1
DPLACE $1
DOD_MO $2
DOD_DY $2
DEDUC $1
OCCUP $40
OCCUPC $3
INDUST $40
INDUSTC $3
ACME_UC $5
AUTOP $1
CERTL $30
DINSTI $30
STNUM_R $10
STNAME_R $50
STDESIG_R $10
POSTDIR_R $10
UNITNUM_R $7
CITYTEXT_R $28
ZIP9_R $9
STATETEXT_R $28
ADDRESS_R $50
DETHNICE $3
NCHSBRIDGE $2
HISPOLDC $1
RACEOLDC $1
DMIDDLE $50
COD1A $120
INTERVAL1A $20
COD1B $120
INTERVAL1B $20
COD1C $120
INTERVAL1C $20
COD1D $120
INTERVAL1D $20
OTHERCONDITION $240
CERTFIRST $50
CERTMIDDLE $50
CERTLAST $50
MEC_YR $4
MEC_MO $2
MEC_DAY $2
MEC_NUMBER $8
AGE_CALC $3
MAR_WARD $6
MAR_CENSUS_TRACT $9
;
input DOD_YR -- MAR_CENSUS_TRACT ;
run;
NOTE: 3 records were read from the infile FIXED. The minimum record length was 313. The maximum record length was 381. NOTE: The data set WORK.NEW has 3 observations and 62 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.03 seconds
So cool--it worked:)
Thank you very much.
Do all of the data lines take exactly two lines?
RE: No
Do all of the files have that structure?
RE: No, there are missing data (see attached a sample test data).
Do other files have a different number of lines of headers?
RE: I think I have tab and line breaks (it was transferred through ftp).
Does the number of lines of data per observation always match the number of header lines?
Re: I do not think so.
If the number of variables changes between files is it constant per file?
Re: Yes, I guess you are asking about if I have missing data?
When I read the infile with informat option, I still have the wrong data. All line breaks become a new observation.
Thank you.
Missing data should not matter if the TABs are always there between the values.
You mentioned 62 variables before. Do you just have one file that has 62 variables and some number of observations?
Are there always 61 tabs that appear between the the 62 values (no matter how many lines they take)?
Is it possible it just looks confusing to you because some of the character variables contain end-of-line characters inside them so it makes the values for that observation look like it takes more than one line? That does not seem likely since you posted a file where what looked like the header line with the column names also took two lines.
Is it possible what you have is the output of a process that was writing to a file with a maximum line length and so it wrapped some values to start on new lines? Like what would happen if you used the PUT statement in a SAS program.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.