Hi,
I am importing some big txt files as it is attached. These text files are supposedly delimited by "|". However, somehow, there are some line breaks in some lines (for example line 3 and line 4). Since the dataset is super large, I cannot clean it from the original text files. I am trying to import it by infile. But it seems the output is not correct.
The code I've tried as follows:
data test;
infile "E:\Data\test.txt"
firstobs=2
delimiter="|" missover dsd IGNOREDOSEOF;
length NAME $100. PO_BOX $100. STREET $100. CITY $100. STATE $100.
ZIP $50. TYPE_OR_CLASS $100.;
input PAYER_PAYEE_ID PAYER_PAYEE_TYPE RCPT_DISB_TYPE NAME
PO_BOX STREET CITY STATE ZIP TYPE_OR_CLASS
ITEMIZED NON_ITEMIZED TOTAL RPT_ID;
run;
I've also tried to add
TERMSTR=CRLF
TERMSTR=LF
But none of them seems working.
I would appreciate it very much if someone can help out here. Thanks!
You can do it by manipulating the _INFILE_ automatic variable:
data test1;
length
PAYER_PAYEE_ID $ 9
PAYER_PAYEE_TYPE $ 14
RCPT_DISB_TYPE $ 4
NAME $ 46
PO_BOX $ 50
STREET $ 25
CITY $ 16
STATE $ 6
ZIP $ 5
TYPE_OR_CLASS $ 19
ITEMIZED 8
NON_ITEMIZED 8
TOTAL 8
RPT_ID 8
;
infile '$HOME/sascommunity/test.txt' dlm='|' dsd firstobs=2;
length inline $1000;
input@;
if countc(_infile_,'|') lt 13
then do;
inline = _infile_;
input;
input@;
_infile_ = strip(inline) !! _infile_;
end;
input @1
PAYER_PAYEE_ID
PAYER_PAYEE_TYPE
RCPT_DISB_TYPE
NAME
PO_BOX
STREET
CITY
STATE
ZIP
TYPE_OR_CLASS
ITEMIZED
NON_ITEMIZED
TOTAL
RPT_ID
;
drop inline;
run;
I can't look at the file's non-printables characters at the moment, however the most sensible solution seems to do something like this:
1. Read the data into the _INFILE_buffer.
input @;
2.1 Count the number of pipe characters in buffer _INFILE_
NBPIPE=countc(_INFILE_,'|')
2.2 Look at the first characters in the buffer (you want 4 digits and a pipe)
if prxmatch('/{\d}4\|/',substr(_INFILE_1,5))=1
3. Depending on what you find in the 2.1 and 2.2 tests, use an input statement to read a full line, or an input statement to read the first half, or the second half.
[Edit: corrected typo in 2.2 code]
Your subject does not match your problem description.
Your problem has been asked here before many times.
The basic idea is to copy the file to a new file that does not include those line breaks in the middle of line. The way to do that is to count how many actual delimiters have been seen.
Here is a link to one version. https://communities.sas.com/t5/SAS-Programming/Getting-Rid-of-Carriage-Returns-Embedded-in-a-Text-tx...
Tom, rewriting the input file was specifically excluded.
Or at least that's what I make of this ambiguous sentence:
> Since the data set is super large, I cannot clean it from the original text files.
@ChrisNZ wrote:
Tom, rewriting the input file was specifically excluded.
Or at least that's what I make of this ambiguous sentence:
> Since the data set is super large, I cannot clean it from the original text files.
I suspect that comment meant that they could not clean it BY HAND in a TEXT EDITOR.
They have to make at least one copy of the file to as data in a dataset. So making a temporary copy of the text file (which could be compressed using the ZIP filename engine) is probably not going to cause any extra trouble.
@Tom Fair enough. That's probably a more likely interpretation.
Looking into your data it appears that the data for some of the observations are distributed over two lines.
If - and only IF - you can be 100% sure that the source data for an observation always has the same number of delimiters then you could use the FLOWOVER default as there SAS will just read from the next line of source data until it used up all the variables listed in the input statement.
Below code mainly generated using the EG import wizard with only a few amendments to INFILE statement options.
DATA WORK.test;
LENGTH
PAYER_PAYEE_ID $ 9
PAYER_PAYEE_TYPE $ 14
RCPT_DISB_TYPE $ 4
NAME $ 46
PO_BOX $ 50
STREET $ 25
CITY $ 16
STATE $ 6
ZIP 8
TYPE_OR_CLASS $ 19
ITEMIZED 8
NON_ITEMIZED 8
TOTAL 8
RPT_ID 8 ;
FORMAT
PAYER_PAYEE_ID $CHAR9.
PAYER_PAYEE_TYPE $CHAR14.
RCPT_DISB_TYPE $CHAR4.
NAME $CHAR46.
PO_BOX $CHAR50.
STREET $CHAR25.
CITY $CHAR16.
STATE $CHAR6.
ZIP BEST6.
TYPE_OR_CLASS $CHAR19.
ITEMIZED BEST6.
NON_ITEMIZED BEST4.
TOTAL BEST6.
RPT_ID BEST6. ;
INFORMAT
PAYER_PAYEE_ID $CHAR9.
PAYER_PAYEE_TYPE $CHAR14.
RCPT_DISB_TYPE $CHAR4.
NAME $CHAR46.
PO_BOX $CHAR50.
STREET $CHAR25.
CITY $CHAR16.
STATE $CHAR6.
ZIP BEST6.
TYPE_OR_CLASS $CHAR19.
ITEMIZED BEST6.
NON_ITEMIZED BEST4.
TOTAL BEST6.
RPT_ID BEST6. ;
INFILE '~/test/test.txt'
LRECL=256
firstobs=2
DLM='|'
flowover
DSD
termstr=CRLF
;
INPUT
PAYER_PAYEE_ID : $CHAR9.
PAYER_PAYEE_TYPE : $CHAR14.
RCPT_DISB_TYPE : $CHAR4.
NAME : $CHAR46.
PO_BOX : $CHAR50.
STREET : $CHAR25.
CITY : $CHAR16.
STATE : $CHAR6.
ZIP : ?? BEST6.
TYPE_OR_CLASS : $CHAR19.
ITEMIZED : ?? BEST6.
NON_ITEMIZED : ?? BEST4.
TOTAL : ?? BEST6.
RPT_ID : ?? BEST6. ;
RUN;
The FLOWOVER option will only work for line breaks that are inserted at the beginning or end of a field. If the break is in the middle of a field then it will cause that field to look like two fields and SAS will start reading the following values into the wrong variables.
You can do it by manipulating the _INFILE_ automatic variable:
data test1;
length
PAYER_PAYEE_ID $ 9
PAYER_PAYEE_TYPE $ 14
RCPT_DISB_TYPE $ 4
NAME $ 46
PO_BOX $ 50
STREET $ 25
CITY $ 16
STATE $ 6
ZIP $ 5
TYPE_OR_CLASS $ 19
ITEMIZED 8
NON_ITEMIZED 8
TOTAL 8
RPT_ID 8
;
infile '$HOME/sascommunity/test.txt' dlm='|' dsd firstobs=2;
length inline $1000;
input@;
if countc(_infile_,'|') lt 13
then do;
inline = _infile_;
input;
input@;
_infile_ = strip(inline) !! _infile_;
end;
input @1
PAYER_PAYEE_ID
PAYER_PAYEE_TYPE
RCPT_DISB_TYPE
NAME
PO_BOX
STREET
CITY
STATE
ZIP
TYPE_OR_CLASS
ITEMIZED
NON_ITEMIZED
TOTAL
RPT_ID
;
drop inline;
run;
If you look at the source data as attached...
...then I believe FLOWOVER is what's required.
If one can't rely that there is always a constant number of delimiters per output observation then things would become much more complicated and I guess the only way to read the data would be to somehow analyse the input buffer first to then decide which variables to map against it and when to write a new observation.
Flowover causes a skip in columns, as the linefeed is also taken for a delimiter.
I ran this:
DATA WORK.test;
LENGTH
PAYER_PAYEE_ID $ 9
PAYER_PAYEE_TYPE $ 14
RCPT_DISB_TYPE $ 4
NAME $ 46
PO_BOX $ 50
STREET $ 25
CITY $ 16
STATE $ 6
ZIP 8
TYPE_OR_CLASS $ 19
ITEMIZED 8
NON_ITEMIZED 8
TOTAL 8
RPT_ID 8 ;
FORMAT
PAYER_PAYEE_ID $CHAR9.
PAYER_PAYEE_TYPE $CHAR14.
RCPT_DISB_TYPE $CHAR4.
NAME $CHAR46.
PO_BOX $CHAR50.
STREET $CHAR25.
CITY $CHAR16.
STATE $CHAR6.
ZIP BEST6.
TYPE_OR_CLASS $CHAR19.
ITEMIZED BEST6.
NON_ITEMIZED BEST4.
TOTAL BEST6.
RPT_ID BEST6. ;
INFORMAT
PAYER_PAYEE_ID $CHAR9.
PAYER_PAYEE_TYPE $CHAR14.
RCPT_DISB_TYPE $CHAR4.
NAME $CHAR46.
PO_BOX $CHAR50.
STREET $CHAR25.
CITY $CHAR16.
STATE $CHAR6.
ZIP BEST6.
TYPE_OR_CLASS $CHAR19.
ITEMIZED BEST6.
NON_ITEMIZED BEST4.
TOTAL BEST6.
RPT_ID BEST6. ;
INFILE '$HOME/sascommunity/test.txt'
LRECL=256
firstobs=2
DLM='|'
flowover
DSD
termstr=CRLF
;
INPUT
PAYER_PAYEE_ID : $CHAR9.
PAYER_PAYEE_TYPE : $CHAR14.
RCPT_DISB_TYPE : $CHAR4.
NAME : $CHAR46.
PO_BOX : $CHAR50.
STREET : $CHAR25.
CITY : $CHAR16.
STATE : $CHAR6.
ZIP : ?? BEST6.
TYPE_OR_CLASS : $CHAR19.
ITEMIZED : ?? BEST6.
NON_ITEMIZED : ?? BEST4.
TOTAL : ?? BEST6.
RPT_ID : ?? BEST6. ;
RUN;
data test_k;
length
PAYER_PAYEE_ID $ 9
PAYER_PAYEE_TYPE $ 14
RCPT_DISB_TYPE $ 4
NAME $ 46
PO_BOX $ 50
STREET $ 25
CITY $ 16
STATE $ 6
ZIP $ 5
TYPE_OR_CLASS $ 19
ITEMIZED 8
NON_ITEMIZED 8
TOTAL 8
RPT_ID 8
;
infile '$HOME/sascommunity/test.txt' dlm='|' dsd firstobs=2 termstr=CRLF;
length inline $1000;
input@;
if countc(_infile_,'|') lt 13
then do;
inline = _infile_;
input;
input@;
_infile_ = strip(inline) !! _infile_;
end;
input @1
PAYER_PAYEE_ID
PAYER_PAYEE_TYPE
RCPT_DISB_TYPE
NAME
PO_BOX
STREET
CITY
STATE
ZIP
TYPE_OR_CLASS
ITEMIZED
NON_ITEMIZED
TOTAL
RPT_ID
;
drop inline;
run;
proc print data=test noobs;
run;
proc print data=test_k noobs;
run;
and got this:
08:29 Thursday, February 13, 2020 1 RCPT_ PAYER_ PAYER_PAYEE_ DISB_ PAYEE_ID TYPE TYPE NAME PO_BOX 5369 1001 1201 NATIONAL ELEVATOR INDUSTRY EDUCATIONAL PROGRAM 5370 1001 1201 NATIONAL ELEVATOR INDUSTRY BENEFIT PLANS 5391 1002 501 MYERS WEINBERG, LLP 5392 1002 501 DANIEL A. WINTERS & COMPANY 5393 1002 501 O'DONOGHUE & O'DONOGHUE 5410 1002 504 MICROSEARCH CORPORATION 5411 1002 504 NETIFICE COMMUNICATIONS 5417 1002 504 DANIEL A. WINTERS & COMPANY NON_ STREET CITY STATE ZIP TYPE_OR_CLASS ITEMIZED ITEMIZED TOTAL RPT_ID 11 LARSEN WAY ATTLEBORO FALLS MA 2763 APPRENTICE FUND 6274 0 6274 248665 19 CAMPUS BLVD. SUITE 200 NEWTOW . 19073 . 0 36300 36300 240 GRAHAM AVE WINNIPEG, CANADA . LEGAL 30350 1305 31655 248665 6 DICKINSON DRIVE SUITE 205 CHADDS . 19317 . 0 8428 8428 4748 WISCONSIN AVE., N.W. WASHINGTON DC 20016 LEGAL 381709 5600 387309 248665 999 BROADWAY SUITE 101 SAUGUS . 01906 . 0 49776 49776 DEPARTMENT LA 22231 PASADENA CA 91185 T1 SERVICE PROVIDER 0 8652 8652 248665 6 DICKINSON DRIVE SUITE 205 CHADDS . 19317 . 16E4 0 162368 08:29 Thursday, February 13, 2020 2 PAYER_ RCPT_ PAYER_ PAYEE_ DISB_ PAYEE_ID TYPE TYPE NAME PO_BOX STREET 5369 1001 1201 NATIONAL ELEVATOR INDUSTRY EDUCATIONAL PROGRAM 11 LARSEN WAY 5370 1001 1201 NATIONAL ELEVATOR INDUSTRY BENEFIT PLANS 19 CAMPUS BLVD.SUITE 200 5391 1002 501 MYERS WEINBERG, LLP 240 GRAHAM AVE 5392 1002 501 DANIEL A. WINTERS & COMPANY 6 DICKINSON DRIVESUITE 20 5393 1002 501 O'DONOGHUE & O'DONOGHUE 4748 WISCONSIN AVE., N.W. 5410 1002 504 MICROSEARCH CORPORATION 999 BROADWAYSUITE 101 5411 1002 504 NETIFICE COMMUNICATIONS DEPARTMENT LA 22231 5417 1002 504 DANIEL A. WINTERS & COMPANY 6 DICKINSON DRIVESUITE 20 NON_ CITY STATE ZIP TYPE_OR_CLASS ITEMIZED ITEMIZED TOTAL RPT_ID ATTLEBORO FALLS MA 02763 APPRENTICE FUND 6274 0 6274 248665 NEWTOWN SQUARE PA 19073 PENSION & HEALTH BE 0 36300 36300 248665 WINNIPEG, CANADA LEGAL 30350 1305 31655 248665 CHADDS FORD PA 19317 AUDITORS 0 8428 8428 248665 WASHINGTON DC 20016 LEGAL 381709 5600 387309 248665 SAUGUS MA 01906 COMPUTER CONSULTING 0 49776 49776 248665 PASADENA CA 91185 T1 SERVICE PROVIDER 0 8652 8652 248665 CHADDS FORD PA 19317 AUDITORS 162368 0 162368 248665
One just has to look at the last column to see the effect.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.