BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
daradanye
Obsidian | Level 7

Hi,

 

I am importing some big txt files as it is attached.  These text files are supposedly delimited by "|".  However, somehow, there are some line breaks in some lines (for example line 3 and line 4).  Since the dataset is super large, I cannot clean it from the original text files. I am trying to import it by infile. But it seems the output is not correct.

 

The code I've tried as follows:

 

 

data test;
infile "E:\Data\test.txt"
firstobs=2
delimiter="|"  missover dsd IGNOREDOSEOF;
length NAME $100. PO_BOX $100. STREET $100. CITY $100. STATE $100.
ZIP $50. TYPE_OR_CLASS $100.;

input PAYER_PAYEE_ID PAYER_PAYEE_TYPE RCPT_DISB_TYPE NAME 
PO_BOX STREET CITY STATE ZIP TYPE_OR_CLASS 
ITEMIZED NON_ITEMIZED TOTAL RPT_ID;

run;

I've also tried to add

TERMSTR=CRLF
TERMSTR=LF

But none of them seems working.

 

I would appreciate it very much if someone can help out here.  Thanks!

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

You can do it by manipulating the _INFILE_ automatic variable:

data test1;
length
  PAYER_PAYEE_ID   $ 9
  PAYER_PAYEE_TYPE $ 14
  RCPT_DISB_TYPE   $ 4
  NAME             $ 46
  PO_BOX           $ 50
  STREET           $ 25
  CITY             $ 16
  STATE            $ 6
  ZIP              $ 5
  TYPE_OR_CLASS    $ 19
  ITEMIZED           8
  NON_ITEMIZED       8
  TOTAL              8
  RPT_ID             8
;
infile '$HOME/sascommunity/test.txt' dlm='|' dsd firstobs=2;
length inline $1000;
input@;
if countc(_infile_,'|') lt 13
then do;
  inline = _infile_;
  input;
  input@;
  _infile_ = strip(inline) !! _infile_;
end;
input @1
  PAYER_PAYEE_ID
  PAYER_PAYEE_TYPE
  RCPT_DISB_TYPE
  NAME
  PO_BOX
  STREET
  CITY
  STATE
  ZIP
  TYPE_OR_CLASS
  ITEMIZED
  NON_ITEMIZED
  TOTAL
  RPT_ID
;
drop inline;
run;

View solution in original post

10 REPLIES 10
ChrisNZ
Tourmaline | Level 20

I can't look at the file's non-printables characters at the moment, however the most sensible solution seems to do something like this:

 

1. Read the data into the _INFILE_buffer.

input @;

2.1 Count the number of pipe characters in buffer _INFILE_   

NBPIPE=countc(_INFILE_,'|')

2.2 Look at the first characters in the buffer (you want 4 digits and a pipe)

if prxmatch('/{\d}4\|/',substr(_INFILE_1,5))=1

 

3. Depending on what you find in the 2.1 and 2.2 tests, use an input statement to read a full line, or an input statement to read the first half, or the second half.

 

[Edit: corrected typo in 2.2 code]

 

 

Tom
Super User Tom
Super User

Your subject does not match your problem description.

Your problem has been asked here before many times.

The basic idea is to copy the file to a new file that does not include those line breaks in the middle of line. The way to do that is to count how many actual delimiters have been seen.

 

Here is a link to one version. https://communities.sas.com/t5/SAS-Programming/Getting-Rid-of-Carriage-Returns-Embedded-in-a-Text-tx...

ChrisNZ
Tourmaline | Level 20

Tom, rewriting the input file was specifically excluded. 

Or at least that's what I make of this ambiguous sentence:

> Since the data set is super large, I cannot clean it from the original text files. 

 

Tom
Super User Tom
Super User

@ChrisNZ wrote:

Tom, rewriting the input file was specifically excluded. 

Or at least that's what I make of this ambiguous sentence:

> Since the data set is super large, I cannot clean it from the original text files. 

 


I suspect that comment meant that they could not clean it BY HAND in a TEXT EDITOR.

They have to make at least one copy of the file to as data in a dataset.  So making a temporary copy of the text file (which could be compressed using the ZIP filename engine) is probably not going to cause any extra trouble.

ChrisNZ
Tourmaline | Level 20

@Tom Fair enough. That's probably a more likely interpretation.

Patrick
Opal | Level 21

Looking into your data it appears that the data for some of the observations are distributed over two lines. 

If - and only IF - you can be 100% sure that the source data for an observation always has the same number of delimiters then you could use the FLOWOVER default as there SAS will just read from the next line of source data until it used up all the variables listed in the input statement.

 

Below code mainly generated using the EG import wizard with only a few amendments to INFILE statement options.

DATA WORK.test;
    LENGTH
        PAYER_PAYEE_ID   $ 9
        PAYER_PAYEE_TYPE $ 14
        RCPT_DISB_TYPE   $ 4
        NAME             $ 46
        PO_BOX           $ 50
        STREET           $ 25
        CITY             $ 16
        STATE            $ 6
        ZIP                8
        TYPE_OR_CLASS    $ 19
        ITEMIZED           8
        NON_ITEMIZED       8
        TOTAL              8
        RPT_ID             8 ;
    FORMAT
        PAYER_PAYEE_ID   $CHAR9.
        PAYER_PAYEE_TYPE $CHAR14.
        RCPT_DISB_TYPE   $CHAR4.
        NAME             $CHAR46.
        PO_BOX           $CHAR50.
        STREET           $CHAR25.
        CITY             $CHAR16.
        STATE            $CHAR6.
        ZIP              BEST6.
        TYPE_OR_CLASS    $CHAR19.
        ITEMIZED         BEST6.
        NON_ITEMIZED     BEST4.
        TOTAL            BEST6.
        RPT_ID           BEST6. ;
    INFORMAT
        PAYER_PAYEE_ID   $CHAR9.
        PAYER_PAYEE_TYPE $CHAR14.
        RCPT_DISB_TYPE   $CHAR4.
        NAME             $CHAR46.
        PO_BOX           $CHAR50.
        STREET           $CHAR25.
        CITY             $CHAR16.
        STATE            $CHAR6.
        ZIP              BEST6.
        TYPE_OR_CLASS    $CHAR19.
        ITEMIZED         BEST6.
        NON_ITEMIZED     BEST4.
        TOTAL            BEST6.
        RPT_ID           BEST6. ;
    INFILE '~/test/test.txt'
        LRECL=256
        firstobs=2
        DLM='|'
        flowover
        DSD 
        termstr=CRLF
        ;
    INPUT
        PAYER_PAYEE_ID   : $CHAR9.
        PAYER_PAYEE_TYPE : $CHAR14.
        RCPT_DISB_TYPE   : $CHAR4.
        NAME             : $CHAR46.
        PO_BOX           : $CHAR50.
        STREET           : $CHAR25.
        CITY             : $CHAR16.
        STATE            : $CHAR6.
        ZIP              : ?? BEST6.
        TYPE_OR_CLASS    : $CHAR19.
        ITEMIZED         : ?? BEST6.
        NON_ITEMIZED     : ?? BEST4.
        TOTAL            : ?? BEST6.
        RPT_ID           : ?? BEST6. ;
RUN;

Capture.JPG

Tom
Super User Tom
Super User

The FLOWOVER option will only work for line breaks that are inserted at the beginning or end of a field.  If the break is in the middle of a field then it will cause that field to look like two fields and SAS will start reading the following values into the wrong variables.

Kurt_Bremser
Super User

You can do it by manipulating the _INFILE_ automatic variable:

data test1;
length
  PAYER_PAYEE_ID   $ 9
  PAYER_PAYEE_TYPE $ 14
  RCPT_DISB_TYPE   $ 4
  NAME             $ 46
  PO_BOX           $ 50
  STREET           $ 25
  CITY             $ 16
  STATE            $ 6
  ZIP              $ 5
  TYPE_OR_CLASS    $ 19
  ITEMIZED           8
  NON_ITEMIZED       8
  TOTAL              8
  RPT_ID             8
;
infile '$HOME/sascommunity/test.txt' dlm='|' dsd firstobs=2;
length inline $1000;
input@;
if countc(_infile_,'|') lt 13
then do;
  inline = _infile_;
  input;
  input@;
  _infile_ = strip(inline) !! _infile_;
end;
input @1
  PAYER_PAYEE_ID
  PAYER_PAYEE_TYPE
  RCPT_DISB_TYPE
  NAME
  PO_BOX
  STREET
  CITY
  STATE
  ZIP
  TYPE_OR_CLASS
  ITEMIZED
  NON_ITEMIZED
  TOTAL
  RPT_ID
;
drop inline;
run;
Patrick
Opal | Level 21

@Kurt_Bremser 

If you look at the source data as attached...

Capture.JPG

...then I believe FLOWOVER is what's required.

 

If one can't rely that there is always a constant number of delimiters per output observation then things would become much more complicated and I guess the only way to read the data would be to somehow analyse the input buffer first to then decide which variables to map against it and when to write a new observation. 

Kurt_Bremser
Super User

Flowover causes a skip in columns, as the linefeed is also taken for a delimiter.

I ran this:

DATA WORK.test;
    LENGTH
        PAYER_PAYEE_ID   $ 9
        PAYER_PAYEE_TYPE $ 14
        RCPT_DISB_TYPE   $ 4
        NAME             $ 46
        PO_BOX           $ 50
        STREET           $ 25
        CITY             $ 16
        STATE            $ 6
        ZIP                8
        TYPE_OR_CLASS    $ 19
        ITEMIZED           8
        NON_ITEMIZED       8
        TOTAL              8
        RPT_ID             8 ;
    FORMAT
        PAYER_PAYEE_ID   $CHAR9.
        PAYER_PAYEE_TYPE $CHAR14.
        RCPT_DISB_TYPE   $CHAR4.
        NAME             $CHAR46.
        PO_BOX           $CHAR50.
        STREET           $CHAR25.
        CITY             $CHAR16.
        STATE            $CHAR6.
        ZIP              BEST6.
        TYPE_OR_CLASS    $CHAR19.
        ITEMIZED         BEST6.
        NON_ITEMIZED     BEST4.
        TOTAL            BEST6.
        RPT_ID           BEST6. ;
    INFORMAT
        PAYER_PAYEE_ID   $CHAR9.
        PAYER_PAYEE_TYPE $CHAR14.
        RCPT_DISB_TYPE   $CHAR4.
        NAME             $CHAR46.
        PO_BOX           $CHAR50.
        STREET           $CHAR25.
        CITY             $CHAR16.
        STATE            $CHAR6.
        ZIP              BEST6.
        TYPE_OR_CLASS    $CHAR19.
        ITEMIZED         BEST6.
        NON_ITEMIZED     BEST4.
        TOTAL            BEST6.
        RPT_ID           BEST6. ;
    INFILE '$HOME/sascommunity/test.txt'
        LRECL=256
        firstobs=2
        DLM='|'
        flowover
        DSD 
        termstr=CRLF
        ;
    INPUT
        PAYER_PAYEE_ID   : $CHAR9.
        PAYER_PAYEE_TYPE : $CHAR14.
        RCPT_DISB_TYPE   : $CHAR4.
        NAME             : $CHAR46.
        PO_BOX           : $CHAR50.
        STREET           : $CHAR25.
        CITY             : $CHAR16.
        STATE            : $CHAR6.
        ZIP              : ?? BEST6.
        TYPE_OR_CLASS    : $CHAR19.
        ITEMIZED         : ?? BEST6.
        NON_ITEMIZED     : ?? BEST4.
        TOTAL            : ?? BEST6.
        RPT_ID           : ?? BEST6. ;
RUN;

data test_k;
length
  PAYER_PAYEE_ID   $ 9
  PAYER_PAYEE_TYPE $ 14
  RCPT_DISB_TYPE   $ 4
  NAME             $ 46
  PO_BOX           $ 50
  STREET           $ 25
  CITY             $ 16
  STATE            $ 6
  ZIP              $ 5
  TYPE_OR_CLASS    $ 19
  ITEMIZED           8
  NON_ITEMIZED       8
  TOTAL              8
  RPT_ID             8
;
infile '$HOME/sascommunity/test.txt' dlm='|' dsd firstobs=2 termstr=CRLF;
length inline $1000;
input@;
if countc(_infile_,'|') lt 13
then do;
  inline = _infile_;
  input;
  input@;
  _infile_ = strip(inline) !! _infile_;
end;
input @1
  PAYER_PAYEE_ID
  PAYER_PAYEE_TYPE
  RCPT_DISB_TYPE
  NAME
  PO_BOX
  STREET
  CITY
  STATE
  ZIP
  TYPE_OR_CLASS
  ITEMIZED
  NON_ITEMIZED
  TOTAL
  RPT_ID
;
drop inline;
run;

proc print data=test noobs;
run;

proc print data=test_k noobs;
run;

and got this:

                                                                                               08:29 Thursday, February 13, 2020   1

                           RCPT_
PAYER_     PAYER_PAYEE_    DISB_
PAYEE_ID   TYPE            TYPE   NAME                                            PO_BOX

5369       1001            1201   NATIONAL ELEVATOR INDUSTRY EDUCATIONAL PROGRAM                                                    
5370       1001            1201   NATIONAL ELEVATOR INDUSTRY BENEFIT PLANS                                                          
5391       1002            501    MYERS WEINBERG, LLP                                                                               
5392       1002            501    DANIEL A. WINTERS & COMPANY                                                                       
5393       1002            501    O'DONOGHUE & O'DONOGHUE                                                                           
5410       1002            504    MICROSEARCH CORPORATION                                                                           
5411       1002            504    NETIFICE COMMUNICATIONS                                                                           
5417       1002            504    DANIEL A. WINTERS & COMPANY                                                                       

 
                                                                                                          NON_
STREET                       CITY                STATE        ZIP    TYPE_OR_CLASS          ITEMIZED    ITEMIZED     TOTAL    RPT_ID

11 LARSEN WAY                ATTLEBORO FALLS     MA          2763    APPRENTICE FUND           6274          0        6274    248665
19 CAMPUS BLVD.              SUITE 200           NEWTOW         .    19073                        .          0       36300     36300
240 GRAHAM AVE               WINNIPEG, CANADA                   .    LEGAL                    30350       1305       31655    248665
6 DICKINSON DRIVE            SUITE 205           CHADDS         .    19317                        .          0        8428      8428
4748 WISCONSIN AVE., N.W.    WASHINGTON          DC         20016    LEGAL                   381709       5600      387309    248665
999 BROADWAY                 SUITE 101           SAUGUS         .    01906                        .          0       49776     49776
DEPARTMENT LA 22231          PASADENA            CA         91185    T1 SERVICE PROVIDER          0       8652        8652    248665
6 DICKINSON DRIVE            SUITE 205           CHADDS         .    19317                        .       16E4           0    162368
                                                                                               08:29 Thursday, February 13, 2020   2

                    PAYER_    RCPT_
         PAYER_     PAYEE_    DISB_
        PAYEE_ID     TYPE     TYPE     NAME                                              PO_BOX    STREET

          5369       1001     1201     NATIONAL ELEVATOR INDUSTRY EDUCATIONAL PROGRAM              11 LARSEN WAY            
          5370       1001     1201     NATIONAL ELEVATOR INDUSTRY BENEFIT PLANS                    19 CAMPUS BLVD.SUITE 200 
          5391       1002     501      MYERS WEINBERG, LLP                                         240 GRAHAM AVE           
          5392       1002     501      DANIEL A. WINTERS & COMPANY                                 6 DICKINSON DRIVESUITE 20
          5393       1002     501      O'DONOGHUE & O'DONOGHUE                                     4748 WISCONSIN AVE., N.W.
          5410       1002     504      MICROSEARCH CORPORATION                                     999 BROADWAYSUITE 101    
          5411       1002     504      NETIFICE COMMUNICATIONS                                     DEPARTMENT LA 22231      
          5417       1002     504      DANIEL A. WINTERS & COMPANY                                 6 DICKINSON DRIVESUITE 20

 
                                                                                   NON_
        CITY                STATE     ZIP     TYPE_OR_CLASS          ITEMIZED    ITEMIZED     TOTAL    RPT_ID

        ATTLEBORO FALLS      MA      02763    APPRENTICE FUND           6274           0       6274    248665
        NEWTOWN SQUARE       PA      19073    PENSION & HEALTH BE          0       36300      36300    248665
        WINNIPEG, CANADA                      LEGAL                    30350        1305      31655    248665
        CHADDS FORD          PA      19317    AUDITORS                     0        8428       8428    248665
        WASHINGTON           DC      20016    LEGAL                   381709        5600     387309    248665
        SAUGUS               MA      01906    COMPUTER CONSULTING          0       49776      49776    248665
        PASADENA             CA      91185    T1 SERVICE PROVIDER          0        8652       8652    248665
        CHADDS FORD          PA      19317    AUDITORS                162368           0     162368    248665

One just has to look at the last column to see the effect.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1274 views
  • 6 likes
  • 5 in conversation