BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Emma8
Quartz | Level 8

I would like to import text (attached example data). Id and text are 2 variables. There are spaces before id, so I could not import successfully into SAS (the actual data have about 5000 rows). Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You really need to find the process that is writing this file and fix that.

Here is method that appears to work for your example file.

Basically the first step removes the header line and all line breaks from the original file and inserts line breaks after the pipe characters.

Then the second step reads the lines and pulls the ID value from the end of the line.  It uses the LAG() function to move the id from the end of the previous line to the text on the current line.

filename copy temp;
data _null_;
  infile "c:\downloads\text_w2530_15.txt" recfm=f lrecl=1 end=eof;
  file copy ;
  * Get rid of "header" line ;
  do until(x='0A'x); input x $char1.; end;
  do while(not eof);
    do until(x='|'); input x $char1.; if x not in ('0A'x,'0D'x) then put x $char1. @; end;
    put;
  end;
run;

data want;
  length nextid id $30 text $1500 ;
  infile copy ;
  input ;
  text = _infile_;
  if substr(text,length(text))='|' then do;
    nextid = scan(scan(text,-1,' '),1,'|');
    text = substr(text,1,length(text)-lengthn(nextid)-2);
  end;
  id = lag(nextid);
  if _n_ > 1 then output;
  drop nextid ;
run;

You can then easily write it back out as a delimited file than can actually be read.

data _null_;
  file 'c:\downloads\fixed.txt' dsd dlm='|';
  set want;
  put id text;
run;

So you appear to have 17 "lines" in that original file.  Just so you can see it here are the lines truncated to just 50 bytes.

337   data _null_;
338     infile  'c:\downloads\fixed.txt' truncover;
339     input line $char50.;
340     put line $char50.;
341   run;

NOTE: The infile 'c:\downloads\fixed.txt' is:
      Filename=c:\downloads\fixed.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=5562,
      Last Modified=11Nov2020:17:20:57,
      Create Time=11Nov2020:17:20:57

BC51140-00|9/4/15, BILATERAL MAMMOGRAPHY AND LEFT
BC151142-00|
BC251175-00|9/1/2015 (WHC) CT ANGIO (EVAL SOB)  NO
BC351181-00|9/16/2015 (WHC) DX MAMMO/US: 7CM LOBUL
BC751182-00|7/22/15, MRI: MASS WITHIN SEGMENT VII
BC7851184-00|
BCG51185-00|08/05/15 PERIPHERAL BLOOD SMEAR: CHRON
BCT51186-00|SKELETAL SURVEY 8/13/15: DIFFUSE OSTEO
BCTY51187-00|05/27/2015-UNK LOCATION - FLEXIBLE SI
BC0951188-00|- 09/14/2015-WHC - MRI RT UPPER EXTRM
BC56651190-00|
BC56751192-00|CT A/P 9/17/15 -  ENDOMETRIAL MASS W
BC56hu51196-00|09/04/2015-WHC-CT ABD/PELVIS-2.5CM
BC0955237-02|NONE
BCV55391-00|03/23/2015 (SMH) CT ABDOMEN/PELVIS: PA
BCtgy55402-00|02/12/2015: LEFT BREAST US- 1.5 CM I
BCdf55403-00|IMPRESSION 1. RESPONSE TO TREATMENT W
NOTE: 17 records were read from the infile 'c:\downloads\fixed.txt'.
      The minimum record length was 12.
      The maximum record length was 915.

View solution in original post

7 REPLIES 7
Tom
Super User Tom
Super User

How did you try to READ the data (you don't need to "IMPORT" a text file)?

The file you posted as 4 lines of data and one header line.

NOTE: 5 records were read from the infile "c:\downloads\text_w2530_15.txt".
      The minimum record length was 7.
      The maximum record length was 14508.

That second field is very long for a SAS variable. The maximum length of a character variable in SAS is 32K bytes.

 

None of the lines start with spaces.  None of the values of TEXT start with spaces (other than the one that is totally empty).  here are the first 50 characters of each line:

47    data _null_;
48      infile "c:\downloads\text_w2530_15.txt" lrecl=50 ;
49      input;
50      list;
51    run;

NOTE: A byte-order mark in the file "c:\downloads\text_w2530_15.txt" (for fileref "#LN00064") indicates
      that the data is encoded in "utf-8".  This encoding will be used to process the file.
NOTE: The infile "c:\downloads\text_w2530_15.txt" is:
      Filename=c:\downloads\text_w2530_15.txt,
      RECFM=V,LRECL=200,File Size (bytes)=17564,
      Last Modified=11Nov2020:16:00:30,
      Create Time=11Nov2020:16:00:29

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+--
1         ID|Text 7
2         BC51140-00|9/4/15, BILATERAL MAMMOGRAPHY AND LEFT
3         BC151142-00|
4         BC251175-00|9/1/2015 (WHC) CT ANGIO (EVAL SOB)  NO
5         BC351181-00|9/16/2015 (WHC) DX MAMMO/US: 7CM LOBUL
NOTE: 5 records were read from the infile "c:\downloads\text_w2530_15.txt".
      The minimum record length was 7.
      The maximum record length was 50.
      One or more lines were truncated.
Emma8
Quartz | Level 8
I could not by read or import.
There are about 17 observations:
first id=BC51140-00
and the last id=BCdf55403-00

If you scroll right, then you would see how the ID-s have spaced to the right.
Tom
Super User Tom
Super User

Where did you get the file?  Have you attempted to edit it in any way?  If so do you still have the original?

How was it created?

Emma8
Quartz | Level 8
No, those spaced id-s are so many, that no manual cleaning is possible. See
id=BC351181-00 then the next id does not start from a new line, etc. all sequential id-s are not started from a line.
Tom
Super User Tom
Super User

You really need to find the process that is writing this file and fix that.

Here is method that appears to work for your example file.

Basically the first step removes the header line and all line breaks from the original file and inserts line breaks after the pipe characters.

Then the second step reads the lines and pulls the ID value from the end of the line.  It uses the LAG() function to move the id from the end of the previous line to the text on the current line.

filename copy temp;
data _null_;
  infile "c:\downloads\text_w2530_15.txt" recfm=f lrecl=1 end=eof;
  file copy ;
  * Get rid of "header" line ;
  do until(x='0A'x); input x $char1.; end;
  do while(not eof);
    do until(x='|'); input x $char1.; if x not in ('0A'x,'0D'x) then put x $char1. @; end;
    put;
  end;
run;

data want;
  length nextid id $30 text $1500 ;
  infile copy ;
  input ;
  text = _infile_;
  if substr(text,length(text))='|' then do;
    nextid = scan(scan(text,-1,' '),1,'|');
    text = substr(text,1,length(text)-lengthn(nextid)-2);
  end;
  id = lag(nextid);
  if _n_ > 1 then output;
  drop nextid ;
run;

You can then easily write it back out as a delimited file than can actually be read.

data _null_;
  file 'c:\downloads\fixed.txt' dsd dlm='|';
  set want;
  put id text;
run;

So you appear to have 17 "lines" in that original file.  Just so you can see it here are the lines truncated to just 50 bytes.

337   data _null_;
338     infile  'c:\downloads\fixed.txt' truncover;
339     input line $char50.;
340     put line $char50.;
341   run;

NOTE: The infile 'c:\downloads\fixed.txt' is:
      Filename=c:\downloads\fixed.txt,
      RECFM=V,LRECL=32767,File Size (bytes)=5562,
      Last Modified=11Nov2020:17:20:57,
      Create Time=11Nov2020:17:20:57

BC51140-00|9/4/15, BILATERAL MAMMOGRAPHY AND LEFT
BC151142-00|
BC251175-00|9/1/2015 (WHC) CT ANGIO (EVAL SOB)  NO
BC351181-00|9/16/2015 (WHC) DX MAMMO/US: 7CM LOBUL
BC751182-00|7/22/15, MRI: MASS WITHIN SEGMENT VII
BC7851184-00|
BCG51185-00|08/05/15 PERIPHERAL BLOOD SMEAR: CHRON
BCT51186-00|SKELETAL SURVEY 8/13/15: DIFFUSE OSTEO
BCTY51187-00|05/27/2015-UNK LOCATION - FLEXIBLE SI
BC0951188-00|- 09/14/2015-WHC - MRI RT UPPER EXTRM
BC56651190-00|
BC56751192-00|CT A/P 9/17/15 -  ENDOMETRIAL MASS W
BC56hu51196-00|09/04/2015-WHC-CT ABD/PELVIS-2.5CM
BC0955237-02|NONE
BCV55391-00|03/23/2015 (SMH) CT ABDOMEN/PELVIS: PA
BCtgy55402-00|02/12/2015: LEFT BREAST US- 1.5 CM I
BCdf55403-00|IMPRESSION 1. RESPONSE TO TREATMENT W
NOTE: 17 records were read from the infile 'c:\downloads\fixed.txt'.
      The minimum record length was 12.
      The maximum record length was 915.
Emma8
Quartz | Level 8
Thank you so much!
Tom
Super User Tom
Super User

Something is messed up with the last line in that file.  Why is it so much longer than the others? Why does it have multiple pipe characters?

data check;
  infile "c:\downloads\text_w2530_15.txt" length=ll;
  input;
  line+1;
  line_length=ll;
  length=lengthn(_infile_);
  pipes=countc(_infile_,'|');
run;

proc print;
run;
                line_
Obs    line    length    length    pipes

 1       1          7         7       1
 2       2       1012       638       1
 3       3       1013        12       1
 4       4       1013       514       1
 5       5      14508     14024      14

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 904 views
  • 1 like
  • 2 in conversation