Re: correctly reading free text containing paragraphs and delimiter wi...

shc0 · Posted 11-19-2021 05:40 PM

I have a "|" delimited text dataset. One of the fields is free text from doctor notes, which causes issues when trying to read the data into SAS. There are no quotation marks to mark the start and end of the text field. The field may contain either the delimiter itself (creating a false field) or paragraphs (creating false rows). Here is an example:

ID|VAR1|VAR2|VAR3|VAR4

1|1|0|abdebeabe|abc

2|0|0|abdebe|abe|bcd

3|0|1|abedce

dfekld|bcd

4|0|1|abdfkdlaj||dklfjsle|ndw

5|1|0|abedce

dfekld

dafhdj|wer

6|0|0|abdebeabe|lkm

In this example, VAR3 contains free text. ID 2 and 4 would create extra fields. ID 3 and 5 would create extra rows.

The dataset itself contains thousands of records, so I'd like to have a streamlined process that will correctly read these entries as one record and one field.

At this point, I have syntax that would identifying problematic lines by tracking the number of times the delimiter appears in a row, but not how to make corrections without manually doing it by hand.

Has any one done that before? Thanks in advance!

Tom · Posted 11-19-2021 10:31 PM

The first thing to check is if the embedded end of lines are the same as the real end of lines. If you are lucky then the real end of lines will be using CR+LF and the embedded characters will be either single LF or single CR. What would also work is if the real end of line is LF and the embedded characters are just CR.

Try your analysis program using TERMSTR=CRLF and see if that fixes the problem.

Extra delimiters will make the problem impossible to fix with a program to 100% certainty.

Can you narrow the problem down? Is there just one field that has the free text? Or are their multiple fields that can have embedded delimiters and/or embedded end of line characters?

Tom · Posted 11-19-2021 10:39 PM

If you can eliminate the extra end of lines you can then work on fixing the extra delimiters separately.

Is there some pattern to the first field (or first 2 or 3) that will make it possible to detect the start of a new record by just looking at the beginning of the line?

For example in your example the real lines start with a numeric field. You could try reading the first field and checking if it is a number. Or even better if it is a number in a valid range. Or even better still a number from a fixed set of know values.

For example this code will start a new line when the first field on the line is an integer between 1 and 6.

filename in 'original file.txt';
filename out 'fixed file.txt';
data _null_;
  infile in truncover dsd dlm='|';
  file out;
  input id ?? @;
  if _n_>1 then do;
    if id in (1:6) then put ;
    else put ' ' @;
  end;
  put _infile_ @;
run;

Result

ID|VAR1|VAR2|VAR3|VAR4
1|1|0|abdebeabe|abc
2|0|0|abdebe|abe|bcd
3|0|1|abedce dfekld|bcd
4|0|1|abdfkdlaj||dklfjsle|ndw
5|1|0|abedce dfekld dafhdj|wer
6|0|0|abdebeabe|lkm

Ksharp · Posted 11-20-2021 08:14 AM

You need to know the pattern which start a new line .

data have;
infile 'c:\temp\have.txt' lrecl=1000 length=len firstobs=2;
input have $varying1000. len;
if prxmatch('/^\d+\|/',strip(have)) then group+1;
run;
data want;
length want $ 2000;
do until(last.group);
 set have;
 by group;
 want=cats(want,have);
end;

ID=scan(want,1,'|');
VAR1=scan(want,2,'|');
VAR2=scan(want,3,'|');
VAR3=scan(want,4,'|');
VAR4=scan(want,5,'|');
drop have;
run;

correctly reading free text containing paragraphs and delimiter within delimited datafile

Re: correctly reading free text containing paragraphs and delimiter within delimited datafile

Re: correctly reading free text containing paragraphs and delimiter within delimited datafile

Re: correctly reading free text containing paragraphs and delimiter within delimited datafile

Registration is open

correctly reading free text containing paragraphs and delimiter within delimited datafile

Re: correctly reading free text containing paragraphs and delimiter within delimited datafile

Re: correctly reading free text containing paragraphs and delimiter within delimited datafile

Re: correctly reading free text containing paragraphs and delimiter within delimited datafile

Registration is open

SAS Training: Just a Click Away