BookmarkSubscribeRSS Feed
shc0
Calcite | Level 5

I have a "|" delimited text dataset. One of the fields is free text from doctor notes, which causes issues when trying to read the data into SAS. There are no quotation marks to mark the start and end of the text field.  The field may contain either the delimiter itself (creating a false field) or paragraphs (creating false rows). Here is an example:

 

ID|VAR1|VAR2|VAR3|VAR4

1|1|0|abdebeabe|abc

2|0|0|abdebe|abe|bcd

3|0|1|abedce

dfekld|bcd

4|0|1|abdfkdlaj||dklfjsle|ndw

5|1|0|abedce

dfekld

dafhdj|wer

6|0|0|abdebeabe|lkm

 

In this example, VAR3 contains free text. ID 2 and 4 would create extra fields. ID 3 and 5 would create extra rows.

 

The dataset itself contains thousands of records, so I'd like to have a streamlined process that will correctly read these entries as one record and one field.

 

At this point, I have syntax that would identifying problematic lines by tracking the number of times the delimiter appears in a row, but not how to make corrections without manually doing it by hand.

 

Has any one done that before? Thanks in advance!

3 REPLIES 3
Tom
Super User Tom
Super User

The first thing to check is if the embedded end of lines are the same as the real end of lines.  If you are lucky then the real end of lines will be using CR+LF and the embedded characters will be either single LF or single CR.  What would also work is if the real end of line is LF and the embedded characters are just CR.

 

Try your analysis program using TERMSTR=CRLF and see if that fixes the problem.

 

Extra delimiters will make the problem impossible to fix with a program to 100% certainty. 

 

Can you narrow the problem down?  Is there just one field that has the free text? Or are their multiple fields that can have embedded delimiters and/or embedded end of line characters?

Tom
Super User Tom
Super User

If you can eliminate the extra end of lines you can then work on fixing the extra delimiters separately.

 

Is there some pattern to the first field (or first 2 or 3) that will make it possible to detect the start of a new record by just looking at the beginning of the line?

 

For example in your example the real lines start with a numeric field.  You could try reading the first field and checking if it is a number. Or even better if it is a number in a valid range.  Or even better still a number from a fixed set of know values.

 

For example this code will start a new line when the first field on the line is an integer between 1 and 6.

filename in 'original file.txt';
filename out 'fixed file.txt';
data _null_;
  infile in truncover dsd dlm='|';
  file out;
  input id ?? @;
  if _n_>1 then do;
    if id in (1:6) then put ;
    else put ' ' @;
  end;
  put _infile_ @;
run;

Result

ID|VAR1|VAR2|VAR3|VAR4
1|1|0|abdebeabe|abc
2|0|0|abdebe|abe|bcd
3|0|1|abedce dfekld|bcd
4|0|1|abdfkdlaj||dklfjsle|ndw
5|1|0|abedce dfekld dafhdj|wer
6|0|0|abdebeabe|lkm

 

Ksharp
Super User

You need to know the pattern which start a new line .

 

data have;
infile 'c:\temp\have.txt' lrecl=1000 length=len firstobs=2;
input have $varying1000. len;
if prxmatch('/^\d+\|/',strip(have)) then group+1;
run;
data want;
length want $ 2000;
do until(last.group);
 set have;
 by group;
 want=cats(want,have);
end;

ID=scan(want,1,'|');
VAR1=scan(want,2,'|');
VAR2=scan(want,3,'|');
VAR3=scan(want,4,'|');
VAR4=scan(want,5,'|');
drop have;
run;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 3 replies
  • 1069 views
  • 0 likes
  • 3 in conversation