Solved: Re: pipe delimietd file with no specific layout

SAS Training: Just a Click Away

Catch up on SAS Innovate 2026

Tal · Posted 05-03-2017 01:45 PM

Hi,

So i got a wierd file with a lot of comments in it .

I am sending you a sample.

They did send me the layout but with comments is it still possible to import a file ?

I mean i'd understand if the comments are part of the layout but this looks very weird.

Any hints pls?

or should i ask them to clean it first?

Thx

Tom · Posted 05-04-2017 12:15 PM

So that file does not have the extra commas, but it still has the embedded line breaks created using CR+LF and so there is no way to exactly tell that the line break is part of the NOTES field and not the actual end of the record.

You can see the binary codes for the characters in the file you can try reading the file as FIXED length records.

data _null_;
  infile "&path\temp.txt" lrecl=60 recfm=f ;
  input;
  list;
run;

You could try editing by hand using a text editor, but if you have a large file that can be a pain. To "fix" the file you could first try using the code that I posted before. Basically it looks at the line and if it has a "lot" of pipes then it assumes it is the start of a new record. Otherwise it writes two blanks to replace the existing line break.

In the code below I just asked SAS to make a temporary file to write the joined lines into, but you could save the result to a new file on your disk instead.

filename dat temp ;
data _null_;
  infile "&path/temp.txt" end=eof;
  file dat lrecl=32767;
  input ;
  if _n_>1 then do;
    if countw(_infile_,'|','mq') > 3 then put / @;
    else put '  ' @;
  end;
  put _infile_ @;
  if eof then put;
run;

Then just read it as a normal pipe delimited file.

View solution in original post

Tom · Posted 05-03-2017 02:24 PM

Looks like the file is actually delimited with comma. There are 24 fields on each row if you just use comma as the delimiter.

Perhaps you can read the fields that have pipes into strings and use SCAN() function to parse out the individual pipe delimited values?

mkeintz · Posted 05-03-2017 02:32 PM

Your sample file has a set of lines, each with 23 commas - implying 24 fields. In turn, for some of those records the first comma-separated field has 20 pipe delimiters. All the other records have no pipe symbols at all. I guess the "pipe-less" lines are all comments. If so then this program should give the outline you need.

The first INPUT statement reads no variables put does fill the automatic variable _INFILE_ with the input line contents. Then assign IX to the first column containing a pipe. If ix=0 there are no pipes and the observation should be deleted. But if there are, then re-read the data, starting at column 1 ("INPUT @1") all the variables. Of course, you should use the format you need for the 21 variables to be read. I just used ":$20." as an example.

data want (drop=ix);
  infile datalines dlm='|';
  input @;
  ix=indexc(_infile_,'|');
  if ix=0 then delete;
  input @1 (col1-col21) (:$15.);
datalines;
2004-04-15|666623|1000000044|CARP|xxxxxx|Active||||OutPhone|Phone Call|R|25000|1.0|L5J4K4|730697|730698|1979-09-06|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
2004-04-15|123456|6789956756|CARP|xxx|Lapsed|||yyyyyyy|OutPhone|Phone Call|P|100000|5.95|L5L5M1|730703|730704|1976-01-08|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
.................................................,,,,,,,,,,,,,,,,,,,,,,,
? R4 policy created for T10 R3 pending application,,,,,,,,,,,,,,,,,,,,,,,
? Please refer to closed R3 policy for application answers - ABC11014263 ,,,,,,,,,,,,,,,,,,,,,,,
? Any certificate reprint requests are to be sent to the Product Group for ,,,,,,,,,,,,,,,,,,,,,,,
review,,,,,,,,,,,,,,,,,,,,,,,
? Do not request any certificate reprints through Centtric |2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
1996-06-21|36|10000357|AD|vvvv|Lapsed|||yyyy|InPhone|uuuuu|R|400000|0.0||22742|0|||12/22/06 10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,,,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-04-10
,,,,,,,,,,,,,,,,,,,,,,,
1996-06-21|36|10000357|AD|xxxx|Lapsed|||yyyy|InPhone|xxxxx|P|500000|49.85|L8L 5B8|22742|0|1970-04-23|F|12/22/06 10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,,,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-04-10
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tal · Posted 05-03-2017 03:59 PM

thanks ,

when you say "all the other records have no pipe" are you referring to the comments after the second record?

And they want the comments imported too which does not make any sense. Yes there are 21 fields but how do you see the 23 commas 🙂 just curious

this is one of the comments in the huge original file which is right between the 2nd and 3rd record but also i noticed the value of the 20th field which is a string has some wierd data . The 23 commas u showing in your respond are actually part of the 20th field

Tom · Posted 05-03-2017 04:17 PM

The commas are in the file you posted.

If they are not in the file you are looking at then something has happened to make a change. Did you perhaps read it into Excel and save it as a CSV file?

28   data counts;
29     row+1;
30     infile "&path/pipedelim.csv" termstr=crlf;
31     input ;
32     nwords1=countw(_infile_,'|','mq');
33     nwords2=countw(_infile_,',','mq');
34     startq=_infile_=: '?';
35     if _n_ <= 10 then list;
36   run;

NOTE: The infile "...pipedelim.csv" is:
      Filename=...pipedelim.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=1394,
      Last Modified=03May2017:14:16:30,
      Create Time=03May2017:14:16:30

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--
1         2004-04-15|666623|1000000044|CARP|xxxxxx|Active||||OutPhone|Phone Call|R|25000|1.0|L5J4K4|73
      93  0697|730698|1979-09-06|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,, 151
2         2004-04-15|123456|6789956756|CARP|xxx|Lapsed|||yyyyyyy|OutPhone|Phone Call|P|100000|5.95|L5L
      93  5M1|730703|730704|1976-01-08|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,, 157
3         .................................................,,,,,,,,,,,,,,,,,,,,,,, 72
4         ? R4 policy created for T10 R3 pending application,,,,,,,,,,,,,,,,,,,,,,, 73
5         ? Please refer to closed R3 policy for application answers - ABC11014263 ,,,,,,,,,,,,,,,,,,,
      93  ,,,, 96
6         ? Any certificate reprint requests are to be sent to the Product Group for ,,,,,,,,,,,,,,,,,
      93  ,,,,,, 98
7         review,,,,,,,,,,,,,,,,,,,,,,, 29
8         ? Do not request any certificate reprints through Centtric |2017-04-10,,,,,,,,,,,,,,,,,,,,,,
      93  , 93
9         1996-06-21|36|10000357|AD|vvvv|Lapsed|||yyyy|InPhone|uuuuu|R|400000|0.0||22742|0|||12/22/06
      93  10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,,
     185  ,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-0
     277  4-10 280
10        ,,,,,,,,,,,,,,,,,,,,,,, 23
NOTE: 11 records were read from the infile "...pipedelim.csv".
      The minimum record length was 23.
      The maximum record length was 300.

If looks like the comment lines have fewer pipes that then other lines. But at least one of them does have | characters. Is that just part of the comment?

data want (drop=comment) comments(keep=row comment);
   infile "pipedelim.dat" dsd dlm='|' truncover ;
   row+1;
   input @;
   if countw(_infile_,'|','mq') < 21 then do;
     input comment $300.;
     output comments;
   end;
  else do;
     length var1 var2 8;
     informat var1 yymmdd.;
     format var1 yymmdd10.;
     input var1 var2 ;
     output want;
   end;
run;

If you could post all or part of the specifications for the fields? That might help make sense of it.

Do you have any control over the generation of this file? Are lines 3 to 8 intended to be comments on line 2? Can you not have them put onto the same line instead of making new lines?

Tal · Posted 05-03-2017 04:56 PM

it was a csv file they gave me and i am reading it in as such but they only gave me the names of the fields,nothing more so we came up with the specifications here 🙂

input

APPLICATIONDATE :$10.

APPLICATIONNUM :$10.

CERTNUM :$10.

PLANTYPE :$2.

CREATEDBY :$30.

POLICY_STATUS :$15.

POLICY_DETAILED_STATUS :$15.

POLICY_EVENT :$15.

AGENT_ID :$30.

ENROLMENT :$15.

STIMULUS :$30.

OWNER :$3.

PRIMARYFACE :$10.

PREMIUM :$10.

POSTAL_CODE :$7.

CUSTOMERCID :$15.

INSURED2CID :$15.

DOB :$10.

GENDER :$2.

NOTES :$500.

DATE_STAMP :$10.;

run; ;

and here i am attaching a record of this csv file where the 20th field goes into 3 rows and SAS reads the data in the second row as a new field(1st field) and the data in the 3rd row as 2nd field of a new record... and the 20th field is pipe delimited just like the others right?

So no clue why SAS is treating the value of this 20th field like new fields

Tal · Posted 05-03-2017 04:59 PM

yes lines 3-8 are a comment and that comment is not a part of any record .

They have a bunch of different comments throughout this file and now they say they want it imported too???

the comments are not even pipe delimited...

Tom · Posted 05-03-2017 05:36 PM

You need to figure out if those extra commas are really there or not. But if when you look at the real file you still cannot tell the end of lines from the embedded line breaks then you can try using something like this to make a copy of the file with each record on one line.

filename dat temp ;
data _null_;
  infile "&path/pipedelim.csv" end=eof;
  file dat lrecl=32767;
  input ;
  if _n_>1 then do;
    if countw(_infile_,'|','mq') > 2 then put / @;
    else put '  ' @;
  end;
  put _infile_ @;
  if eof then put;
run;

But in your example there is some inconsistency in the number of PIPE characters. So for the second record the NOTES column gets shifted one column to the right. So you could try just reading in some extra columns and then trying to figure out which are the notes and which are the DATE_STAMP.

data want ;
  infile dat dsd dlm='|' truncover termstr=crlf;
length
  APPLICATIONDATE $10 
  APPLICATIONNUM $10 
  CERTNUM $10 
  PLANTYPE $2 
  CREATEDBY $30 
  POLICY_STATUS $15 
  POLICY_DETAILED_STATUS $15 
  POLICY_EVENT $15 
  AGENT_ID $30 
  ENROLMENT $15 
  STIMULUS $30 
  OWNER $3 
  PRIMARYFACE $10 
  PREMIUM $10 
  POSTAL_CODE $7 
  CUSTOMERCID $15 
  INSURED2CID $15 
  DOB $10 
  GENDER $2 
  X1-X3 $500
  NOTES $500 
  DATE_STAMP $10 
;
  input APPLICATIONDATE -- DATE_STAMP;
run;

Tom · Posted 05-03-2017 05:02 PM

If you type a CSV file to the screen it would NOT look like the photograph you posted. Did you open the CSV file in Excel? If so then Excel has already potentially changed the file. If you do not have another text editor then just use SAS to read the file and look at what is in it. I posted an example using a DATA step to do this.

If they want to have line breaks within a line of the file then the actual end-of-line for the file needs to be different. For example in Excel you cna enter line breaks into a cell but when you save that as a CSV file the end-of-line is the two character sequence CR+LF ('0D0A'x) and the embedded line breaks are normally just CR ('0D'x). So in that case you can read that with SAS by using the TERMSTR=CRLF option on the INFILE statement.

Tal · Posted 05-03-2017 08:53 PM

Ahh i see what you asking me. Yes, it was opened in excel n then saved as a csv file and given to me like it. I see now the 23 commas. If they want the comment i guess i can assign each line of it to the first variable so a comment of 3 lines would go to 3 separate records but what worries me are some records that have more than 1 column. Original csv file has only one column and the entire record embeded in it right?
But i see records where some of the value of "notes" is going to another column along with the 21th field so when SAS reads in the file assignes the value of that second column to a new line( into var1 n var2)
U think termstr=crlf would take care of that issue?

Thx for your insights Tom

Tom · Posted 05-03-2017 09:25 PM

The file is clearly intended to be pipe delimited and not comma delimited. It is the inserted end of lines in the middle of the strings that is causing the trouble. If you had the original file before it was corrupted by Excel you might have a better chance at either being able to read it directly. Or of being able to convert the split lines back into single lines. For example the original file might have had just CR in the middle of the NOTES field and CRLF at the end of the lines. Or it might have had quotes around the NOTES field that would make it easier to tell when there are embedded end of lines that need to be removed.

Tal · Posted 05-03-2017 10:54 PM

Thx
Cant i open the file in text editor and remove the cr or crlf though?

Tal · Posted 05-03-2017 10:56 PM

And how do i find those characters?

Tal · Posted 05-04-2017 11:51 AM

so it is pipe delimited but there is no way i can find that splitter in the 20th field (Notes).

Any tricks about it? 🙂

Tom · Posted 05-04-2017 12:15 PM

So that file does not have the extra commas, but it still has the embedded line breaks created using CR+LF and so there is no way to exactly tell that the line break is part of the NOTES field and not the actual end of the record.

You can see the binary codes for the characters in the file you can try reading the file as FIXED length records.

data _null_;
  infile "&path\temp.txt" lrecl=60 recfm=f ;
  input;
  list;
run;

You could try editing by hand using a text editor, but if you have a large file that can be a pain. To "fix" the file you could first try using the code that I posted before. Basically it looks at the line and if it has a "lot" of pipes then it assumes it is the start of a new record. Otherwise it writes two blanks to replace the existing line break.

In the code below I just asked SAS to make a temporary file to write the joined lines into, but you could save the result to a new file on your disk instead.

filename dat temp ;
data _null_;
  infile "&path/temp.txt" end=eof;
  file dat lrecl=32767;
  input ;
  if _n_>1 then do;
    if countw(_infile_,'|','mq') > 3 then put / @;
    else put '  ' @;
  end;
  put _infile_ @;
  if eof then put;
run;

Then just read it as a normal pipe delimited file.

.................................................

? R4 policy created for T10 R3 pending application

? Please refer to closed R3 policy for application answers - ABC11014263

? Any certificate reprint requests are to be sent to the Product Group for

? Do not request any certificate reprints through Centtric |2017-04-10