BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Tal
Pyrite | Level 9 Tal
Pyrite | Level 9

Hi,

So i got  a  wierd file  with a lot of comments in it .

I am sending  you a  sample.

They did send me the layout  but with comments is it still  possible to  import  a  file ?

I mean  i'd understand if the comments are part  of the  layout but  this looks very weird.

 

Any hints  pls?

or should i ask them to clean  it first?

Thx

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So that file does not have the extra commas, but it still has the embedded line breaks created using CR+LF and so there is no way to exactly tell that the line break is part of the NOTES field and not the actual end of the record.

You can see the binary codes for the characters in the file you can try reading the file as FIXED length records.

data _null_;
  infile "&path\temp.txt" lrecl=60 recfm=f ;
  input;
  list;
run;

 You could try editing by hand using a text editor, but if you have a large file that can be a pain. To "fix" the file you could first try using the code that I posted before. Basically it looks at the line and if it has a "lot" of pipes then it assumes it is the start of a new record. Otherwise it writes two blanks to replace the existing line break.  

 

In the code below I just asked SAS to make a temporary file to write the joined lines into, but you could save the result to a new file on your disk instead.

filename dat temp ;
data _null_;
  infile "&path/temp.txt" end=eof;
  file dat lrecl=32767;
  input ;
  if _n_>1 then do;
    if countw(_infile_,'|','mq') > 3 then put / @;
    else put '  ' @;
  end;
  put _infile_ @;
  if eof then put;
run;

Then just read it as a normal pipe delimited file.

View solution in original post

22 REPLIES 22
Tom
Super User Tom
Super User

Looks like the file is actually delimited with comma.  There are 24 fields on each row if you just use comma as the delimiter.

Perhaps you can read the fields that have pipes into strings and use SCAN() function to parse out the individual pipe delimited values?

mkeintz
PROC Star

Your sample file has a set of lines, each with 23 commas - implying 24 fields. In turn, for some of those records the first comma-separated field has 20 pipe delimiters.  All the other records have no pipe symbols at all.   I guess the "pipe-less" lines are all comments.  If so then this program should give the outline you need.

 

The first INPUT statement reads no variables put does fill the automatic variable _INFILE_ with the input line contents.  Then assign IX to the first column containing a pipe.  If ix=0 there are no pipes and the observation should be deleted.  But if there are, then re-read the data, starting at column 1 ("INPUT @1") all the variables.  Of course, you should use the format you need for the 21 variables to be read.  I just used ":$20." as an example.

 

 

data want (drop=ix);
  infile datalines dlm='|';
  input @;
  ix=indexc(_infile_,'|');
  if ix=0 then delete;
  input @1 (col1-col21) (:$15.);
datalines;
2004-04-15|666623|1000000044|CARP|xxxxxx|Active||||OutPhone|Phone Call|R|25000|1.0|L5J4K4|730697|730698|1979-09-06|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
2004-04-15|123456|6789956756|CARP|xxx|Lapsed|||yyyyyyy|OutPhone|Phone Call|P|100000|5.95|L5L5M1|730703|730704|1976-01-08|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
.................................................,,,,,,,,,,,,,,,,,,,,,,,
? R4 policy created for T10 R3 pending application,,,,,,,,,,,,,,,,,,,,,,,
? Please refer to closed R3 policy for application answers - ABC11014263 ,,,,,,,,,,,,,,,,,,,,,,,
? Any certificate reprint requests are to be sent to the Product Group for ,,,,,,,,,,,,,,,,,,,,,,,
review,,,,,,,,,,,,,,,,,,,,,,,
? Do not request any certificate reprints through Centtric |2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
1996-06-21|36|10000357|AD|vvvv|Lapsed|||yyyy|InPhone|uuuuu|R|400000|0.0||22742|0|||12/22/06 10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,,,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-04-10
,,,,,,,,,,,,,,,,,,,,,,,
1996-06-21|36|10000357|AD|xxxx|Lapsed|||yyyy|InPhone|xxxxx|P|500000|49.85|L8L 5B8|22742|0|1970-04-23|F|12/22/06 10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,,,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-04-10
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tal
Pyrite | Level 9 Tal
Pyrite | Level 9

thanks ,

when  you say "all the  other  records  have no pipe"  are you referring to the comments  after the  second  record?

And they want the comments imported too which does not make any sense. Yes  there are  21  fields  but  how do you  see the  23 commas 🙂  just curious

 

this is  one  of the comments  in the  huge  original file  which is  right  between the  2nd and  3rd  record but also  i noticed  the  value of the  20th  field which is a  string has some  wierd  data . The 23 commas  u showing  in  your  respond  are actually part  of the  20th field

.................................................      
? R4 policy created for T10 R3 pending application    
? Please refer to closed R3 policy for application answers - ABC11014263
? Any certificate reprint requests are to be sent to the Product Group for
review            
? Do not request any certificate reprints through Centtric |2017-04-10
Tom
Super User Tom
Super User

The commas are in the file you posted.  

If they are not in the file you are looking at then something has happened to make a change.  Did you perhaps read it into Excel and save it as a CSV file?  

28   data counts;
29     row+1;
30     infile "&path/pipedelim.csv" termstr=crlf;
31     input ;
32     nwords1=countw(_infile_,'|','mq');
33     nwords2=countw(_infile_,',','mq');
34     startq=_infile_=: '?';
35     if _n_ <= 10 then list;
36   run;

NOTE: The infile "...pipedelim.csv" is:
      Filename=...pipedelim.csv,
      RECFM=V,LRECL=32767,File Size (bytes)=1394,
      Last Modified=03May2017:14:16:30,
      Create Time=03May2017:14:16:30

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9--
1         2004-04-15|666623|1000000044|CARP|xxxxxx|Active||||OutPhone|Phone Call|R|25000|1.0|L5J4K4|73
      93  0697|730698|1979-09-06|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,, 151
2         2004-04-15|123456|6789956756|CARP|xxx|Lapsed|||yyyyyyy|OutPhone|Phone Call|P|100000|5.95|L5L
      93  5M1|730703|730704|1976-01-08|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,, 157
3         .................................................,,,,,,,,,,,,,,,,,,,,,,, 72
4         ? R4 policy created for T10 R3 pending application,,,,,,,,,,,,,,,,,,,,,,, 73
5         ? Please refer to closed R3 policy for application answers - ABC11014263 ,,,,,,,,,,,,,,,,,,,
      93  ,,,, 96
6         ? Any certificate reprint requests are to be sent to the Product Group for ,,,,,,,,,,,,,,,,,
      93  ,,,,,, 98
7         review,,,,,,,,,,,,,,,,,,,,,,, 29
8         ? Do not request any certificate reprints through Centtric |2017-04-10,,,,,,,,,,,,,,,,,,,,,,
      93  , 93
9         1996-06-21|36|10000357|AD|vvvv|Lapsed|||yyyy|InPhone|uuuuu|R|400000|0.0||22742|0|||12/22/06
      93  10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,,
     185  ,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-0
     277  4-10 280
10        ,,,,,,,,,,,,,,,,,,,,,,, 23
NOTE: 11 records were read from the infile "...pipedelim.csv".
      The minimum record length was 23.
      The maximum record length was 300.

If looks like the comment lines have fewer pipes that then other lines.  But at least one of them does have | characters. Is that just part of the comment?

data want (drop=comment) comments(keep=row comment);
   infile "pipedelim.dat" dsd dlm='|' truncover ;
   row+1;
   input @;
   if countw(_infile_,'|','mq') < 21 then do;
     input comment $300.;
     output comments;
   end;
  else do;
     length var1 var2 8;
     informat var1 yymmdd.;
     format var1 yymmdd10.;
     input var1 var2 ;
     output want;
   end;
run;

If you could post all or part of the specifications for the fields?  That might help make sense of it.

 

Do you have any control over the generation of this file?  Are lines 3 to 8 intended to be comments on line 2?  Can you not have them put onto the same line instead of making new lines?

 

 

Tal
Pyrite | Level 9 Tal
Pyrite | Level 9

it was a  csv  file they  gave me and  i am  reading it in as such but they only gave me the names of the fields,nothing more  so  we came  up with the specifications  here 🙂

 

input

APPLICATIONDATE :$10.

APPLICATIONNUM :$10.

CERTNUM :$10.

PLANTYPE :$2.

CREATEDBY :$30.

POLICY_STATUS :$15.

POLICY_DETAILED_STATUS :$15.

POLICY_EVENT :$15.

AGENT_ID :$30.

ENROLMENT :$15.

STIMULUS :$30.

OWNER :$3.

PRIMARYFACE :$10.

PREMIUM :$10.

POSTAL_CODE :$7.

CUSTOMERCID :$15.

INSURED2CID :$15.

DOB :$10.

GENDER :$2.

NOTES :$500.

DATE_STAMP :$10.;

run; ;

 

and  here i am attaching a  record  of this  csv  file where the  20th field goes into  3  rows and SAS  reads the data in the second  row as a new field(1st field) and  the  data in the  3rd row as 2nd field of a  new  record... and  the  20th field is pipe delimited  just like the others  right?

So no clue  why SAS is treating the value of this 20th field  like new  fields


Capture1.PNG
Tal
Pyrite | Level 9 Tal
Pyrite | Level 9

yes  lines   3-8  are a comment and  that comment is not  a part  of any record .

They  have a  bunch of different comments  throughout this  file and  now they say  they want  it  imported too???

 

the comments  are not even  pipe delimited...

Tom
Super User Tom
Super User

You need to figure out if those extra commas are really there or not.  But if when you look at the real file you still cannot tell the end of lines from the embedded line breaks then you can try using something like this to make a copy of the file with each record on one line.

filename dat temp ;
data _null_;
  infile "&path/pipedelim.csv" end=eof;
  file dat lrecl=32767;
  input ;
  if _n_>1 then do;
    if countw(_infile_,'|','mq') > 2 then put / @;
    else put '  ' @;
  end;
  put _infile_ @;
  if eof then put;
run;

But in your example there is some inconsistency in the number of PIPE characters.  So for the second record the NOTES column gets shifted one column to the right.  So you could try just reading in some extra columns and then trying to figure out which are the notes and which are the DATE_STAMP.

data want ;
  infile dat dsd dlm='|' truncover termstr=crlf;
length
  APPLICATIONDATE $10 
  APPLICATIONNUM $10 
  CERTNUM $10 
  PLANTYPE $2 
  CREATEDBY $30 
  POLICY_STATUS $15 
  POLICY_DETAILED_STATUS $15 
  POLICY_EVENT $15 
  AGENT_ID $30 
  ENROLMENT $15 
  STIMULUS $30 
  OWNER $3 
  PRIMARYFACE $10 
  PREMIUM $10 
  POSTAL_CODE $7 
  CUSTOMERCID $15 
  INSURED2CID $15 
  DOB $10 
  GENDER $2 
  X1-X3 $500
  NOTES $500 
  DATE_STAMP $10 
;
  input APPLICATIONDATE -- DATE_STAMP;
run;
Tom
Super User Tom
Super User

If you type a CSV file to the screen it would NOT look like the photograph you posted. Did you open the CSV file in Excel?  If so then Excel has already potentially changed the file.  If you do not have another text editor then just use SAS to read the file and look at what is in it. I posted an example using a DATA step to do this.

 

If they want to have line breaks within a line of the file then the actual end-of-line for the file needs to be different. For example in Excel you cna enter line breaks into a cell but when you save that as a CSV file the end-of-line is the two character sequence CR+LF ('0D0A'x) and the embedded line breaks are normally just CR ('0D'x).  So in that case you can read that with SAS by using the TERMSTR=CRLF option on the INFILE statement.

Tal
Pyrite | Level 9 Tal
Pyrite | Level 9
Ahh i see what you asking me. Yes, it was opened in excel n then saved as a csv file and given to me like it. I see now the 23 commas. If they want the comment i guess i can assign each line of it to the first variable so a comment of 3 lines would go to 3 separate records but what worries me are some records that have more than 1 column. Original csv file has only one column and the entire record embeded in it right?
But i see records where some of the value of "notes" is going to another column along with the 21th field so when SAS reads in the file assignes the value of that second column to a new line( into var1 n var2)
U think termstr=crlf would take care of that issue?

Thx for your insights Tom
Tom
Super User Tom
Super User
The file is clearly intended to be pipe delimited and not comma delimited. It is the inserted end of lines in the middle of the strings that is causing the trouble. If you had the original file before it was corrupted by Excel you might have a better chance at either being able to read it directly. Or of being able to convert the split lines back into single lines. For example the original file might have had just CR in the middle of the NOTES field and CRLF at the end of the lines. Or it might have had quotes around the NOTES field that would make it easier to tell when there are embedded end of lines that need to be removed.
Tal
Pyrite | Level 9 Tal
Pyrite | Level 9
Thx
Cant i open the file in text editor and remove the cr or crlf though?
Tal
Pyrite | Level 9 Tal
Pyrite | Level 9
And how do i find those characters?
Tal
Pyrite | Level 9 Tal
Pyrite | Level 9

so it is  pipe delimited but  there is no way i can  find  that splitter in the  20th field (Notes).

Any tricks about it? 🙂

 

 

Tom
Super User Tom
Super User

So that file does not have the extra commas, but it still has the embedded line breaks created using CR+LF and so there is no way to exactly tell that the line break is part of the NOTES field and not the actual end of the record.

You can see the binary codes for the characters in the file you can try reading the file as FIXED length records.

data _null_;
  infile "&path\temp.txt" lrecl=60 recfm=f ;
  input;
  list;
run;

 You could try editing by hand using a text editor, but if you have a large file that can be a pain. To "fix" the file you could first try using the code that I posted before. Basically it looks at the line and if it has a "lot" of pipes then it assumes it is the start of a new record. Otherwise it writes two blanks to replace the existing line break.  

 

In the code below I just asked SAS to make a temporary file to write the joined lines into, but you could save the result to a new file on your disk instead.

filename dat temp ;
data _null_;
  infile "&path/temp.txt" end=eof;
  file dat lrecl=32767;
  input ;
  if _n_>1 then do;
    if countw(_infile_,'|','mq') > 3 then put / @;
    else put '  ' @;
  end;
  put _infile_ @;
  if eof then put;
run;

Then just read it as a normal pipe delimited file.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 22 replies
  • 2406 views
  • 4 likes
  • 3 in conversation