Hi,
So i got a wierd file with a lot of comments in it .
I am sending you a sample.
They did send me the layout but with comments is it still possible to import a file ?
I mean i'd understand if the comments are part of the layout but this looks very weird.
Any hints pls?
or should i ask them to clean it first?
Thx
So that file does not have the extra commas, but it still has the embedded line breaks created using CR+LF and so there is no way to exactly tell that the line break is part of the NOTES field and not the actual end of the record.
You can see the binary codes for the characters in the file you can try reading the file as FIXED length records.
data _null_;
infile "&path\temp.txt" lrecl=60 recfm=f ;
input;
list;
run;
You could try editing by hand using a text editor, but if you have a large file that can be a pain. To "fix" the file you could first try using the code that I posted before. Basically it looks at the line and if it has a "lot" of pipes then it assumes it is the start of a new record. Otherwise it writes two blanks to replace the existing line break.
In the code below I just asked SAS to make a temporary file to write the joined lines into, but you could save the result to a new file on your disk instead.
filename dat temp ;
data _null_;
infile "&path/temp.txt" end=eof;
file dat lrecl=32767;
input ;
if _n_>1 then do;
if countw(_infile_,'|','mq') > 3 then put / @;
else put ' ' @;
end;
put _infile_ @;
if eof then put;
run;
Then just read it as a normal pipe delimited file.
Looks like the file is actually delimited with comma. There are 24 fields on each row if you just use comma as the delimiter.
Perhaps you can read the fields that have pipes into strings and use SCAN() function to parse out the individual pipe delimited values?
Your sample file has a set of lines, each with 23 commas - implying 24 fields. In turn, for some of those records the first comma-separated field has 20 pipe delimiters. All the other records have no pipe symbols at all. I guess the "pipe-less" lines are all comments. If so then this program should give the outline you need.
The first INPUT statement reads no variables put does fill the automatic variable _INFILE_ with the input line contents. Then assign IX to the first column containing a pipe. If ix=0 there are no pipes and the observation should be deleted. But if there are, then re-read the data, starting at column 1 ("INPUT @1") all the variables. Of course, you should use the format you need for the 21 variables to be read. I just used ":$20." as an example.
data want (drop=ix);
infile datalines dlm='|';
input @;
ix=indexc(_infile_,'|');
if ix=0 then delete;
input @1 (col1-col21) (:$15.);
datalines;
2004-04-15|666623|1000000044|CARP|xxxxxx|Active||||OutPhone|Phone Call|R|25000|1.0|L5J4K4|730697|730698|1979-09-06|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
2004-04-15|123456|6789956756|CARP|xxx|Lapsed|||yyyyyyy|OutPhone|Phone Call|P|100000|5.95|L5L5M1|730703|730704|1976-01-08|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
.................................................,,,,,,,,,,,,,,,,,,,,,,,
? R4 policy created for T10 R3 pending application,,,,,,,,,,,,,,,,,,,,,,,
? Please refer to closed R3 policy for application answers - ABC11014263 ,,,,,,,,,,,,,,,,,,,,,,,
? Any certificate reprint requests are to be sent to the Product Group for ,,,,,,,,,,,,,,,,,,,,,,,
review,,,,,,,,,,,,,,,,,,,,,,,
? Do not request any certificate reprints through Centtric |2017-04-10,,,,,,,,,,,,,,,,,,,,,,,
1996-06-21|36|10000357|AD|vvvv|Lapsed|||yyyy|InPhone|uuuuu|R|400000|0.0||22742|0|||12/22/06 10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,,,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-04-10
,,,,,,,,,,,,,,,,,,,,,,,
1996-06-21|36|10000357|AD|xxxx|Lapsed|||yyyy|InPhone|xxxxx|P|500000|49.85|L8L 5B8|22742|0|1970-04-23|F|12/22/06 10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,,,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-04-10
run;
thanks ,
when you say "all the other records have no pipe" are you referring to the comments after the second record?
And they want the comments imported too which does not make any sense. Yes there are 21 fields but how do you see the 23 commas 🙂 just curious
this is one of the comments in the huge original file which is right between the 2nd and 3rd record but also i noticed the value of the 20th field which is a string has some wierd data . The 23 commas u showing in your respond are actually part of the 20th field
................................................. | ||||||
? R4 policy created for T10 R3 pending application | ||||||
? Please refer to closed R3 policy for application answers - ABC11014263 | ||||||
? Any certificate reprint requests are to be sent to the Product Group for | ||||||
review | ||||||
? Do not request any certificate reprints through Centtric |2017-04-10 |
The commas are in the file you posted.
If they are not in the file you are looking at then something has happened to make a change. Did you perhaps read it into Excel and save it as a CSV file?
28 data counts; 29 row+1; 30 infile "&path/pipedelim.csv" termstr=crlf; 31 input ; 32 nwords1=countw(_infile_,'|','mq'); 33 nwords2=countw(_infile_,',','mq'); 34 startq=_infile_=: '?'; 35 if _n_ <= 10 then list; 36 run; NOTE: The infile "...pipedelim.csv" is: Filename=...pipedelim.csv, RECFM=V,LRECL=32767,File Size (bytes)=1394, Last Modified=03May2017:14:16:30, Create Time=03May2017:14:16:30 RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9-- 1 2004-04-15|666623|1000000044|CARP|xxxxxx|Active||||OutPhone|Phone Call|R|25000|1.0|L5J4K4|73 93 0697|730698|1979-09-06|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,, 151 2 2004-04-15|123456|6789956756|CARP|xxx|Lapsed|||yyyyyyy|OutPhone|Phone Call|P|100000|5.95|L5L 93 5M1|730703|730704|1976-01-08|M||2017-04-10,,,,,,,,,,,,,,,,,,,,,,, 157 3 .................................................,,,,,,,,,,,,,,,,,,,,,,, 72 4 ? R4 policy created for T10 R3 pending application,,,,,,,,,,,,,,,,,,,,,,, 73 5 ? Please refer to closed R3 policy for application answers - ABC11014263 ,,,,,,,,,,,,,,,,,,, 93 ,,,, 96 6 ? Any certificate reprint requests are to be sent to the Product Group for ,,,,,,,,,,,,,,,,, 93 ,,,,,, 98 7 review,,,,,,,,,,,,,,,,,,,,,,, 29 8 ? Do not request any certificate reprints through Centtric |2017-04-10,,,,,,,,,,,,,,,,,,,,,, 93 , 93 9 1996-06-21|36|10000357|AD|vvvv|Lapsed|||yyyy|InPhone|uuuuu|R|400000|0.0||22742|0|||12/22/06 93 10:06:42 AM(By Tom)Decline as he is in Canada on a student visa but we would be prepared ,,, 185 ,,,,,,,,,,,,,,,,,,,,to reconsider when he has obtained his permanent resident status |2017-0 277 4-10 280 10 ,,,,,,,,,,,,,,,,,,,,,,, 23 NOTE: 11 records were read from the infile "...pipedelim.csv". The minimum record length was 23. The maximum record length was 300.
If looks like the comment lines have fewer pipes that then other lines. But at least one of them does have | characters. Is that just part of the comment?
data want (drop=comment) comments(keep=row comment);
infile "pipedelim.dat" dsd dlm='|' truncover ;
row+1;
input @;
if countw(_infile_,'|','mq') < 21 then do;
input comment $300.;
output comments;
end;
else do;
length var1 var2 8;
informat var1 yymmdd.;
format var1 yymmdd10.;
input var1 var2 ;
output want;
end;
run;
If you could post all or part of the specifications for the fields? That might help make sense of it.
Do you have any control over the generation of this file? Are lines 3 to 8 intended to be comments on line 2? Can you not have them put onto the same line instead of making new lines?
it was a csv file they gave me and i am reading it in as such but they only gave me the names of the fields,nothing more so we came up with the specifications here 🙂
input
APPLICATIONDATE :$10.
APPLICATIONNUM :$10.
CERTNUM :$10.
PLANTYPE :$2.
CREATEDBY :$30.
POLICY_STATUS :$15.
POLICY_DETAILED_STATUS :$15.
POLICY_EVENT :$15.
AGENT_ID :$30.
ENROLMENT :$15.
STIMULUS :$30.
OWNER :$3.
PRIMARYFACE :$10.
PREMIUM :$10.
POSTAL_CODE :$7.
CUSTOMERCID :$15.
INSURED2CID :$15.
DOB :$10.
GENDER :$2.
NOTES :$500.
DATE_STAMP :$10.;
run; ;
and here i am attaching a record of this csv file where the 20th field goes into 3 rows and SAS reads the data in the second row as a new field(1st field) and the data in the 3rd row as 2nd field of a new record... and the 20th field is pipe delimited just like the others right?
So no clue why SAS is treating the value of this 20th field like new fields
yes lines 3-8 are a comment and that comment is not a part of any record .
They have a bunch of different comments throughout this file and now they say they want it imported too???
the comments are not even pipe delimited...
You need to figure out if those extra commas are really there or not. But if when you look at the real file you still cannot tell the end of lines from the embedded line breaks then you can try using something like this to make a copy of the file with each record on one line.
filename dat temp ;
data _null_;
infile "&path/pipedelim.csv" end=eof;
file dat lrecl=32767;
input ;
if _n_>1 then do;
if countw(_infile_,'|','mq') > 2 then put / @;
else put ' ' @;
end;
put _infile_ @;
if eof then put;
run;
But in your example there is some inconsistency in the number of PIPE characters. So for the second record the NOTES column gets shifted one column to the right. So you could try just reading in some extra columns and then trying to figure out which are the notes and which are the DATE_STAMP.
data want ;
infile dat dsd dlm='|' truncover termstr=crlf;
length
APPLICATIONDATE $10
APPLICATIONNUM $10
CERTNUM $10
PLANTYPE $2
CREATEDBY $30
POLICY_STATUS $15
POLICY_DETAILED_STATUS $15
POLICY_EVENT $15
AGENT_ID $30
ENROLMENT $15
STIMULUS $30
OWNER $3
PRIMARYFACE $10
PREMIUM $10
POSTAL_CODE $7
CUSTOMERCID $15
INSURED2CID $15
DOB $10
GENDER $2
X1-X3 $500
NOTES $500
DATE_STAMP $10
;
input APPLICATIONDATE -- DATE_STAMP;
run;
If you type a CSV file to the screen it would NOT look like the photograph you posted. Did you open the CSV file in Excel? If so then Excel has already potentially changed the file. If you do not have another text editor then just use SAS to read the file and look at what is in it. I posted an example using a DATA step to do this.
If they want to have line breaks within a line of the file then the actual end-of-line for the file needs to be different. For example in Excel you cna enter line breaks into a cell but when you save that as a CSV file the end-of-line is the two character sequence CR+LF ('0D0A'x) and the embedded line breaks are normally just CR ('0D'x). So in that case you can read that with SAS by using the TERMSTR=CRLF option on the INFILE statement.
so it is pipe delimited but there is no way i can find that splitter in the 20th field (Notes).
Any tricks about it? 🙂
So that file does not have the extra commas, but it still has the embedded line breaks created using CR+LF and so there is no way to exactly tell that the line break is part of the NOTES field and not the actual end of the record.
You can see the binary codes for the characters in the file you can try reading the file as FIXED length records.
data _null_;
infile "&path\temp.txt" lrecl=60 recfm=f ;
input;
list;
run;
You could try editing by hand using a text editor, but if you have a large file that can be a pain. To "fix" the file you could first try using the code that I posted before. Basically it looks at the line and if it has a "lot" of pipes then it assumes it is the start of a new record. Otherwise it writes two blanks to replace the existing line break.
In the code below I just asked SAS to make a temporary file to write the joined lines into, but you could save the result to a new file on your disk instead.
filename dat temp ;
data _null_;
infile "&path/temp.txt" end=eof;
file dat lrecl=32767;
input ;
if _n_>1 then do;
if countw(_infile_,'|','mq') > 3 then put / @;
else put ' ' @;
end;
put _infile_ @;
if eof then put;
run;
Then just read it as a normal pipe delimited file.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.