Solved: Re: count specific char in line and put linefeed at specific cnt

havmaage · Posted 12-14-2019 07:07 AM

Hi Community,

I am trying to prepare a file with a speciel format

it is a csv format but with a pund sign(£) as delimiter and a lot if linefeeds/carriage returns in some of the text columns.

a line in my file with look like this

col1£col2£col3£col4£col5$
myname£myaftername£myaddress£this is a text column
wtih crlf all over
the place £col5_account

now i am trying to remove those crtl by count the oocourence if £ and remove them until the end of line, but a am still a newbie in SAS programming so i hoped for some kind advice in this forum.

My approach until now has been this peace of code, it is based on an example from SAS removing crtl inside quoted text columns

but i cant get my count to work properly and uit just writes \n all over right now.

%let repA='@';                    /* replacement character LF */
%let repD='$';                    /* replacement character CR */

                             
%let dsnnme="/folders/myfolders/testdata.txt";

      /* use full path of CSV file */

data test;
  infile &dsnnme recfm=n sharebuffers;
  file &dsnnme recfm=n;

  retain open 0;
   length cnt 8.;
   retain cnt 0;
  input a $char1.;
  /* If the character is a £ i increment the cnt variable by 1 */
  if a = 'A3'x then
     cnt = cnt + 1;
   
  /* if cnt is not end of line remove any crtl */
                             */
  
  if cnt < 5 then do;
    if a = '0D'x then put &repD;
    else if a = '0A'x then put &repA;
  end;
  if cnt = 5 then do;
  /* this should capture the end of line */
  cnt = 0;
 end; 
run;

Tom · Posted 12-14-2019 01:12 PM

First thing is to not overwrite your original file. Make a new file with the fix. That way you can test until you get it right without having to keep re-creating the input.

If you just want to count the number of delimiters then you should be able to do it by just reading and writing lines. To remove the embedded line breaks just don't write end of line until you see the next record starting.

First let's convert your example into an actual file.

filename source temp;
data _null_;
  file source ;
  put
 'col1£col2£col3£col4£col5'
/'myname£myaftername£myaddress£this is a text column'
/'wtih crlf all over'
/'the place £col5_account'
  ;
run;

So here is what it looks like:

164  data _null_;
165    infile source ;
166    input ;
167    list;
168  run;

NOTE: The infile SOURCE is:
      Filename=.../#LN00024,
      Owner Name=XXX,Group Name=YYY,
      Access Permission=-rw-rw-r--,
      Last Modified=14Dec2019:13:05:57,
      File Size (bytes)=119

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6
1         col1£col2£col3£col4£col5 24
2         myname£myaftername£myaddress£this is a text column 50
3         wtih crlf all over 18
4         the place £col5_account 23
NOTE: 4 records were read from the infile SOURCE.
      The minimum record length was 18.
      The maximum record length was 50.

Now let's run a data step to remove the breaks. This one will add a space where the line break was.

filename target temp;
data _null_;
  infile source end=eof;
  file target ;
  input ;
  count=countc(_infile_,'£');
  if total+count > 4 then do ;
    put ;
    total = 0;
  end;
  if total then put ' ' @;
  total+count;
  put _infile_ @;
  if eof then put;
run;

And here is the result:

185  data _null_;
186    infile target ;
187    input ;
188    list;
189  run;

NOTE: The infile TARGET is:
      Filename=.../#LN00025,
      Owner Name=XXX,Group Name=YYY,
      Access Permission=-rw-rw-r--,
      Last Modified=14Dec2019:13:05:57,
      File Size (bytes)=119

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         col1£col2£col3£col4£col5 24
2         myname£myaftername£myaddress£this is a text column wtih crlf all over the place £col5_account 93
NOTE: 2 records were read from the infile TARGET.
      The minimum record length was 24.
      The maximum record length was 93.

View solution in original post

KachiM · Posted 12-14-2019 10:24 AM

HI @havmaage

Can you show the output you require?

Tom · Posted 12-14-2019 12:25 PM

The right solution depends on how the file is formatted.

Are the values with embedded CR and/or LF characters enclosed in quotes? If they are then the solution posted by SAS that counts quotes will work. http://support.sas.com/kb/26/065.html

Do the values ever contain actual British pound symbols? If not then just counting the number of pound symbols might work. But it will get tricky if the embedded line breaks appear in the last value on the line.

If yes then how have the protected those symbols so that you can tell which are delimiters and which are data? There are two normally used styles. One is to enclose the value in quotes (that is what SAS does and the DSD infile option supports). The other is the "escape" the character by preceding it with a special character, normally the backslash, \ , character.

Tom · Posted 12-14-2019 01:12 PM

First thing is to not overwrite your original file. Make a new file with the fix. That way you can test until you get it right without having to keep re-creating the input.

If you just want to count the number of delimiters then you should be able to do it by just reading and writing lines. To remove the embedded line breaks just don't write end of line until you see the next record starting.

First let's convert your example into an actual file.

filename source temp;
data _null_;
  file source ;
  put
 'col1£col2£col3£col4£col5'
/'myname£myaftername£myaddress£this is a text column'
/'wtih crlf all over'
/'the place £col5_account'
  ;
run;

So here is what it looks like:

164  data _null_;
165    infile source ;
166    input ;
167    list;
168  run;

NOTE: The infile SOURCE is:
      Filename=.../#LN00024,
      Owner Name=XXX,Group Name=YYY,
      Access Permission=-rw-rw-r--,
      Last Modified=14Dec2019:13:05:57,
      File Size (bytes)=119

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6
1         col1£col2£col3£col4£col5 24
2         myname£myaftername£myaddress£this is a text column 50
3         wtih crlf all over 18
4         the place £col5_account 23
NOTE: 4 records were read from the infile SOURCE.
      The minimum record length was 18.
      The maximum record length was 50.

Now let's run a data step to remove the breaks. This one will add a space where the line break was.

filename target temp;
data _null_;
  infile source end=eof;
  file target ;
  input ;
  count=countc(_infile_,'£');
  if total+count > 4 then do ;
    put ;
    total = 0;
  end;
  if total then put ' ' @;
  total+count;
  put _infile_ @;
  if eof then put;
run;

And here is the result:

185  data _null_;
186    infile target ;
187    input ;
188    list;
189  run;

NOTE: The infile TARGET is:
      Filename=.../#LN00025,
      Owner Name=XXX,Group Name=YYY,
      Access Permission=-rw-rw-r--,
      Last Modified=14Dec2019:13:05:57,
      File Size (bytes)=119

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         col1£col2£col3£col4£col5 24
2         myname£myaftername£myaddress£this is a text column wtih crlf all over the place £col5_account 93
NOTE: 2 records were read from the infile TARGET.
      The minimum record length was 24.
      The maximum record length was 93.

havmaage · Posted 12-14-2019 01:46 PM

Thanks,

Actually the result im trying to chive is the last three line of this test file attached, i know that there will be no crtl on the last column it is a number

I am extracting data from a system where some text is typed in by human and therefore can contain linefeeds and carriage returns.

Tom · Posted 12-14-2019 02:13 PM

Your example does not seem consistent with 5 fields per record.

Do you have any control over how the values are dumped from the source? If you can get the source to generate the file in a systematic way that it is possible to interpret it will make it much easier.

The first line is the header row. And it looks like the the last 3 lines are normal records.

But the stuff in the middle looks confused, they appear to be grouped in three lines per record.

So lines 2 and 3 look like a record, but then line 4 looks like ii has just the last two fields, or two extra fields.

But the other groups seem ok as they do not have that extra delimiter like line 3 does.

75    options ls=132;
76    data _null_;
77      infile source encoding='utf8' ;
78      input;
79      list ;
80    run;

NOTE: The infile SOURCE is:
      Filename=c:\downloads\testdata.txt,
      RECFM=V,LRECL=131068,File Size (bytes)=702,
      Last Modified=14Dec2019:13:50:36,
      Create Time=14Dec2019:13:50:35

RULE:     ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1         col1£col2£col3£col4£col5 24
2         simon£jespersen£højbovej£Dette er er fri tekst 46
3         med en masse£23  16
4         linieskift i£21 15
5         simon£jespersen£højbovej£Dette er er fri tekst 46
6         med en masse  13
7         linieskift i£20 15
8         simon£jespersen£højbovej£Dette er er fri tekst 46
9         med en masse  13
10        linieskift i£10 15
11        simon£jespersen£højbovej£Dette er er fri tekst 46
12        med en masse  13
13        linieskift i£07 15
14        simon£jespersen£højbovej£Dette er er fri tekst 46
15        med en masse  13
16        linieskift i£01 15
17        simon£jespersen£højbovej£Dette er er fri tekstmed en masse linieskift i£25 74
18        simon£jespersen£højbovej£Dette er er fri tekstmed en masse linieskift i£26 74
19        simon£jespersen£højbovej£Dette er er fri tekstmed en masse linieskift i£27 74
NOTE: 19 records were read from the infile SOURCE.
      The minimum record length was 13.
      The maximum record length was 74.

havmaage · Posted 12-15-2019 10:32 AM

Thank you very much,

its funny on my computer it did not work by doing count=countc(_infile_, '£');

i needed to test on hex value count=countc(_infile_,'A3'x );

But your solution seems to solve my problem, thank you very much.

Tom · Posted 12-15-2019 02:45 PM

Your example file was using UTF8 encoding. So instead of 'A3'x you had a two byte sequence for that character.
You might have fewer issues if you used a standard ASCII character as the delimiter. Like TAB, '09'X, or pipe, '|'.

ErikLund_Jensen · Posted 12-14-2019 04:14 PM

Hi @havmaage

Instead of writing an output record in pieces, I find it easier to concatenate input lines to a SAS variable, until the expected number of delimiters is found, and then output and start building a new line. The following simple code will work with any number of line breaks if the input follow these rules:

all records are complete, so the number of pound signs is the same in every record,
a full record never exceeds 32k, (about. 20 normal text pages),
there is a always line break between records

filename source 'c:\temp\source.csv';
data _null_;
	file source ;
	put 'col1£col2£col3£col4£col5$';
	put 'myname£myaftername£myaddress£this is a text column';
	put 'wtih crlf all over';
	put 'the place £col5_account';
	put 'col1£col2£col3£col4£col5$';
	put 'myname£myaftername£myaddress£this is a text column';
	put 'with a single crlf £col5_account';
	put 'col1£col2£col3£col4£col5$';
 	put 'myname£myaftername£myaddress£this is a text column without crlf in the free text column £col5_account';
	put 'col1£col2£col3£col4£col5$myname£myaftername£myaddress£this is a text column without crlf in the free text column £col5_account';
  ;
run;

filename target 'c:\temp\target.csv';
data _null_;
	infile source;
	file target lrecl = 32000;
	retain outline;
	length outline $32000;
	input;
	outline = catx(' ',outline,_infile_);
	if countc(outline,'£') >= 8 then do;
		put outline;
		outline = '';
	end;
run;

v.h.

Erik

mkeintz · Posted 12-14-2019 04:44 PM

The issue, as I see it, is that a £ terminates the first 4 variables, while a CRLF terminates the last variable, correct?

If so, then if you assume

the fifth variable never starts with a CRLF, and
the fifth variable never has an embedded CRLF

then this works:

filename oneline temp;

data _null_;
  infile 'c:\temp\t.txt' ;
  length single_line $500;
  do until (n_dlim=4);
    input;
    n_dlim=sum(n_dlim,countc(_infile_,'£'));
    if char(single_line,length(single_line))^='£' then single_line=catx(' ',single_line,_infile_); else
    single_line=cats(single_line,_infile_);
  end;
  file oneline;
  put single_line ;
run;

data want;
  infile oneline dlm='£' truncover ;
  informat col1-col5 $char100. ;
  input col1 col2 col3 col4 col5 ;
run;

Like your program it converts multiple lines into a single line with all 5 variables - i.e. it effectively removes the excess CRLF's. It write the single line to a temporary raw file, which is read normally using the DLM='£' delimiter. A single line is considered complete when it has accommodated 4 £' signs. This is why the program relies on the assumption that the fifth function has no leading or internal CRLF's (a trailing CRLF is of course, indicator of end-of-record).

One note: If a '£' is immediately followed by a CRLF, I insert a blank in the concatenation of the accumulating single line and the record fragment represented by _INFILE_. Otherwise just concatenate normally.

You can even do this in one step, by parsing the complete SINGLE_LINE text with a set of SCAN functions, as in:

data want2 (keep=col:);
  infile 'c:\temp\t.txt' ;
  length single_line $500;
  do until (n_dlim=4);
    input;
    n_dlim=sum(n_dlim,countc(_infile_,'£'));
    if char(single_line,length(single_line))^='£' then single_line=catx(' ',single_line,_infile_); else
    single_line=cats(single_line,_infile_);
  end;
  col1=scan(single_line,1,'£');
  col2=scan(single_line,2,'£');
  col3=scan(single_line,3,'£');
  col4=scan(single_line,4,'£');
  col5=scan(single_line,5,'£');
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away