Error while reading special characters

Contributor
Posts: 21

Hi,

I want to determine the ASCII values of some special characters, such as the Greek symbols α, η, ω, using the code snippet below:

 

data ascii (encoding="utf-8");
ascii_val=rank('Φ');
put ascii_val=;
run;

 

But these symbols get altered and I obtain wrong ASCII values in the log.
I actually want to read a file using the INFILE statement and scan every character in each line. If a special symbol (ASCII value greater than 127) is found, the DATA step should issue an error message and terminate.

 

 

Respected Advisor
Posts: 4,173

Re: Error while reading special characters

A variation of the code below should give you what you're after.

data have;
  length string $256;
  string = collate(0,255);
  output;
  string = collate(0,127);
  output;
run;

data 
  want (keep=string)
  error (keep=rownum string)
  exceptions (keep=rownum pos ASCII_Col_Seq)
  ;

  set have;
  rownum=_n_;

  retain prxid;
  if _n_=1 then
    prxid=prxparse('/[\x80-\xFF]/');

  _start=1;
  _stop=lengthn(string);
  /* _len receives the length of the match */
  call prxnext(prxid, _start, _stop, string, pos, _len);
  if pos>0 then
    do;
      output error;
      do while (pos > 0);
        ASCII_Col_Seq = rank(substr(string, pos,1));
        output exceptions;
/*        put ASCII_Col_Seq= pos=;*/
        call prxnext(prxid, _start, _stop, string, pos, _len);
      end;
    end;
  else
    output want;
run;
Contributor
Posts: 21

Re: Error while reading special characters

Could you please explain this code a bit?

Respected Advisor
Posts: 4,173

Re: Error while reading special characters

Hi,

 

Some brief explanations as requested.

My thinking behind this code was that it's often very helpful to get a list of all "forbidden" characters together with their positions in the string. This best supports further investigation and problem resolution.

 

- Data HAVE: Two rows get created; only the first one contains "forbidden" characters (in real life you might also want to exclude characters in the low range, e.g. HEX 00).

 

- The ERROR table simply contains a copy of the records from HAVE where an issue has been found.

- The EXCEPTIONS table contains one row per issue found in a row from HAVE. It gives you the exact position of the offending character in the string as well as its decimal code (ASCII_Col_Seq, as returned by RANK()). ROWNUM allows you to trace back where the issue originates (you could also use the business key columns instead, if there are any).

 

- prxid=prxparse('/[\x80-\xFF]/');  This compiles a regular expression that searches for characters in the hex range 80 to FF (= single-byte encoded characters greater than decimal 127).

 

- The way I've used PRXNEXT() lets you loop over the source string from HAVE, finding forbidden characters one by one. The syntax is pretty close to the example in the documentation, which explains it:  https://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#n1obc9u7z3...
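
To see the PRXNEXT() loop in isolation, here is a minimal sketch on a single hard-coded string. The sample value and the variable names (start, stop, len, code) are made up for illustration and are not part of the code posted above; it also assumes a single-byte session encoding, where each character is one byte.

```sas
data _null_;
  length string $20;
  /* hypothetical sample: two bytes above hex 7F, at positions 3 and 6 */
  string = 'ab' || 'E9'x || 'cd' || 'FC'x;

  prxid = prxparse('/[\x80-\xFF]/');
  start = 1;
  stop  = lengthn(string);

  /* the first call finds the first match; pos is 0 when there is none */
  call prxnext(prxid, start, stop, string, pos, len);
  do while (pos > 0);
    code = rank(substr(string, pos, 1));
    /* write position and code (decimal and hex) to the log */
    put pos= code= code=hex2.;
    /* PRXNEXT has moved start past the last match, so this finds the next one */
    call prxnext(prxid, start, stop, string, pos, len);
  end;
run;
```

Because PRXNEXT updates start after every match, the loop needs no position bookkeeping of its own and ends as soon as pos comes back as 0.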

 

Hope this sheds some light on the code I've posted.

 

Thanks,

Patrick

Valued Guide
Posts: 765

Re: Error while reading special characters

Hi, here's another idea.  The first data step creates a data set with one variable that might contain one or more characters with an ASCII value above 127.  The next data step compares the length of the string with its length after removing all characters with ASCII value 128+.

 

data have;
length x $10;
do i=1 to 20;
x=' ';
do j=1 to 10;
   y = ceil(100*ranuni(99)) + 32;
   x = catt(x,byte(y));
end;
output;
end;
keep x;
run;

 

* check for characters with ASCII value 128+ ... OK = 1 means there are none;

data want;
set have;
ok = length(x) eq length(compress(x,collate(128,255)));
run;

 

data set WANT ...

Obs        x         ok

  1    @$[e8€l`}\     0
  2    ,wQbkKƒ?fK     0
  3    .w>L+brMdA     1
  4    +04ƒFKhn<a     0
  5    Q8[U/?{K_y     1
  6    Lt(Bzl{+Wy     1
  7    h„9Zm0kZ7C     0
  8    _xb+RLpa_k     1
  9    4C6Qs&M^#]     1
 10    q3$ypchlqC     1
 11    4SrN>?Xspa     1
 12    jvr|1_X}fT     1
 13    {1|Q}DWQ0i     1
 14    ~f]>Yjz7Gm     1
 15    86K08O*€g1     0
 16    H/sE6ITbSi     1
 17    +b_8J5I?=v     1
 18    O4~vtC3ZPw     1
 19    MFr€-CuAeI     0
 20    h„D24xiM}i     0

 

If you want the data step to just stop when characters 128+ are encountered, you could just use ...

 

if length(x) ne length(compress(x,collate(128,255))) then stop;

 

If you are reading raw data rather than a data set, you could use ...

 

data want;
infile 'z:\ascii.txt';
input;
if length(_infile_) ne length(compress(_infile_,collate(128,255))) then stop;
run;
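
The original question also asked for an error message before the step terminates. A minimal sketch of one way to add that, assuming the same file as above; the variable pos and the message text are only illustrative:

```sas
data want;
  infile 'z:\ascii.txt';
  input;
  /* position of the first character above 127, or 0 if there is none */
  pos = findc(_infile_, collate(128,255));
  if pos then do;
    /* a message starting with ERROR: is flagged as an error in the log */
    putlog 'ERROR: character above ASCII 127 found. ' _n_= pos=;
    stop;
  end;
run;
```

STOP only ends this data step. If the whole job should fail instead, the ABORT statement may be a better fit, depending on how the program is run.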

 

 

Valued Guide
Posts: 765

Re: Error while reading special characters

Hi, you can also get a list of "bad" characters in an ERROR data set without resorting to PRX functions (for those of us who have never "gotten the hang of PRX") ... that smiley face in data set WANT is an HTML thing ...

 

* make variable X length 20 to increase the chance of 2+ bad characters;

data have;
length x $20;
do i=1 to 20;
x=' ';
do j=1 to 20;
y = ceil(100*ranuni(99)) + 32;
x = catt(x,byte(y));
end;
output;
end;
keep x;
run;

 

data error (keep=rec pos character ascii) want(keep=x ok);
set have;
ok = length(x) eq length(compress(x,collate(128,255)));
output want;
rec=_n_;
start=1;
do j=1 to length(x);
   pos = findc(x,collate(128,255),start);
   if pos then do;
      character=char(x,pos); ascii=rank(character);
      * resume the search right after the character just found;
      start=pos+1; output error;
   end;
   else leave;
end;
run;

 

the ERROR data set ...

Obs rec pos character ascii

1    1   6    €       128
2    1  17    ƒ       131
3    2  14    ƒ       131
4    4   2    „       132
5    8   8    €       128
6   10   4    €       128
7   10  12    „       132
8   11  17    „       132
9   18  10            129
10  19   4    ‚       130
11  19   8    €       128
12  19  13    ƒ       131
13  19  16    ƒ       131
14  20   2    €       128

 

the WANT data set ...

Obs       x            ok

1  @$[e8€l`}\,wQbkKƒ?fK 0
2  .w>L+brMdA+04ƒFKhn<a 0
3  Q8[U/?{K_yLt(Bzl{+Wy 1
4  h„9Zm0kZ7C_xb+RLpa_k 0
5  4C6Qs&M^#]q3$ypchlqC 1
6  4SrN>?Xspajvr|1_X}fT 1
7  {1|Q}DWQ0i~f]>Yjz7Gm 1
8  86K08O*€g1H/sE6ITbSi 0
9  +b_8J5I?=vO4~vtC3ZPw 1
10 MFr€-CuAeIh„D24xiM}i 0
11 n>n$/_a[|.1oK!78„sMx 0
12 S?m-{2]|&6'i$oI{3#T  1
13 X$4Cq#igu&'*eJIgqLw# 1
14 ]5eweSmiley TonguepE3't$}DsII0$ 1
15 U;N=cO*iy}xQ_%uICg^` 1
16 Vw.H\=Y?rI[]u^4g/M)O 1
17 M<pLy)%3wg\?bef`76"^ 1
18 +aJFURsNc_/|HMvDsL   0
19 t1q‚C@S€WCq7ƒK]ƒi3\" 0
20 T€S3sWm">P,!*k1NC08C 0
