Hi Experts,
I have the following sample dataset:
data have;
length url $3000.;
input url;
datalines;
blogs.sas.com/wan/2022/03/18/sas-eg-跳出錯誤訊
www.dog.it
;
run;
I'm trying to find a way to exclude every row in the dataset that contains non-ASCII or non-printable characters. Any hints are appreciated.
Accepted Solutions
The COMPRESS function to the rescue! This example keeps only the rows in which URL consists entirely of printable characters.
data want;
set have;
where compress(url,,'kw')=url;
run;
Hi @dcortell,
You can use the FINDC function with modifiers corresponding to character classes that you may want to keep or exclude. Or the VERIFY function:
data want;
set have;
if ~verify(url, collate(32,126));
run;
In this example I use the COLLATE function to specify the ASCII characters from blank (decimal ASCII code 32) to tilde (126) as the admissible characters. The subsetting IF statement excludes all observations where URL contains a character outside of this range.
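The FINDC alternative mentioned above could look roughly like this. It is a minimal sketch that assumes the same HAVE dataset and the same admissible range (32 to 126). The K modifier makes FINDC search for the first character that is not in the list, so a result of 0 means every character is admissible.

```sas
data want;
  set have;
  /* 'k' = look for characters NOT in the list; 0 means none were found. */
  /* Trailing blanks in URL are harmless because blank (32) is in the list. */
  if findc(url, collate(32,126), 'k') = 0;
run;
```

With the sample data this should keep www.dog.it and drop the row containing the Chinese characters, the same result as the VERIFY version.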
That COMPRESS(url,,'kw') test is going to include a LOT of non-ASCII characters.
data want;
url=collate(0,255);
expect=collate(32,126);
try=compress(url,,'kw');
if try ne expect then do;
extra=compress(try,expect);
put extra= / extra $hex. ;
end;
run;

extra=€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇ
8082838485868788898A8B8C8E9192939495969798999A9B9C9E9FA0A1A2A3A4A5A6A7A8A9AAABACAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7
Your sample data indicate otherwise, but should you by any chance be dealing with multibyte characters in your real data, then none of the already proposed solutions would work and you would need to look into the SAS string functions at I18N Level 2.
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/nlsref/p1pca7vwjjwucin178l8qddjn0gi.htm
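As one possible illustration of an I18N Level 2 function, the sketch below compares KLENGTH (length in characters) with LENGTH (length in bytes). It assumes a UTF-8 SAS session and nonblank URL values. It only detects multibyte characters, so it does not catch ASCII control characters.

```sas
/* Sketch only: assumes the SAS session encoding is UTF-8. */
data want;
  set have;
  /* LENGTH counts bytes and KLENGTH counts characters, so for a    */
  /* nonblank value they agree only when every character is 1 byte. */
  if klength(url) = length(url);
run;
```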
Note that UTF-8 encoding is designed not to mess with normal ASCII codes, so the first (best) solution of using VERIFY() with COLLATE(32,126) will work fine on UTF-8 strings.
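A small check makes this concrete. It is a sketch that assumes a UTF-8 session and a program file saved as UTF-8. The character 跳 (U+8DF3) is stored as the three bytes E8 B7 B3, all above 127, so VERIFY against COLLATE(32,126) already fails on the first byte.

```sas
/* Sketch only: assumes a UTF-8 session and UTF-8 source file. */
data _null_;
  length c $3;
  c = '跳';                           /* U+8DF3: three bytes in UTF-8 */
  pos = verify(c, collate(32,126));   /* position of first byte outside 32-126 */
  put c= $hex6. pos=;
run;
```

Under those assumptions the log should show c=E8B7B3 and pos=1.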
data have;
length url $3000.;
input url;
datalines;
blogs.sas.com/wan/2022/03/18/sas-eg-跳出錯誤訊
www.dog.it
;
run;
data want;
set have;
if prxmatch('/[[:^ascii:]]/',url) then flag=1;
run;
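The step above only flags rows that contain a non-ASCII character. If the goal is to drop those rows, and also rows with non-printable ASCII control characters as the original question asks, a variant along these lines might work. The character class [^\x20-\x7E] is my own substitution, not part of the post above: it matches anything outside the printable ASCII range.

```sas
data want;
  set have;
  /* Keep the row only if no byte falls outside hex 20 (blank) to 7E (tilde). */
  if not prxmatch('/[^\x20-\x7E]/', url);
run;
```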