Hello Community!
I have data that gets aggregated from multiple countries and I need a way to create a flag for a character variable that contains characters that are not found on USA keyboards. Is there a way to do this? Any guidance is appreciated!
Thanks!
@biglerc wrote:
Thank you very much for your response! I ended up using a loop through ascii numbers and it worked.
No need to loop. You can use COLLATE to generate them. All of the printable characters, from space to tilde, can be generated with COLLATE(32,126). Then use INDEXC() or VERIFY() function depending whether you want to test for the existence of of valid characters or existence of invalid characters.
data want;
set have;
any_valid = 0 ne indexc(string,collate(32,126));
any_invalid = 0 ne verify(string,collate(32,126));
run;
Example:
data have;
input string $50.;
if string='alpha' then string=tranwrd(string,'alpha','CEB1'x);
cards4;
`1234567890-=~!@#$%^&*()_+
qwertyuiop[]\QWERTYUIOP{}|
asdfghjkl;'ASDFGHJKL:"
zxcvbnm,./ZXCVBNM<>?
alpha
;;;;
data want;
set have;
any_valid = 0 ne indexc(trim(string),collate(32,126));
any_invalid = 0 ne verify(trim(string),collate(32,126));
run;
Result
Lots of answers here, depending on what "non-USA keyboard characters" really means to you.
Thank you very much for your response! I ended up using a loop through ascii numbers and it worked.
@biglerc If you're only dealing with single byte characters then below will also work and requires less code.
data have;
var='AbIöIxz'; output;
var='AbIoIxz'; output;
run;
data want;
set have;
nonEnglishChar_flg= prxmatch('/[^a-z]/i',strip(var))>0;
run;
proc print data=want;
run;
Thank you, Patrick! I am dealing with loads of them, since the data is coming from different countries. Here's an example: RESULT = תשובה לדוגמה.
My code currently uses substr to go character by character in every text string and identifies the position of a special character (if it finds one). This is working perfectly except for the instances like my example above where the entire string is special characters. In these cases SAS throws an error. Here is my code - it may be too out of context but maybe you can follow it? I'm just wondering why it won't do a substr if the whole value is special characters.
/* step through each character in the field to look for special character(s) */
do i = 1 to length(current_variable) ;
if rank(substr(current_variable),i,1)) gt 0 and rank(substr(current_variable),i,1)) lt 32 or rank(substr(current_variable),i,1)) gt 127 then do;
position= i;
char = byte(rank(substr(varvalue,i,1)));
nonascii_pos = catx(',',nonascii_pos,strip(put(position,8.)));
nonascii = catx(',',nonascii,strip(char ));
end;
end;
Thanks for taking the time to answer my question!
@biglerc wrote:
Thank you very much for your response! I ended up using a loop through ascii numbers and it worked.
No need to loop. You can use COLLATE to generate them. All of the printable characters, from space to tilde, can be generated with COLLATE(32,126). Then use INDEXC() or VERIFY() function depending whether you want to test for the existence of of valid characters or existence of invalid characters.
data want;
set have;
any_valid = 0 ne indexc(string,collate(32,126));
any_invalid = 0 ne verify(string,collate(32,126));
run;
Example:
data have;
input string $50.;
if string='alpha' then string=tranwrd(string,'alpha','CEB1'x);
cards4;
`1234567890-=~!@#$%^&*()_+
qwertyuiop[]\QWERTYUIOP{}|
asdfghjkl;'ASDFGHJKL:"
zxcvbnm,./ZXCVBNM<>?
alpha
;;;;
data want;
set have;
any_valid = 0 ne indexc(trim(string),collate(32,126));
any_invalid = 0 ne verify(trim(string),collate(32,126));
run;
Result
This is fantastic! I have never used COLLATE so this is a great learning experience for me. Thank you sooo much!
Hi Tom,
Might I ask you for guidance on a slight modification to the code you sent? Part of my requirement is to keep track of the position of the special character in a string. So for example, if the variable called "RESULT" contains 5 characters and the 3rd character is a special character, then I need to output 1) the name of the variable, 2) the special character(s) and 3) the position(s) of the special character(s). This is why I was looping through each character in the string. My code works correctly if there is at least one NON-special character in the string, however, if all characters are special characters, it crashes.
I'm currently experimenting to see if I can combine my looping with the collate you taught me to resolve this 🙂
Thanks!
Looping over the characters in a list should work.
You might share your issue with the loop failing when there are no special characters in a new topic. Provide a couple of simple example strings that are causing your problems.
Watch out if you are running with system encoding set to UTF-8 or other multibyte character encodings. It wouldn't matter for generating the simple binary flag variable since the normal 7bit ASCII characters only use one byte. But for locating other characters you will need to use the Kxxxx series of functions that are designed for working with characters that might use more than one byte.
So perhaps something like:
if verify(string,collate(32,126) then do;
do i=1 to klength(string);
length char $4 ;
char = ksubstr(string,i,1);
if verify(char,collate(32,126)) then put i= char= ;
end;
end;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.