- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
ACD(#)DEF
This is not a legal SAS variable name. Is there a function that will convert this to a legal SAS name?
I know I can do this with translate function, but then I would need to know every possible illegal character that might appear.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
notname function: http://support.sas.com/documentation/cdl/en/lrdict/63026/HTML/default/viewer.htm#a002197357.htm
nvalid function: http://support.sas.com/documentation/cdl/en/lrdict/63026/HTML/default/viewer.htm#a002271952.htm
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
any prx specialists on-line?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
> character that might appear.
Not really. You only need to know what characters are legal. The first character must be alphabetic or an underscore. Subsequent characters can be alphabetic, numeric, or underscores. (See Rules for SAS Variable Names.) Identify legal characters and substitute a legal character (such as '_') for the others.
Or perhaps converting these strings to name literals will do the trick for you? Use the VALIDVARNAME=ANY option and change ABC(#)DEF to 'ABC(#)DEF'n.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
> > character that might appear.
>
> Not really. You only need to know what characters are
> legal. The first character must be alphabetic or an
> underscore. Subsequent characters can be alphabetic,
> numeric, or underscores. (See
> Rules
> for SAS Variable Names.) Identify legal
> characters and substitute a legal character (such as
> '_') for the others.
Your solution still requires me to go through the string character by character, and translate as needed. Which is what I was hoping to avoid. Actually, my idea of using TRANSLATE is a whole lot easier for me to program than looping through each character. I was hoping there was a SAS function that would do the job for me.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I've wondered about this myself. SAS clearly knows how to do this, with the various IMPORT methods through SAS/ACCESS.
> Your solution still requires me to go through the
> string character by character, and translate as
> needed.
You don't actually need to loop every character. NOTNAME returns character number and when it returns 0 there are no more NOTNAME characters.
NVALID will tell you if you need to try at all.
This might be a good opportunity to learn PROC FCMP.
[pre]
1043 data _null_;
1044 length name $32;
1045 do name = 'ACD(#)DEF','SAS','Hello','1','1C',' ACD(#)DEF ';
1046 if not nvalid(name) then do;
1047 notname = notname(trimN(name));
1048 do while(notname);
1049 substr(name,notname,1)='_';
1050 notname = notname(trimN(name));
1051 end;
1052 if not anyfirst(first(name)) then name = cats('_',name);
1053 end;
1054 put 'NOTE: ' name=;
1055 end;
1056 run;
NOTE: name=ACD___DEF
NOTE: name=SAS
NOTE: name=Hello
NOTE: name=_1
NOTE: name=_1C
NOTE: name=_ACD___DEF
[/pre]
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
> strings produce distinct names? For example, should
> "ABC#DEF" and "ABC!DEF" be converted to two different
> names?
For my purposes, the answer is NO, because the incoming strings aren't truly arbitrary. However, I could see situations where the answer would be YES.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Below a RegEx which will replace all illegal characters with an underscore - more or less what some of the SAS import wizards are doing.
data sample;
have='9ACD(#)D9EF';
want=prxchange('s/^[^_a-z]|[^\w]/_/oi',-1,strip(have));
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This is great! I have never used these before, and I just took the time to figure out your string using the SAS help and regexr.com. One thing I haven't figured out is what the final /oi means. SAS Help only defines the s/<string>/<string>/ part. From trial and error, I believe the i means that it's case insensitive. When I removed the i, the matching on the first character of the string only occured when the first character was a lower case a-z, whereas with the i it matched either lower or upper case. What does the o do?
Second question: \w appears to include more than just A-Z, a-z, 0-9, and _. I used your code on the string "Côte d'Ivoire", and SAS returned "Côte_d_Ivoire". Unfortunately, ô is not permissible in a SAS name. Is there a way to get the perl expression to only check regular letters without diacritical marks?
Thanks!!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I think I figured this out. Just posting the answer for others.
The /o apparently just like the o option in the SAS function compress. If you have something complicated in your arguments, the system will figure out that complication only once, rather than each time. It's good if you know the resolved argument isn't going to change. However, upon reading, people generally say that one should never use /o anymore in Perl, because it automatically does this kind of check itself and /o can introduce other logic issues.
Regarding \w, this matches any letter at all that is deemed a valid letter by your language selection. ô is a valid letter on my computer, as are é, ß, õ, ü, î, etc. So \w will select any of these, and not \w will select none of these. I had to use [_A-Za-z0-9] without the /i option or [_a-z0-9] with the /i option in order to truly select valid SAS name characters. For the formula suggested, it required the 'not' version so it became [^_a-z0-9] with the /i option.
Lastly, prxChange is very slow. I figured it would be faster than SAS statements, but after running tests on various size datasets (1 million - 100 million records), SAS statements seem 3-5x faster. Even when requiring multiple statements, SAS was still quicker than prxChange. I was quite surprised. Here are some times I measured from the log.
Time of simple substitution (s) Time of substitution to valid SAS name (s)
Using prxChange Using SAS statements Using prxChange Using SAS statements
1M records 4.83 4.73 1.46 1.28 5.01 4.90 0.99 0.97 5M records 22.50 25.15 7.63 7.47 10M records 46.42 48.07 15.78 14.24 15M records 68.60 72.20 23.13 24.49 25M records 114.34 120.24 41.45 37.58 50M records 246.17 245.42 77.19 75.68 75M records 392.11 360.53 123.49 118.56 100M records 516.63 472.15 156.91 151.37 143.99 512.24 113.03 108.49 147.51
* Set up datasets of various sizes with random strings. *;
data test100 test075 test050 test025 test015 test010 test005 test001;
length i j length 8. string $ 32. _1-_32 $ 1.;
array __ {*} _1-_32;
do i = 1 to 100000000;
length = floor(32*rand('uniform')) + 1;
do j = 1 to 32;
if j <= length
then __[j] = byte(floor(256*rand('uniform')));
else __[j] = '';
end;
string = cat(of __{*});
output test100;
if i <= 75000000 then output test075;
if i <= 50000000 then output test050;
if i <= 25000000 then output test025;
if i <= 15000000 then output test015;
if i <= 10000000 then output test010;
if i <= 5000000 then output test005;
if i <= 1000000 then output test001;
end;
drop i j _:;
run;
* Test replacing without restriction on the first character not being a digit. *;
%macro repeat;
%do i = 1 %to 17;
data test001PRX; set test001; length valid $ 32.; valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test005PRX; set test005; length valid $ 32.; valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test010PRX; set test010; length valid $ 32.; valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test015PRX; set test015; length valid $ 32.; valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test025PRX; set test025; length valid $ 32.; valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test050PRX; set test050; length valid $ 32.; valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test075PRX; set test075; length valid $ 32.; valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test100PRX; set test100; length valid $ 32.; valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test001TRN; set test001; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); run;
data test005TRN; set test005; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); run;
data test010TRN; set test010; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); run;
data test015TRN; set test015; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); run;
data test025TRN; set test025; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); run;
data test050TRN; set test050; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); run;
data test075TRN; set test075; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); run;
data test100TRN; set test100; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); run;
%end;
%mend repeat;
%repeat;
proc compare base = test100TRN compare = test100PRX listAll; run;
* Test again with condition of making first character an underscore if a digit too. *;
%macro repeat2;
%do i = 1 %to 17;
data test001PRX; set test001; length valid $ 32.; valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test005PRX; set test005; length valid $ 32.; valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test010PRX; set test010; length valid $ 32.; valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test015PRX; set test015; length valid $ 32.; valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test025PRX; set test025; length valid $ 32.; valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test050PRX; set test050; length valid $ 32.; valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test075PRX; set test075; length valid $ 32.; valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test100PRX; set test100; length valid $ 32.; valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string)); run;
data test001TRN; set test001; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); if anyDigit(substr(valid, 1, 1)) then substr(valid, 1, 1) = '_'; run;
data test005TRN; set test005; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); if anyDigit(substr(valid, 1, 1)) then substr(valid, 1, 1) = '_'; run;
data test010TRN; set test010; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); if anyDigit(substr(valid, 1, 1)) then substr(valid, 1, 1) = '_'; run;
data test015TRN; set test015; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); if anyDigit(substr(valid, 1, 1)) then substr(valid, 1, 1) = '_'; run;
data test025TRN; set test025; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); if anyDigit(substr(valid, 1, 1)) then substr(valid, 1, 1) = '_'; run;
data test050TRN; set test050; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); if anyDigit(substr(valid, 1, 1)) then substr(valid, 1, 1) = '_'; run;
data test075TRN; set test075; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); if anyDigit(substr(valid, 1, 1)) then substr(valid, 1, 1) = '_'; run;
data test100TRN; set test100; length valid $ 32.; valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di')); if anyDigit(substr(valid, 1, 1)) then substr(valid, 1, 1) = '_'; run;
%end;
%mend repeat2;
%repeat2;
proc compare base = test100TRN compare = test100PRX listAll; run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks for all the testing and debugging.
As you already figured out:
o compile once
I case insensitive
As I understand it the o wouldn't be necessary when using a constant string. It's only of importance if you pass in the RegEx via a variable.
RegEx are resource intensive and though "slow". But they are really powerful when it comes to pattern matching.
As for the \w: Good catch!
Here some docu links. Especially the 2nd one (metacharacters) is sometimes a bit hard to find.
Cheers,
Patrick
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
[pre]
95 data _null_;
96 string = ' &@ 03 This Is &An. Invalid "SAS" Variable ± Name;';
97 ValidName=compress(string, '_', 'adik');
98 if anydigit(ValidName) = 1 then validname= substr(ValidName,anyalpha(ValidName));
99 put ValidName=;
100 run;
ValidName=ThisIsAnInvalidSASVariableName
[/pre]