Re: SAS Function to convert string to Legal SAS Name?

Paige · Posted 11-03-2010 10:04 AM

Suppose I have an arbitrary text string, such as

ACD(#)DEF

This is not a legal SAS variable name. Is there a function that will convert this to a legal SAS name?

I know I can do this with translate function, but then I would need to know every possible illegal character that might appear.

Daryl · Posted 11-03-2010 11:02 AM

I haven't found a function to convert a string to a SAS variable name, but you might have a look at these functions; they might help you get where you need to go.

notname function: http://support.sas.com/documentation/cdl/en/lrdict/63026/HTML/default/viewer.htm#a002197357.htm

nvalid function: http://support.sas.com/documentation/cdl/en/lrdict/63026/HTML/default/viewer.htm#a002271952.htm

Peter_C · Posted 11-03-2010 12:23 PM

it sounds like a good application for a prxchange()

any prx specialists on-line?

Tim_SAS · Posted 11-03-2010 01:38 PM

> I would need to know every possible illegal
> character that might appear.

Not really. You only need to know what characters are legal. The first character must be alphabetic or an underscore. Subsequent characters can be alphabetic, numeric, or underscores. (See Rules for SAS Variable Names.) Identify legal characters and substitute a legal character (such as '_') for the others.

Or perhaps converting these strings to name literals will do the trick for you? Use the VALIDVARNAME=ANY option and change ABC(#)DEF to 'ABC(#)DEF'n.

Paige · Posted 11-03-2010 03:14 PM

> > I would need to know every possible illegal
> > character that might appear.
>
> Not really. You only need to know what characters are
> legal. The first character must be alphabetic or an
> underscore. Subsequent characters can be alphabetic,
> numeric, or underscores. (See
> Rules
> for SAS Variable Names.) Identify legal
> characters and substitute a legal character (such as
> '_') for the others.

Your solution still requires me to go through the string character by character, and translate as needed. Which is what I was hoping to avoid. Actually, my idea of using TRANSLATE is a whole lot easier for me to program than looping through each character. I was hoping there was a SAS function that would do the job for me.

data_null__ · Posted 11-03-2010 04:15 PM

Post=500

I've wondered about this myself. SAS clearly knows how to do this, with the various IMPORT methods through SAS/ACCESS.

> Your solution still requires me to go through the
> string character by character, and translate as
> needed.

You don't actually need to loop every character. NOTNAME returns character number and when it returns 0 there are no more NOTNAME characters.
NVALID will tell you if you need to try at all.

This might be a good opportunity to learn PROC FCMP.

[pre]
1043 data _null_;
1044 length name $32;
1045 do name = 'ACD(#)DEF','SAS','Hello','1','1C',' ACD(#)DEF ';
1046 if not nvalid(name) then do;
1047 notname = notname(trimN(name));
1048 do while(notname);
1049 substr(name,notname,1)='_';
1050 notname = notname(trimN(name));
1051 end;
1052 if not anyfirst(first(name)) then name = cats('_',name);
1053 end;
1054 put 'NOTE: ' name=;
1055 end;
1056 run;

NOTE: name=ACD___DEF
NOTE: name=SAS
NOTE: name=Hello
NOTE: name=_1
NOTE: name=_1C
NOTE: name=_ACD___DEF
[/pre]

Tim_SAS · Posted 11-04-2010 08:26 AM

Does the solution need to ensure that distinct strings produce distinct names? For example, should "ABC#DEF" and "ABC!DEF" be converted to two different names?

data_null__ · Posted 11-04-2010 09:06 AM

I would think YES. Which is a bit complicated than just changing NOTNAME characters.

Peter_C · Posted 11-04-2010 01:07 PM

could push into a hash table

Paige · Posted 11-04-2010 02:05 PM

> Does the solution need to ensure that distinct
> strings produce distinct names? For example, should
> "ABC#DEF" and "ABC!DEF" be converted to two different
> names?

For my purposes, the answer is NO, because the incoming strings aren't truly arbitrary. However, I could see situations where the answer would be YES.

Patrick · Posted 06-25-2018 07:04 AM

@Paige

Below a RegEx which will replace all illegal characters with an underscore - more or less what some of the SAS import wizards are doing.

data sample;
  have='9ACD(#)D9EF';
  want=prxchange('s/^[^_a-z]|[^\w]/_/oi',-1,strip(have));
run;

Kastchei · Posted 06-27-2018 01:01 PM

This is great! I have never used these before, and I just took the time to figure out your string using the SAS help and regexr.com. One thing I haven't figured out is what the final /oi means. SAS Help only defines the s/<string>/<string>/ part. From trial and error, I believe the i means that it's case insensitive. When I removed the i, the matching on the first character of the string only occured when the first character was a lower case a-z, whereas with the i it matched either lower or upper case. What does the o do?

Second question: \w appears to include more than just A-Z, a-z, 0-9, and _. I used your code on the string "Côte d'Ivoire", and SAS returned "Côte_d_Ivoire". Unfortunately, ô is not permissible in a SAS name. Is there a way to get the perl expression to only check regular letters without diacritical marks?

Thanks!!

Kastchei · Posted 06-28-2018 10:59 AM

I think I figured this out. Just posting the answer for others.

The /o apparently just like the o option in the SAS function compress. If you have something complicated in your arguments, the system will figure out that complication only once, rather than each time. It's good if you know the resolved argument isn't going to change. However, upon reading, people generally say that one should never use /o anymore in Perl, because it automatically does this kind of check itself and /o can introduce other logic issues.

Regarding \w, this matches any letter at all that is deemed a valid letter by your language selection. ô is a valid letter on my computer, as are é, ß, õ, ü, î, etc. So \w will select any of these, and not \w will select none of these. I had to use [_A-Za-z0-9] without the /i option or [_a-z0-9] with the /i option in order to truly select valid SAS name characters. For the formula suggested, it required the 'not' version so it became [^_a-z0-9] with the /i option.

Lastly, prxChange is very slow. I figured it would be faster than SAS statements, but after running tests on various size datasets (1 million - 100 million records), SAS statements seem 3-5x faster. Even when requiring multiple statements, SAS was still quicker than prxChange. I was quite surprised. Here are some times I measured from the log.

                 Time of simple substitution (s)            Time of substitution to valid SAS name (s)
              Using prxChange  Using SAS statements           Using prxChange  Using SAS statements
  1M records     4.83    4.73       1.46    1.28                 5.01    4.90       0.99    0.97
  5M records    22.50   25.15       7.63    7.47
 10M records    46.42   48.07      15.78   14.24
 15M records    68.60   72.20      23.13   24.49
 25M records   114.34  120.24      41.45   37.58
 50M records   246.17  245.42      77.19   75.68
 75M records   392.11  360.53     123.49  118.56
100M records   516.63  472.15     156.91  151.37  143.99       512.24             113.03  108.49  147.51

* Set up datasets of various sizes with random strings. *;
data test100 test075 test050 test025 test015 test010 test005 test001;
	length i j length 8. string $ 32. _1-_32 $ 1.;
	array __ {*} _1-_32;

	do i = 1 to 100000000;
		length = floor(32*rand('uniform')) + 1;
		
		do j = 1 to 32;
			if j <= length
				then __[j] = byte(floor(256*rand('uniform')));
				else __[j] = '';
		end;

		string = cat(of __{*});

		                       output test100;
		if i <= 75000000  then output test075;
		if i <= 50000000  then output test050;
		if i <= 25000000  then output test025;
		if i <= 15000000  then output test015;
		if i <= 10000000  then output test010;
		if i <=  5000000  then output test005;
		if i <=  1000000  then output test001;
	end;

	drop i j _:;
run;

* Test replacing without restriction on the first character not being a digit. *;
%macro repeat;
	%do i = 1 %to 17;
		data test001PRX;  set test001;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test005PRX;  set test005;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test010PRX;  set test010;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test015PRX;  set test015;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test025PRX;  set test025;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test050PRX;  set test050;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test075PRX;  set test075;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test100PRX;  set test100;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;

		data test001TRN;  set test001;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test005TRN;  set test005;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test010TRN;  set test010;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test015TRN;  set test015;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test025TRN;  set test025;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test050TRN;  set test050;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test075TRN;  set test075;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test100TRN;  set test100;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
	%end;
%mend repeat;

%repeat;

proc compare base = test100TRN compare = test100PRX listAll;  run;
	
* Test again with condition of making first character an underscore if a digit too. *;
%macro repeat2;
	%do i = 1 %to 17;
		data test001PRX;  set test001;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test005PRX;  set test005;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test010PRX;  set test010;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test015PRX;  set test015;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test025PRX;  set test025;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test050PRX;  set test050;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test075PRX;  set test075;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test100PRX;  set test100;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;

		data test001TRN;  set test001;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test005TRN;  set test005;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test010TRN;  set test010;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test015TRN;  set test015;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test025TRN;  set test025;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test050TRN;  set test050;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test075TRN;  set test075;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test100TRN;  set test100;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
	%end;
%mend repeat2;

%repeat2;

proc compare base = test100TRN compare = test100PRX listAll;  run;

Patrick · Posted 06-28-2018 05:32 PM

@Kastchei

Thanks for all the testing and debugging.

As you already figured out:

o compile once

I case insensitive

As I understand it the o wouldn't be necessary when using a constant string. It's only of importance if you pass in the RegEx via a variable.

RegEx are resource intensive and though "slow". But they are really powerful when it comes to pattern matching.

As for the \w: Good catch!

Here some docu links. Especially the 2nd one (metacharacters) is sometimes a bit hard to find.

http://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n0r8h2fa8djqf1n1cnenrvm573br.htm&...

http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p0s9ilagexm...

Cheers,

Patrick

polingjw · Posted 11-04-2010 05:54 AM

Here is another quick solution if it is acceptable to simply remove all the invalid characters from the string:

[pre]
95 data _null_;
96 string = ' &@ 03 This Is &An. Invalid "SAS" Variable ± Name;';
97 ValidName=compress(string, '_', 'adik');
98 if anydigit(ValidName) = 1 then validname= substr(ValidName,anyalpha(ValidName));
99 put ValidName=;
100 run;

ValidName=ThisIsAnInvalidSASVariableName
[/pre]

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away