BookmarkSubscribeRSS Feed
Paige
Quartz | Level 8
Suppose I have an arbitrary text string, such as

ACD(#)DEF

This is not a legal SAS variable name. Is there a function that will convert this to a legal SAS name?

I know I can do this with translate function, but then I would need to know every possible illegal character that might appear.
18 REPLIES 18
Daryl
SAS Employee
I haven't found a function to convert a string to a SAS variable name, but you might have a look at these functions; they might help you get where you need to go.

notname function: http://support.sas.com/documentation/cdl/en/lrdict/63026/HTML/default/viewer.htm#a002197357.htm

nvalid function: http://support.sas.com/documentation/cdl/en/lrdict/63026/HTML/default/viewer.htm#a002271952.htm
Peter_C
Rhodochrosite | Level 12
it sounds like a good application for a prxchange()

any prx specialists on-line?
Tim_SAS
Barite | Level 11
> I would need to know every possible illegal
> character that might appear.

Not really. You only need to know what characters are legal. The first character must be alphabetic or an underscore. Subsequent characters can be alphabetic, numeric, or underscores. (See Rules for SAS Variable Names.) Identify legal characters and substitute a legal character (such as '_') for the others.

Or perhaps converting these strings to name literals will do the trick for you? Use the VALIDVARNAME=ANY option and change ABC(#)DEF to 'ABC(#)DEF'n.
Paige
Quartz | Level 8
> > I would need to know every possible illegal
> > character that might appear.
>
> Not really. You only need to know what characters are
> legal. The first character must be alphabetic or an
> underscore. Subsequent characters can be alphabetic,
> numeric, or underscores. (See
> Rules
> for SAS Variable Names
.) Identify legal
> characters and substitute a legal character (such as
> '_') for the others.

Your solution still requires me to go through the string character by character, and translate as needed. Which is what I was hoping to avoid. Actually, my idea of using TRANSLATE is a whole lot easier for me to program than looping through each character. I was hoping there was a SAS function that would do the job for me.
data_null__
Jade | Level 19
Post=500

I've wondered about this myself. SAS clearly knows how to do this, with the various IMPORT methods through SAS/ACCESS.

> Your solution still requires me to go through the
> string character by character, and translate as
> needed.

You don't actually need to loop every character. NOTNAME returns character number and when it returns 0 there are no more NOTNAME characters.
NVALID will tell you if you need to try at all.

This might be a good opportunity to learn PROC FCMP.

[pre]
1043 data _null_;
1044 length name $32;
1045 do name = 'ACD(#)DEF','SAS','Hello','1','1C',' ACD(#)DEF ';
1046 if not nvalid(name) then do;
1047 notname = notname(trimN(name));
1048 do while(notname);
1049 substr(name,notname,1)='_';
1050 notname = notname(trimN(name));
1051 end;
1052 if not anyfirst(first(name)) then name = cats('_',name);
1053 end;
1054 put 'NOTE: ' name=;
1055 end;
1056 run;

NOTE: name=ACD___DEF
NOTE: name=SAS
NOTE: name=Hello
NOTE: name=_1
NOTE: name=_1C
NOTE: name=_ACD___DEF
[/pre]
Tim_SAS
Barite | Level 11
Does the solution need to ensure that distinct strings produce distinct names? For example, should "ABC#DEF" and "ABC!DEF" be converted to two different names?
data_null__
Jade | Level 19
I would think YES. Which is a bit complicated than just changing NOTNAME characters.
Peter_C
Rhodochrosite | Level 12
could push into a hash table
Paige
Quartz | Level 8
> Does the solution need to ensure that distinct
> strings produce distinct names? For example, should
> "ABC#DEF" and "ABC!DEF" be converted to two different
> names?

For my purposes, the answer is NO, because the incoming strings aren't truly arbitrary. However, I could see situations where the answer would be YES.
Patrick
Opal | Level 21

@Paige

Below a RegEx which will replace all illegal characters with an underscore - more or less what some of the SAS import wizards are doing.

data sample;
  have='9ACD(#)D9EF';
  want=prxchange('s/^[^_a-z]|[^\w]/_/oi',-1,strip(have));
run;
Kastchei
Pyrite | Level 9

This is great!  I have never used these before, and I just took the time to figure out your string using the SAS help and regexr.com.  One thing I haven't figured out is what the final /oi means.  SAS Help only defines the s/<string>/<string>/ part.  From trial and error, I believe the i means that it's case insensitive.  When I removed the i, the matching on the first character of the string only occured when the first character was a lower case a-z, whereas with the i it matched either lower or upper case.  What does the o do?

 

Second question: \w appears to include more than just A-Z, a-z, 0-9, and _.  I used your code on the string "Côte d'Ivoire", and SAS returned "Côte_d_Ivoire".  Unfortunately, ô is not permissible in a SAS name.  Is there a way to get the perl expression to only check regular letters without diacritical marks?

 

Thanks!!

Kastchei
Pyrite | Level 9

I think I figured this out.  Just posting the answer for others.

 

The /o apparently just like the o option in the SAS function compress.  If you have something complicated in your arguments, the system will figure out that complication only once, rather than each time.  It's good if you know the resolved argument isn't going to change.  However, upon reading, people generally say that one should never use /o anymore in Perl, because it automatically does this kind of check itself and /o can introduce other logic issues.

 

Regarding \w, this matches any letter at all that is deemed a valid letter by your language selection.  ô is a valid letter on my computer, as are é, ß, õ, ü, î, etc.  So \w will select any of these, and not \w will select none of these.  I had to use [_A-Za-z0-9] without the /i option or [_a-z0-9] with the /i option in order to truly select valid SAS name characters.  For the formula suggested, it required the 'not' version so it became [^_a-z0-9] with the /i option.

 

Lastly, prxChange is very slow.  I figured it would be faster than SAS statements, but after running tests on various size datasets (1 million - 100 million records), SAS statements seem 3-5x faster.  Even when requiring multiple statements, SAS was still quicker than prxChange.  I was quite surprised.  Here are some times I measured from the log.

 

 

                 Time of simple substitution (s)            Time of substitution to valid SAS name (s)
Using prxChange Using SAS statements Using prxChange Using SAS statements
1M records 4.83 4.73 1.46 1.28 5.01 4.90 0.99 0.97 5M records 22.50 25.15 7.63 7.47 10M records 46.42 48.07 15.78 14.24 15M records 68.60 72.20 23.13 24.49 25M records 114.34 120.24 41.45 37.58 50M records 246.17 245.42 77.19 75.68 75M records 392.11 360.53 123.49 118.56 100M records 516.63 472.15 156.91 151.37 143.99 512.24 113.03 108.49 147.51

 

 

 

* Set up datasets of various sizes with random strings. *;
data test100 test075 test050 test025 test015 test010 test005 test001;
	length i j length 8. string $ 32. _1-_32 $ 1.;
	array __ {*} _1-_32;

	do i = 1 to 100000000;
		length = floor(32*rand('uniform')) + 1;
		
		do j = 1 to 32;
			if j <= length
				then __[j] = byte(floor(256*rand('uniform')));
				else __[j] = '';
		end;

		string = cat(of __{*});

		                       output test100;
		if i <= 75000000  then output test075;
		if i <= 50000000  then output test050;
		if i <= 25000000  then output test025;
		if i <= 15000000  then output test015;
		if i <= 10000000  then output test010;
		if i <=  5000000  then output test005;
		if i <=  1000000  then output test001;
	end;

	drop i j _:;
run;

* Test replacing without restriction on the first character not being a digit. *;
%macro repeat;
	%do i = 1 %to 17;
		data test001PRX;  set test001;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test005PRX;  set test005;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test010PRX;  set test010;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test015PRX;  set test015;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test025PRX;  set test025;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test050PRX;  set test050;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test075PRX;  set test075;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test100PRX;  set test100;  length valid $ 32.;  valid = prxchange('s/[^_A-Z0-9]/_/i',-1,strip(string));  run;

		data test001TRN;  set test001;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test005TRN;  set test005;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test010TRN;  set test010;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test015TRN;  set test015;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test025TRN;  set test025;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test050TRN;  set test050;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test075TRN;  set test075;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
		data test100TRN;  set test100;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  run;
	%end;
%mend repeat;

%repeat;

proc compare base = test100TRN compare = test100PRX listAll;  run;
	
* Test again with condition of making first character an underscore if a digit too. *;
%macro repeat2;
	%do i = 1 %to 17;
		data test001PRX;  set test001;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test005PRX;  set test005;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test010PRX;  set test010;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test015PRX;  set test015;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test025PRX;  set test025;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test050PRX;  set test050;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test075PRX;  set test075;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;
		data test100PRX;  set test100;  length valid $ 32.;  valid = prxchange('s/^[^_A-Z]|[^_A-Z0-9]/_/i',-1,strip(string));  run;

		data test001TRN;  set test001;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test005TRN;  set test005;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test010TRN;  set test010;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test015TRN;  set test015;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test025TRN;  set test025;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test050TRN;  set test050;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test075TRN;  set test075;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
		data test100TRN;  set test100;  length valid $ 32.;  valid = translate(strip(string), repeat('_', length(strip(string))), compress(strip(string),'_abcdefghijklmnopqrstuvwxyz', 'di'));  if anyDigit(substr(valid, 1, 1))  then substr(valid, 1, 1) = '_';  run;
	%end;
%mend repeat2;

%repeat2;

proc compare base = test100TRN compare = test100PRX listAll;  run;

 

Patrick
Opal | Level 21

@Kastchei

Thanks for all the testing and debugging.

As you already figured out:

o  compile once

I   case insensitive

 

As I understand it the o wouldn't be necessary when using a constant string. It's only of importance if you pass in the RegEx via a variable.

 

RegEx are resource intensive and though "slow". But they are really powerful when it comes to pattern matching.

 

As for the \w: Good catch! 

 

Here some docu links. Especially the 2nd one (metacharacters) is sometimes a bit hard to find.

http://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=n0r8h2fa8djqf1n1cnenrvm573br.htm&...

http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p0s9ilagexm...

 

Cheers,

Patrick

polingjw
Quartz | Level 8
Here is another quick solution if it is acceptable to simply remove all the invalid characters from the string:

[pre]
95 data _null_;
96 string = ' &@ 03 This Is &An. Invalid "SAS" Variable ± Name;';
97 ValidName=compress(string, '_', 'adik');
98 if anydigit(ValidName) = 1 then validname= substr(ValidName,anyalpha(ValidName));
99 put ValidName=;
100 run;

ValidName=ThisIsAnInvalidSASVariableName
[/pre]

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 5003 views
  • 9 likes
  • 10 in conversation