Is there any category like string.ascii_letters from which i can searc...

Nipun22 · Posted 04-26-2025 01:44 PM

data have;
input name $ letter $;
datalines;
john j
Jacobs J
Alex A
mathew m
harward h
Liner L
Meesa M
xenon x
;
run;

data want;
set have;
if letter in('A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h') then group='A';
if letter in('I' 'J' 'K' 'L' 'M' 'N' 'O' 'i' 'j' 'k' 'l' 'm' 'n' 'o') then group = 'B';
if letter in('P' 'Q' 'R' 'S' 'T' 'U' 'W' 'X' 'Y' 'Z' 'p' 'q' 'r' 's' 't' 'u' 'w' 'x' 'y' 'z') 
then group = 'C';
run;

this is the result i want from the code i have written above but is there any category like string.ascii_letters (similar to python) from which i can assign group so that i dont have to write individual letters like above?

Tom · Posted 04-26-2025 02:08 PM

I am not sure what your question actually is.

But it looks like you want to make a FORMAT or informat that will convert your variable LETTER into the strings you want in the variable GROUP.

So you might make a format named $GROUP

proc format ;
value $group 
 'A','B','C','D','E','F','G','H','a','b','c','d','e','f','g','h' ='A'
 'I','J','K','L','M','N','O','i','j','k','l','m','n','o' = 'B'
 'P','Q','R','S','T','U','W','X','Y','Z','p','q','r','s','t','u','w','x','y','z' = 'C'
;
run;

Then use that format. Either by attaching it to the LETTER variable so it displays the actual letter as the group code instead.

proc print data=have;
  format letter $group.;
run;

Or by using the PUT() function to make a new variable with the group code.

data want;
  set have;
  group=put(letter,$group.);
run;

If you don't need to attach the format then it might be easier to make an INFORMAT instead and use the INPUT() function. That way you could take advantage of the UPCASE option available for informats that is not available for formats.

proc format ;
invalue $group (upcase)
 'A','B','C','D','E','F','G','H' ='A'
 'I','J','K','L','M','N','O' = 'B'
 'P','Q','R','S','T','U','W','X','Y','Z' = 'C'
;
run;

data want;
  set have;
  group=input(letter,$group.);
run;

Tom · Posted 04-26-2025 02:18 PM

If you want help with comparisons to features from some other language you should explain in detail what that feature is. So what is string.ascii_letters in Python?

If you want the ASCII code (actually the code used to display that particular byte in your current single byte encoding) then use the RANK() function. So the uppercase letter A in ASCII (and most single byte encodings like LATIN1 or WLATIN1 or even in the multi-byte encoding UTF-8) is the hexadecimal number 41 , which is the decimal number 65.

226  data _null_;
227    letter='A';
228    number=rank(letter);
229    put number= hex2. number= ;
230  run;

number=41 number=65

If the LETTER is in EBCDIC then you will get different codes.

But I am not sure how that knowledge helps with the problem you showed.

Why not just use comparison operators?

data want;
  set have;
  if 'A' <= upcase(letter) <= 'H'  then group='A';
  else if 'I' <= upcase(letter) <= 'O' then group = 'B';
  else if 'P' <= upcase(letter) <= 'Z' then group = 'C';
  else group = ' ';
run;

Nipun22 · Posted 09-07-2025 11:22 AM

thanks for the solution tom.
i didn't know that comparison operators can also be used on characters to group the records while characters can also be treated as a range. this is the most simplified solution could come across.

qingy · Posted 04-26-2025 04:56 PM

You can use Perl regular expressions for this, through the PRXMATCH function. For example, /^[A-H]\s*$/i tells the function to start at the beginning of the string (symbolized by '^'), for a character ('[...]') in the range of A-H, possibly trailed by spaces ('\s') repeated 0 or more times ('*') before reaching the end of string '$'. The '/i' flag at the end makes the matching case-insensitive.

Note that I added a couple of extra rows at the end of the test dataset to validate against unicode look-alikes of ASCII characters.

data have;
length name $20;
input name $ letter $;
datalines;
john j
Jacobs J
Alex A
mathew m
harward h
Liner L
Meesa M
xenon x
ascii_a a
not_really_a а
;
run;

proc print data=have;
	format letter $hex8.;
run;

data want;
	set have;
	if prxmatch('/^[A-H]\s*$/i', letter) then
		group = 'A';
	else if prxmatch('/^[I-O]\s*$/i', letter) then
		group = 'B';
	else if prxmatch('/^[P-Z]\s*$/i', letter) then
		group = 'C';
run;

proc print data=want;
run;

Tom · Posted 04-27-2025 11:51 AM

You can use Regular Expressions when defining INFORMATs also.

proc format ;
invalue $group (upcase)
 '/[A-H]/' (regexe) = 'A'
 '/[I-O]/' (regexe) = 'B'
 '/[P-Z]/' (regexe) = 'C'
;
run;

data want;
  set have;
  group=input(letter,$group.);
run;

Ksharp · Posted 04-26-2025 09:42 PM

Yes. But you need to be familar with Perl Regular Expression.

data have;
input name $ letter $;
if prxmatch('/[a-h]/i',strip(letter)) then group='A';
 else if prxmatch('/[i-o]/i',strip(letter)) then group='B';
  else if prxmatch('/[p-z]/i',strip(letter)) then group='C';

if prxmatch('/[[:ascii:]]/',strip(letter)) then is_ascii='Y';

datalines;
john j
Jacobs J
Alex A
mathew m
harward h
Liner L
Meesa M
xenon x
;
run;

qingy · Posted 04-27-2025 11:06 AM

I don't think your code is an improvement over what I have provided. The way I have things set up, the Group field is left blank if there is an unexpected value in the Letter field (i.e. if it is not a single ASCII letter with no leading blanks). Additional steps to investigate the records that could not be assigned a Group value, such as checking for non-ASCII characters, can wait until it is established that such records exist - which is hopefully not the case.

With your unanchored '/[a-h]/i' regex pattern, a valid Group value could be assigned even if the record has an invalid entry in the Letter field, e.g. 'cellShift', '7z'. In my experience, it is a safer style to to flag such unexpected inputs with a special value in the output (blank in this case) instead of assuming they don't exist.

Tom · Posted 04-27-2025 11:52 AM

You can use the _ERROR_ keyword when defining an INFORMAT to provide some error checking.

proc format ;
invalue $group (upcase)
 '/[A-H]/' (regexe) = 'A'
 '/[I-O]/' (regexe) = 'B'
 '/[P-Z]/' (regexe) = 'C'
  other = _error_
;
run;

data want;
  set have;
  group=input(letter,$group.);
run;

qingy · Posted 04-27-2025 02:05 PM

The informat technique is an interesting one that I haven't seen before, but I have a couple of quibbles:

Apparently putting the character value you want the case to map to after the equals sign doesn't work in the case of regexp character informats (you'll get a syntax error if you try). The correct way to do it seems to be to use a substitution regex and then map it to the _same_ value like so (also note that the option is regexp or regexpe, not regexe).

proc format ;
invalue $group (upcase)
 's/[A-H]/A/' (regexpe) = _same_
 's/[I-O]/B/' (regexpe) = _same_
 's/[P-Z]/C/' (regexpe) = _same_
  other = _error_
;
run;

My other quibble is that again, it's essential to properly anchor the regular expression with ^ and $ so that input with extraneous non-whitespace characters are flagged for review. For example, consider this hypothetical dataset:

data have;
length name $20;
input name $ letter $;
datalines;
Meesa M
xenon x
ascii_a a
not_really_a а
typo vb
hex_k 6b
cellshift purple
;
run;

If you process it using the unanchored regexes, the only error you get is for the Cyrillic 'а'. But if you process it using the anchored versions of the regexes (see below), it correctly logs errors for each of the last four rows.

proc format ;
invalue $group_anchored (upcase)
 's/^\s*[A-H]\s*$/A/' (regexpe) = _same_
 's/^\s*[I-O]\s*$/B/' (regexpe) = _same_
 's/^\s*[P-Z]\s*$/C/' (regexpe) = _same_
  other = _error_
;
run;

Tom · Posted 04-27-2025 03:16 PM

I would not use \s as that includes any "white space" , not just spaces. I would prefer that things like tabs or "non-breaking spaces" be considered invalid.

It seems it is not REGEXP vs REGEXPE so much as the fact that it seems to only allow ONE regex expression per informat. Some how the parser appears to get confused when parsing the second range when the first range results in a constant instead of the special keywords like _same_ or _error_.

965  invalue $groupx (upcase)
966    '/^[A-H]\ *$/' (regexp) = 'A'
967    '/^[I-O]\ *$/' (regexp) = 'B'
                      -
                      22
                      76
ERROR 22-322: Syntax error, expecting one of the following: a quoted string,
              a numeric constant, a datetime constant, a missing value, ;,
              LOW, OTHER.
ERROR 76-322: Syntax error, statement will be ignored.
968    '/^[P-Z]\ *$/' (regexp) = 'C'
969    other=_error_
970  ;

But you can make the format if you use CNTLIN= option with a dataset that defines the format.

data cntlin ;
length fmtname $32 type $1 start $40 label $40 hlo $3 ;
FMTNAME="GROUP" ;
TYPE="J" ;
START="/^[A-H]\ *$/i" ; LABEL="A" ; HLO="P" ; output;
START="/^[I-O]\ *$/i" ; LABEL="B" ; HLO="P" ; output;
START="/^[P-Z]\ *$/i" ; LABEL="C" ; HLO="P" ; output;
START="**OTHER**" ; LABEL="_ERROR_" ; HLO="O" ; output;
run;
proc format lib=work.formats cntlin=cntlin ;
run;

You can also get around that by nesting multiple formats instead.

proc format ;
invalue $group (upcase) '/^[^ ]\ *$/' (regexp) = [$groupa.] other=_error_;
invalue $groupa  '/[A-H]/' (regexp) = 'A' other=[$groupb.];
invalue $groupb  '/[I-O]/' (regexp) = 'B' other=[$groupc.];
invalue $groupc  '/[P-Z]/' (regexp) = 'C' other=_error_;
run;

Example:

Spoiler

resetline;
data have;
  input name :$20. letter $;
  if name='Leadingspace' then letter=' A';
datalines;
A_lowercase b
A_uppercase A
b_Lowercase k
B_uppercase J
C_lowercase q
C_uppercase R
Multiple  AZ
Empty .
Non_letter /
Leadingspace .
;

proc format ;
invalue $group (upcase) '/^[^ ]\ *$/' (regexp) = [$groupa.] other=_error_;
invalue $groupa  '/[A-H]/' (regexp) = 'A' other=[$groupb.];
invalue $groupb  '/[I-O]/' (regexp) = 'B' other=[$groupc.];
invalue $groupc  '/[P-Z]/' (regexp) = 'C' other=_error_;
run;

data want;
  set have;
  format letter $quote.;
  group=input(letter,$group.);
run;

proc print;
run;

resetline; data have; input name :$20. letter $; if name='Leadingspace' then letter=' A'; datalines; A_lowercase b A_uppercase A b_Lowercase k B_uppercase J C_lowercase q C_uppercase R Multiple AZ Empty . Non_letter / Leadingspace . ; proc format ; invalue $group (upcase) '/^[^ ]\ *$/' (regexp) = [$groupa.] other=_error_; invalue $groupa '/[A-H]/' (regexp) = 'A' other=[$groupb.]; invalue $groupb '/[I-O]/' (regexp) = 'B' other=[$groupc.]; invalue $groupc '/[P-Z]/' (regexp) = 'C' other=_error_; run; data want; set have; format letter $quote.; group=input(letter,$group.); run; proc print; run;

Log:

24   data want;
25     set have;
26     format letter $quote.;
27     group=input(letter,$group.);
28   run;

NOTE: Invalid argument to function INPUT at line 27 column 9.
name=Multiple letter="AZ" group=  _ERROR_=1 _N_=7
NOTE: Invalid argument to function INPUT at line 27 column 9.
name=Empty letter="" group=  _ERROR_=1 _N_=8
NOTE: Invalid argument to function INPUT at line 27 column 9.
name=Non_letter letter="/" group=  _ERROR_=1 _N_=9
NOTE: Invalid argument to function INPUT at line 27 column 9.
name=Leadingspace letter=" A" group=  _ERROR_=1 _N_=10
NOTE: There were 10 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Output:

Obs    name            letter    group

  1    A_lowercase      "b"        A
  2    A_uppercase      "A"        A
  3    b_Lowercase      "k"        B
  4    B_uppercase      "J"        B
  5    C_lowercase      "q"        C
  6    C_uppercase      "R"        C
  7    Multiple         "AZ"
  8    Empty            ""
  9    Non_letter       "/"
 10    Leadingspace     " A"

Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Re: Is there any category like string.ascii_letters from which i can search the words from?

Registration is open

SAS Training: Just a Click Away