BookmarkSubscribeRSS Feed
Nipun22
Obsidian | Level 7
data have;
input name $ letter $;
datalines;
john j
Jacobs J
Alex A
mathew m
harward h
Liner L
Meesa M
xenon x
;
run;
data want;
set have;
if letter in('A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h') then group='A';
if letter in('I' 'J' 'K' 'L' 'M' 'N' 'O' 'i' 'j' 'k' 'l' 'm' 'n' 'o') then group = 'B';
if letter in('P' 'Q' 'R' 'S' 'T' 'U' 'W' 'X' 'Y' 'Z' 'p' 'q' 'r' 's' 't' 'u' 'w' 'x' 'y' 'z') 
then group = 'C';
run;

 this is the result i want from the code i have written above but is there any category like string.ascii_letters (similar to python) from which i can assign group so that i dont have to write individual letters like above?

Nipun22_1-1745688705904.png

 

9 REPLIES 9
Tom
Super User Tom
Super User

I am not sure what your question actually is.

 

But it looks like you want to make a FORMAT or informat that will convert your variable LETTER into the strings you want in the variable GROUP.

 

So you might make a format named $GROUP

proc format ;
value $group 
 'A','B','C','D','E','F','G','H','a','b','c','d','e','f','g','h' ='A'
 'I','J','K','L','M','N','O','i','j','k','l','m','n','o' = 'B'
 'P','Q','R','S','T','U','W','X','Y','Z','p','q','r','s','t','u','w','x','y','z' = 'C'
;
run;

Then use that format.  Either by attaching it to the LETTER variable so it displays the actual letter as the group code instead.

proc print data=have;
  format letter $group.;
run;

Or by using the PUT() function to make a new variable with the group code.

data want;
  set have;
  group=put(letter,$group.);
run;

If you don't need to attach the format then it might be easier to make an INFORMAT instead and use the INPUT() function.  That way you could take advantage of the UPCASE option available for informats that is not available for formats.

proc format ;
invalue $group (upcase)
 'A','B','C','D','E','F','G','H' ='A'
 'I','J','K','L','M','N','O' = 'B'
 'P','Q','R','S','T','U','W','X','Y','Z' = 'C'
;
run;

data want;
  set have;
  group=input(letter,$group.);
run;

 

Tom
Super User Tom
Super User

If you want help with comparisons to features from some other language you should explain in detail what that feature is.  So what is string.ascii_letters in Python?  

 

If you want the ASCII code (actually the code used to display that particular byte in your current single byte encoding) then use the RANK() function.  So the uppercase letter A in ASCII (and most single byte encodings like LATIN1 or WLATIN1 or even in the multi-byte encoding UTF-8) is the hexadecimal number 41 , which is the decimal number 65.

226  data _null_;
227    letter='A';
228    number=rank(letter);
229    put number= hex2. number= ;
230  run;

number=41 number=65

If the LETTER is in EBCDIC then you will get different codes.

 

But I am not sure how that knowledge helps with the problem you showed.

Why not just use comparison operators?

data want;
  set have;
  if 'A' <= upcase(letter) <= 'H'  then group='A';
  else if 'I' <= upcase(letter) <= 'O' then group = 'B';
  else if 'P' <= upcase(letter) <= 'Z' then group = 'C';
  else group = ' ';
run;

 

qingy
Fluorite | Level 6

You can use Perl regular expressions for this, through the PRXMATCH function. For example, /^[A-H]\s*$/i tells the function to start at the beginning of the string (symbolized by '^'), for a character ('[...]') in the range of A-H, possibly trailed by spaces ('\s') repeated 0 or more times ('*') before reaching the end of string '$'. The '/i' flag at the end makes the matching case-insensitive.

 

Note that I added a couple of extra rows at the end of the test dataset to validate against unicode look-alikes of ASCII characters.

 

 

data have;
length name $20; input name $ letter $; datalines; john j Jacobs J Alex A mathew m harward h Liner L Meesa M xenon x ascii_a a not_really_a а ; run; proc print data=have; format letter $hex8.; run; data want; set have; if prxmatch('/^[A-H]\s*$/i', letter) then group = 'A'; else if prxmatch('/^[I-O]\s*$/i', letter) then group = 'B'; else if prxmatch('/^[P-Z]\s*$/i', letter) then group = 'C'; run; proc print data=want; run;

qingy_3-1745700830589.png

 

qingy_4-1745700857668.png

 

 

Tom
Super User Tom
Super User

You can use Regular Expressions when defining INFORMATs also.

proc format ;
invalue $group (upcase)
 '/[A-H]/' (regexe) = 'A'
 '/[I-O]/' (regexe) = 'B'
 '/[P-Z]/' (regexe) = 'C'
;
run;

data want;
  set have;
  group=input(letter,$group.);
run;
Ksharp
Super User

Yes. But you need to be familar with Perl Regular Expression.

data have;
input name $ letter $;
if prxmatch('/[a-h]/i',strip(letter)) then group='A';
 else if prxmatch('/[i-o]/i',strip(letter)) then group='B';
  else if prxmatch('/[p-z]/i',strip(letter)) then group='C';

if prxmatch('/[[:ascii:]]/',strip(letter)) then is_ascii='Y';

datalines;
john j
Jacobs J
Alex A
mathew m
harward h
Liner L
Meesa M
xenon x
;
run;
qingy
Fluorite | Level 6

I don't think your code is an improvement over what I have provided. The way I have things set up, the Group field is left blank if there is an unexpected value in the Letter field (i.e. if it is not a single ASCII letter with no leading blanks). Additional steps to investigate the records that could not be assigned a Group value, such as checking for non-ASCII characters, can wait until it is established that such records exist - which is hopefully not the case.

 

With your unanchored '/[a-h]/i' regex pattern, a valid Group value could be assigned even if the record has an invalid entry in the Letter field, e.g. 'cellShift', '7z'. In my experience, it is a safer style to to flag such unexpected inputs with a special value in the output (blank in this case) instead of assuming they don't exist.

Tom
Super User Tom
Super User

You can use the _ERROR_ keyword when defining an INFORMAT to provide some error checking.

proc format ;
invalue $group (upcase)
 '/[A-H]/' (regexe) = 'A'
 '/[I-O]/' (regexe) = 'B'
 '/[P-Z]/' (regexe) = 'C'
  other = _error_
;
run;

data want;
  set have;
  group=input(letter,$group.);
run;
qingy
Fluorite | Level 6

The informat technique is an interesting one that I haven't seen before, but I have a couple of quibbles:

Apparently putting the character value you want the case to map to after the equals sign doesn't work in the case of regexp character informats (you'll get a syntax error if you try). The correct way to do it seems to be to use a substitution regex and then map it to the _same_ value like so (also note that the option is regexp or regexpe, not regexe).

 

proc format ;
invalue $group (upcase)
 's/[A-H]/A/' (regexpe) = _same_
 's/[I-O]/B/' (regexpe) = _same_
 's/[P-Z]/C/' (regexpe) = _same_
  other = _error_
;
run;

My other quibble is that again, it's essential to properly anchor the regular expression with ^ and $ so that input with extraneous non-whitespace characters are flagged for review. For example, consider this hypothetical dataset:

data have;
length name $20;
input name $ letter $;
datalines;
Meesa M
xenon x
ascii_a a
not_really_a а
typo vb
hex_k 6b
cellshift purple
;
run;

If you process it using the unanchored regexes, the only error you get is for the Cyrillic 'а'. But if you process it using the anchored versions of the regexes (see below), it correctly logs errors for each of the last four rows.

proc format ;
invalue $group_anchored (upcase)
 's/^\s*[A-H]\s*$/A/' (regexpe) = _same_
 's/^\s*[I-O]\s*$/B/' (regexpe) = _same_
 's/^\s*[P-Z]\s*$/C/' (regexpe) = _same_
  other = _error_
;
run;

 

Tom
Super User Tom
Super User

I would not use \s as that includes any "white space" , not just spaces.  I would prefer that things like tabs or "non-breaking spaces" be considered invalid.

 

It seems it is not REGEXP vs REGEXPE so much as the fact that it seems to only allow ONE regex expression per informat.  Some how the parser appears to get confused when parsing the second range when the first range results in a constant instead of the special keywords like  _same_ or _error_.

965  invalue $groupx (upcase)
966    '/^[A-H]\ *$/' (regexp) = 'A'
967    '/^[I-O]\ *$/' (regexp) = 'B'
                      -
                      22
                      76
ERROR 22-322: Syntax error, expecting one of the following: a quoted string,
              a numeric constant, a datetime constant, a missing value, ;,
              LOW, OTHER.
ERROR 76-322: Syntax error, statement will be ignored.
968    '/^[P-Z]\ *$/' (regexp) = 'C'
969    other=_error_
970  ;

But you can make the format if you use CNTLIN= option with a dataset that defines the format.

data cntlin ;
length fmtname $32 type $1 start $40 label $40 hlo $3 ;
FMTNAME="GROUP" ;
TYPE="J" ;
START="/^[A-H]\ *$/i" ; LABEL="A" ; HLO="P" ; output;
START="/^[I-O]\ *$/i" ; LABEL="B" ; HLO="P" ; output;
START="/^[P-Z]\ *$/i" ; LABEL="C" ; HLO="P" ; output;
START="**OTHER**" ; LABEL="_ERROR_" ; HLO="O" ; output;
run;
proc format lib=work.formats cntlin=cntlin ;
run;

 

You can also get around that by nesting multiple formats instead.

proc format ;
invalue $group (upcase) '/^[^ ]\ *$/' (regexp) = [$groupa.] other=_error_;
invalue $groupa  '/[A-H]/' (regexp) = 'A' other=[$groupb.];
invalue $groupb  '/[I-O]/' (regexp) = 'B' other=[$groupc.];
invalue $groupc  '/[P-Z]/' (regexp) = 'C' other=_error_;
run;

Example:

Spoiler
resetline;
data have;
  input name :$20. letter $;
  if name='Leadingspace' then letter=' A';
datalines;
A_lowercase b
A_uppercase A
b_Lowercase k
B_uppercase J
C_lowercase q
C_uppercase R
Multiple  AZ
Empty .
Non_letter /
Leadingspace .
;

proc format ;
invalue $group (upcase) '/^[^ ]\ *$/' (regexp) = [$groupa.] other=_error_;
invalue $groupa  '/[A-H]/' (regexp) = 'A' other=[$groupb.];
invalue $groupb  '/[I-O]/' (regexp) = 'B' other=[$groupc.];
invalue $groupc  '/[P-Z]/' (regexp) = 'C' other=_error_;
run;

data want;
  set have;
  format letter $quote.;
  group=input(letter,$group.);
run;

proc print;
run;

Log:

24   data want;
25     set have;
26     format letter $quote.;
27     group=input(letter,$group.);
28   run;

NOTE: Invalid argument to function INPUT at line 27 column 9.
name=Multiple letter="AZ" group=  _ERROR_=1 _N_=7
NOTE: Invalid argument to function INPUT at line 27 column 9.
name=Empty letter="" group=  _ERROR_=1 _N_=8
NOTE: Invalid argument to function INPUT at line 27 column 9.
name=Non_letter letter="/" group=  _ERROR_=1 _N_=9
NOTE: Invalid argument to function INPUT at line 27 column 9.
name=Leadingspace letter=" A" group=  _ERROR_=1 _N_=10
NOTE: There were 10 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.WANT has 10 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Output:

Obs    name            letter    group

  1    A_lowercase      "b"        A
  2    A_uppercase      "A"        A
  3    b_Lowercase      "k"        B
  4    B_uppercase      "J"        B
  5    C_lowercase      "q"        C
  6    C_uppercase      "R"        C
  7    Multiple         "AZ"
  8    Empty            ""
  9    Non_letter       "/"
 10    Leadingspace     " A"

 

 

 

 

 

 

sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 653 views
  • 7 likes
  • 4 in conversation