BookmarkSubscribeRSS Feed
SAS49
Obsidian | Level 7

Hi all,

 

Very new to SAS.  I have a data set with a variable/column called sample.  I have read in this dataset, but I would like to keep only specific rows that match a sample name pattern.  All of the ones that I would like to keep start with (A-H) and are then followed by a - and 3 numbers 

for example:

A-095

How do I keep only rows with samples in this format and drop other rows for example:

IQP17-UM-1997 

 

Thanks!

 

 

3 REPLIES 3
Astounding
PROC Star

I'm sure people who know parsing functions can give you a shorter expression to do the trick.  In the meantime, this should work:

 

data want;
   set have;
   if ("A" <=: sample <= "H") and substr(sample, 2, 1) = "-" and
   length (sample) = 5 and input(substr(sample, 3), ?? 3.) > .;
run;
Ksharp
Super User
data have;
input have $20.;
if prxmatch('/^[A-H]\-\d\d\d$/i',strip(have)) then matched=1;
cards;
A-095
IQP17-UM-1997 
;
FreelanceReinh
Jade | Level 19

Hi @SAS49 and welcome to the SAS Support Communities!

 

With PRXMATCH, one of the parsing functions mentioned by Astounding -- and already suggested by Ksharp, as I've just seen -- you can specify a Perl regular expression to describe the search pattern.

 

Example:

/^[A-H]-\d{3}$/

Explanation (see Tables of Perl Regular Expression (PRX) Metacharacters for the documentation):

  • ^ means "beginning of the string"
  • [A-H] means one of the characters in the range "A", "B", ..., "H"
  • - is the hyphen
  • \d means "digit," i.e. one of the characters in the range "0", "1", ..., "9"
  • \d{3} is a shorthand for \d\d\d, i.e., three digits in a row
  • $ means "end of the string"

This would be a suitable pattern for strings of length 5. However, your character variable must be longer to accommodate a string like "IQP17-UM-1997," which means that shorter values would be padded with trailing blanks. So we need to insert code for "zero or more blanks" (after \d{3}) into the regular expression or, simpler, trim the trailing blanks from the string (e.g., using the TRIM function) before starting the search, as shown below.

/* Create sample data for demonstration */

data have;
input sample $char13.;
cards;
a-095
A-095
A-0951
 A-044
B-40
H-123
IQP17-UM-1997
C-1.3
D-1E3
E-.99
F--28
G-+10
CF-123
;

/* Select observations matching the desired pattern */

data want;
set have;
if prxmatch('/^[A-H]-\d{3}$/', trim(sample));
run;

The PRXMATCH function returns the character position where the search pattern is found in the string -- necessarily 1 in your example -- and 0 if it is not found. Hence, the returned value is a suitable condition in a subsetting IF statement. The above code results in two observations in dataset WANT, as only "A-095" and "H-123" match the Perl regular expression.