Re: How Drop/Keep Rows in a Dataset Based on Sample Name Patterns?

SAS49 · Posted 11-02-2021 05:48 PM

Hi all,

Very new to SAS. I have a data set with a variable/column called sample. I have read in this dataset, but I would like to keep only specific rows that match a sample name pattern. All of the ones that I would like to keep start with (A-H) and are then followed by a - and 3 numbers

for example:

A-095

How do I keep only rows with samples in this format and drop other rows for example:

IQP17-UM-1997

Thanks!

Astounding · Posted 11-02-2021 08:08 PM

I'm sure people who know parsing functions can give you a shorter expression to do the trick. In the meantime, this should work:

data want;
   set have;
   if ("A" <=: sample <= "H") and substr(sample, 2, 1) = "-" and
   length (sample) = 5 and input(substr(sample, 3), ?? 3.) > .;
run;

Ksharp · Posted 11-03-2021 08:00 AM

data have;
input have $20.;
if prxmatch('/^[A-H]\-\d\d\d$/i',strip(have)) then matched=1;
cards;
A-095
IQP17-UM-1997 
;

FreelanceReinh · Posted 11-03-2021 08:25 AM

Hi @SAS49 and welcome to the SAS Support Communities!

With PRXMATCH, one of the parsing functions mentioned by Astounding -- and already suggested by Ksharp, as I've just seen -- you can specify a Perl regular expression to describe the search pattern.

Example:

/^[A-H]-\d{3}$/

Explanation (see Tables of Perl Regular Expression (PRX) Metacharacters for the documentation):

^ means "beginning of the string"
[A-H] means one of the characters in the range "A", "B", ..., "H"
- is the hyphen
\d means "digit," i.e. one of the characters in the range "0", "1", ..., "9"
\d{3} is a shorthand for \d\d\d, i.e., three digits in a row
$ means "end of the string"

This would be a suitable pattern for strings of length 5. However, your character variable must be longer to accommodate a string like "IQP17-UM-1997," which means that shorter values would be padded with trailing blanks. So we need to insert code for "zero or more blanks" (after \d{3}) into the regular expression or, simpler, trim the trailing blanks from the string (e.g., using the TRIM function) before starting the search, as shown below.

/* Create sample data for demonstration */

data have;
input sample $char13.;
cards;
a-095
A-095
A-0951
 A-044
B-40
H-123
IQP17-UM-1997
C-1.3
D-1E3
E-.99
F--28
G-+10
CF-123
;

/* Select observations matching the desired pattern */

data want;
set have;
if prxmatch('/^[A-H]-\d{3}$/', trim(sample));
run;

The PRXMATCH function returns the character position where the search pattern is found in the string -- necessarily 1 in your example -- and 0 if it is not found. Hence, the returned value is a suitable condition in a subsetting IF statement. The above code results in two observations in dataset WANT, as only "A-095" and "H-123" match the Perl regular expression.

How Drop/Keep Rows in a Dataset Based on Sample Name Patterns?