Solved: Re: Filtering character data only when it has specific characteristics

mahler_ji · Posted 09-15-2014 12:34 PM

Hello All!

I hope that everyone had a great weekend! I have a quick question.

I have a SAS dataset that has a bunch of stock tickers in it (AAPL, BA, etc.), and some of them have a "-" or a "." in them. I want to filter out all of the observations that have these (accidental) special characters in that field.

Essentially, I want to keep observations that have only letters in their ticker symbol.

Any idea how this would work?

Thanks!

John

Reeza · Posted 09-15-2014 12:41 PM

Look at the notalpha function.

data want;
set have;
if notalpha(stock_ticker)>0 then delete;
run;

Hima · Posted 09-15-2014 01:55 PM

Hi Reeza

Sorry, it's not my intention to put you on the spot. I just want to learn. The code you provided is not working for me: I am getting an empty data set.

data temp3;
input string $ 1-11;
cards;
abcXxX/
_jklxxx
abc.jjj
xXx()lll
xxx*aaa
;
run;

data temp4;
set temp3;

if notalpha(string) > 0 then delete ;
run;

proc print data = temp4;
run;

I love to learn from Masters like you.

Reeza · Posted 09-15-2014 02:01 PM

Two reasons:

1. String has trailing blanks that need to be trimmed out.

2. You have no items in your data set that are all alpha to be returned.

data temp3;
input string $ ;
cards;
abcXxX/
_jklxxx
abc.jjj
xXx()lll
xxx*aaa
ABC
APPL
IBM
GOOG
;
run;

data temp4;
set temp3;
if notalpha(trim(string)) > 0 then delete ;
run;

proc print data = temp4;
run;
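
To see why the trailing blanks matter, a small check along these lines can help. This is only a sketch: the variable names ticker, padded, and trimmed are made up for illustration, and the three values are sample tickers from this thread. notalpha returns the position of the first character that is not a letter, or 0 if there is none, so a padded blank counts as a hit.

```sas
data _null_;
    length ticker $8;
    do ticker = 'IBM', 'BA-a', 'GOOG.a';
        /* The variable is padded to 8 characters, so the first blank is found */
        padded  = notalpha(ticker);
        /* After trim, 0 means the value contains letters only */
        trimmed = notalpha(trim(ticker));
        put ticker= padded= trimmed=;
    end;
run;
```

For IBM the log should show padded=4 (the first trailing blank) and trimmed=0, which is why the untrimmed test deletes every row whose value is shorter than the variable's length.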

Hima · Posted 09-15-2014 02:04 PM

Thanks for clarifying. I read the question incorrectly then. You are correct. Thank you so much for your quick reply.

mahler_ji · Posted 09-15-2014 09:19 PM

Hey all,

Thank you so much for all of your help. Both of the answers given were amazing.

I have a slightly different question now...

Is there a way that I can trim the tickers so that SAS will IGNORE everything after a certain character? For example, if the symbol is BAB-a, I want the observation to return BAB.

The only catch is that the length varies. Some have only one or two letters before the special character (e.g. BA-a) and some have more (GOOG.a). Sometimes the symbol is a dash, sometimes a period, and sometimes something else.

Any help would be amazing!

John

Reeza · Posted 09-15-2014 09:50 PM

Look at the scan function...

Jagadishkatam · Posted 09-15-2014 09:54 PM

Hi John,

Something like the code below should help. By default (on ASCII systems), the scan function treats the following characters as delimiters: blank ! $ % & ( ) * + , - . / ; < ^ |

With 1 as the second argument, scan returns the first word, that is, the text before the first delimiter.

data have;
input dat$;
new=scan(dat,1);
cards;
BA-a
BAB-a
GOOG.a
;
run;

If there are any other delimiters (written as ?? below), you can list them in the third argument of scan. A custom list replaces the defaults, however, so you should also include the default delimiters along with the new ones:

data have;
input dat$;
new=scan(dat,1,'?? ! $ % & ( ) * + , - . / ;');
cards;
BA-a
BAB-a
GOOG.a
;
run;

Thanks,

Jag
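
For the "sometimes something else" case, one delimiter-independent variant is to cut the value at the first non-letter that notalpha finds, instead of listing every possible symbol. A minimal sketch, assuming the have data set and dat variable from the code above; the names want, root, and pos are made up for illustration:

```sas
data want;
    set have;
    length root $8;
    /* Position of the first character that is not a letter (0 if none) */
    pos = notalpha(trim(dat));
    if pos = 0 then root = dat;                            /* letters only: keep as is */
    else if pos > 1 then root = substr(dat, 1, pos - 1);   /* keep the text before it */
    /* pos = 1: the value starts with a non-letter, so root stays blank */
    drop pos;
run;
```

One difference from scan is worth keeping in mind: scan skips leading delimiters, so a value like .a would give a, whereas this sketch leaves root blank for it.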


PGStats · Posted 09-15-2014 11:06 PM

You will get the most flexibility with regular expressions:

data have;
length dat $20;
input dat;
cards;
TOOLONGWORD
ABC1234
12abcd
BA-a
BAB@a
GOOG.a
.nothing
tôt^weird
;

data want;
/* Regular expression: ^: at beginning of string, [[:alpha:]]{1,6} : 1 to 6 alpha characters, /..../i : match case insensitive */
if not prxId then prxId + prxParse("/^[[:alpha:]]{1,6}/i");
set have;
call prxSubstr(prxId, dat, pos, len);
if pos > 0 then word = substr(dat, pos, len);
drop pos len prxId;
run;

title "Words with 1 to 6 alpha characters appearing at the beginning of strings";
proc print data = want noobs;
run;

PG
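
If the 1 to 6 character limit is not needed, the same idea can probably be written as a single substitution with prxchange. This is only a sketch, not part of the code above: the names want2 and root are made up for illustration, and it assumes the have data set with the dat variable defined earlier in this post.

```sas
data want2;
    set have;
    length root $20;
    /* Keep the leading run of letters and drop the rest of the value */
    root = prxchange('s/^([[:alpha:]]*).*$/$1/', 1, trim(dat));
run;
```

Values that start with a non-letter, such as .nothing, end up with a blank root. How accented letters such as ô are classified by [[:alpha:]] may depend on the session encoding, so that row is worth checking by hand.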
