Re: How to find a set of strings from character values?

deleted_user · Posted 10-23-2008 10:45 AM

Hi All,

I have a table containing telephone number fields which is set as character value.

We expect to get telephone numbers in a set format e.g. As a valid UK landline or mobile number.

Instead we seem to also get strings like HO, HOME, WORK, EX DIR etc.

Now I want to scan through all my telephone values to find how many of them have HO, HOME, WORK or EX DIR?

Can anyone help please.

Thanks a lot in advance,
Sree

IrishGuy · Posted 10-23-2008 12:03 PM

You can use the find function to find the substring. Check Language Reference Dictionary for the syntax for this. And you can output the results to 2 datasets if you want to view the records. E.g. something like

data strange_numbers (keep = phone_number) ok_numbers(keep = phone_number);
set mydataset;
if (find(phone_number, "HOM") > 0) or (find(teamid, "WORK") then output strange_numbers;
else
output ok_numbers;
run;

Or else you could use the retain statement to just get a running count of records that match and just output that as a total number.

If you want to get a bit more sophisticated you can use the prxparse functions to find a pattern in the data. This is more flexible. You could search for any letters for example or many different types of patterns.

Cynthia_sas · Posted 10-23-2008 06:13 PM

Hi:
If all you want to do is find out is about the whole universe of text that people have added to the phone field, I'd be very tempted to compress out all the digits and punctuation from the phone number and then just run a proc freq on the remaining text. If what you want to do is REMOVE all the ancillary text, then you could compress that out as well, creating a new variable. Both techniques are shown below.

cynthia
[pre]
** first, make some test data;
data fone_info;
length name $10 phone $25;
infile datalines dsd;
input name $ phone $;
return;
datalines;
alan, "123 456 7890 home"
barbara, "555-634-5789 h"
clark, "06-53931302 work"
dave, "236-6132 chicago"
ed, "101-853-5937 Land"
frank, "mobile 11-2345789"
gail, "05-2583536 ext 12"
harry, "98-9783412 x 44"
ida , "97-3335643"
jenny, "101-867-5329 cell"
kathy, "96-2349224"
lawrence, "123-555-2323"
mark, "545-893-5436"
ned, "66-1234567"
oscar, "444-333-4343"
paula, "654 222 5436 home"
quentin, "777-634-8754 h"
ross, "77-5468902 work"
stephen, "43-56498321 ext dir 33"
terri, "342-6132 chicago"
ula, "555-865-2647 Land"
victor, "mobile 22-2387389"
winifred, "93-43921301 work"
xavier, "528-6132 atlanta"
yvonne, "984-333-5935 Land"
zach, "mobile 22-2312126"
;
run;

** next, compress out the digits, making the text_w_phone variable;
** and compress out the alpha characters, making the phone_only variable;
data count_text;
set fone_info;
phone = upcase(phone);
text_w_phone = left(compress(phone,,'dp'));
phone_only = left(compress(phone,,'A'));
run;

title; footnote;
ods listing;
options nodate nonumber nocenter;

** Proc Freq step to see the whole universe of text strings with phone numbers;
proc freq data=count_text;
title 'get rid of digits and see what is left';
tables text_w_phone/missing;
run;

** proc print to see result of compress function;
proc print data=count_text;
var name phone_only text_w_phone phone;
run;
[/pre]

deleted_user · Posted 10-24-2008 11:40 AM

Thanks a lot IrishGuy and Cynthia@SAS, I will try both the method and see which one gives me the best results.

Cheers,
Sree

How to find a set of strings from character values?