DATA Step, Macro, Functions and more

First & Last names

Reply
Super Contributor
Posts: 673

First & Last names

Dr. Smith T. Bauer MD
Samuel I Rodriguez M.D.
Will Glader MD


How to split the above Physicians names into first and last names:
Smith Bauer
Samuel Rodriguez
Will Glader

I tried to compress Dr.,MD and then tried to compress middle initial.But it is not applicable to all cases.
Super Contributor
Super Contributor
Posts: 3,174

Re: First & Last names

Since you are not going to have a consistent number of "tokens", suggest you parse using the SCAN function, and then based on known possible "last" values, you can decide how to identify a FIRST, LAST and maybe MIDDLE. Again, for this work-objective, your friend will be the DATA step and the SCAN function with some number of token-substring (declared as maybe $20 each) variables.

Check the DOC for using SCAN.

Scott Barry
SBBWorks, Inc.
Super Contributor
Posts: 474

Re: First & Last names

Hi.

String parsing/substitution is easily achievable with regular expressions.

The following code will do what you need, using the regular expression functions provided by SAS:

[pre]
data RESULT;
length FIRST $64 LAST $64;
input;
drop _:;

* prepare reg. expression for text parsing;
_EXPR='s/(Dr.)?(\s*\S*)\s*?\S*?(\s+)(\S*)\s*(MD|M.D.)\s*/$2 $4/i';
_REGX=prxparse(_EXPR);
_PRX=prxchange(_REGX,1,_infile_);

* split parsed text into desired variables;
FIRST=scan(_PRX,1);
LAST=scan(_PRX,2);

cards;
Dr. Smith T. Bauer MD
Samuel I Rodriguez M.D.
Will Glader MD
run;
[/pre]

Now, the difficult part is actually understand how regular expression work, and construct a valid one that will do precisely what you need.

Give a look at this pages from info about regular expressions:
http://www.regular-expressions.info/tutorial.html
http://www.troubleshooters.com/codecorn/littperl/perlreg.htm

And check the online doc about the subject at SAS:
http://support.sas.com/documentation/cdl/en/lrdict/62618/HTML/default/a002291852.htm

Cheers from Portugal

Daniel Santos @ www.cgd.pt
Super Contributor
Posts: 673

Re: First & Last names

Posted in reply to DanielSantos
Thanks Santos,
I will make necessary changes on the regex based on the conditions.
Super Contributor
Super Contributor
Posts: 3,174

Re: First & Last names

Whatever happened to simplicity - the SCAN function works just fine so why complicate the situation, possibly considering that someone may need to support your SAS program in the future.

Scott Barry
SBBWorks, Inc.
Super Contributor
Posts: 359

Re: First & Last names

That's called "job security" Scott.

;}
SAS Super FREQ
Posts: 8,868

Re: First & Last names

Hi:
As elegant as the PRX-only solution is, I agree with Scott that sometimes a more verbose or step-wise solution might be easier to maintain. For example, compare the 7 statements of the PRXCHANGE/REVERSE/SCAN solution with the 5 statements in the PRX-only solution.

If I am a PRX newbie and my ever inventive data entry folks throw a few curve balls like Dr. Casey (no first name) or Dr. First Last, Sr. MD or Dr. First Last, III MD I have a better chance of -successfully- adjusting my code using the hybrid approach. If I'm not a PRX newbie, then adjusting the regex will probably be easy.

cynthia
[pre]
data drname;
length name $30;
infile datalines dsd dlm=',';
input idnum name $;
return;
datalines;
1,"Dr. Smith T. Bauer MD"
2,"Samuel I Rodriguez M.D."
3,"Will Glader MD"
4,"Dr. Greg House"
5,"Dr Drake Morgan"
6,"DR Donnie Darko, Sr. MD"
7,"Dr. Casey"
;
run;

data parsename;
length first last z revname zz $30;
set drname;
** get rid of Dr, if present;
z = left(prxchange('s/(Dr |Dr. |DR |DR. )/ /',1, name));

** get rid of periods and commas;
z = compress(z,',.');

** reverse the string so that MD, Jr, Sr is first, if present;
revname = reverse(z);

** change MD, Jr, Sr (in reverse) to spaces;
zz = left(prxchange('s/(DM |rJ |rS )/ /',2, revname ));

** now first name is always the first chunk of the string;
first = scan(z,1,' ');

** and last name is always the first chunk of the reversed string;
** but the scanned string has to be reversed again to be correct;
last = reverse(scan(zz,1,' '));

** If first name and last name are the same (Dr. Casey), then set ;
** spaces/missing for first name;
if first = last then first = ' ';

** entire PRX solution;
* prepare reg. expression for text parsing;
_EXPR='s/(Dr.)?(\s*\S*)\s*?\S*?(\s+)(\S*)\s*(MD|M.D.)\s*/$2 $4/i';
_REGX=prxparse(_EXPR);
_PRX=prxchange(_REGX,1,name);

* split parsed text into desired variables;
alt_FIRST=scan(_PRX,1);
alt_LAST=scan(_PRX,2);

run;

ods listing;
proc print data=parsename;
var first last name z revname zz alt_first alt_last;
run;


[/pre]
Super Contributor
Posts: 474

Re: First & Last names

Posted in reply to Cynthia_sas
Hi all.

OK, regular expressions may not be something you learn on the fly, but it's actually something very logic and worth spending sometime learning (and I must say, my knowledge in the field is pretty average, and I manage myself to successfully produce one for the above question).
As I see it, string manipulation functions (scan, index, reverse, substr, etc) are quite good and will do the job for most of the needs. But in some cases, where transformation gets more complicated, those will produce a very "messy", difficult to understand and maintain code. Regular expressions will give you the same number of lines, and actually the same code, either you want to do some simple or more complex transformation. You just need to code the right expression. And it's been there for years (at leas in the UNIX world) so there's a lot of documentation, just one google away!

Regular expression are to my eyes one of the greatest new features of the SAS 9 platform, one that really adds value to the SAS language.

So, do not be afraid of the "infamous" PRX functions, they won't bite you.

But, then again, I believe any solution is a valid, if the result is the one expected.

Cheers from Portugal

Daniel Santos @ www.cgd.pt
Super Contributor
Posts: 673

Re: First & Last names

Posted in reply to DanielSantos
Thats true.Regular expressions are not as easy to understand as the other SAS functions.They are too confusing.
SAS Super FREQ
Posts: 8,868

Re: First & Last names

Posted in reply to DanielSantos
Daniel...
In principle, I agree with you. However, figuring out PRX expressions gives me a headache .. even with Google ... and reminds me of my old symbolic logic class, which also gave me headaches. I passed the course, but it was the hardest class I ever took.

To give Perl regular expressions their due: I -love- PRXCHANGE and if anything is going to lead me to understand the rest of Perl Regular Expressions, PRXCHANGE will get me there....but not today!

cynthia
Super Contributor
Posts: 474

Re: First & Last names

Posted in reply to Cynthia_sas
OK folks, just wanna help here.

But bear in mind, that although the easiest path may be the safest, it's probably not the one that will take you furthermore ahead (as a teacher of mine use to say).

And let me say, that you're missing quite an important and powerful feature of the SAS language.

Cheers from Portugal

Daniel Santos @ www.cgd.pt
Valued Guide
Posts: 2,177

Re: First & Last names

Posted in reply to DanielSantos
I still take the view that a public "text definition standard" that I presume Perl regular expressions provide, gives our SAS languages something on which to build the "inPicture" equivalent of "inValue" (at http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/a002473466.htm see proc format documantation for inValue).
Although I know a clever little birdie, in his "developers sandpit", has toyed with, and demonstrated techniques using perl regular expressions in the "start" string for an enhanced inValue statement, there does not seem to be enough "customer demand" behind this concept of better integration of regular expression within the SAS languages to create the prioity or budget for development and implementation of this logical extension of SAS language inFormats.


PeterC
Super Contributor
Posts: 673

Re: First & Last names

Posted in reply to Cynthia_sas
Thank you Cynthia.
Ask a Question
Discussion stats
  • 12 replies
  • 1884 views
  • 0 likes
  • 6 in conversation