Solved: standardize strings variable as names

GingerJJ · Posted 08-18-2021 03:24 PM

I have a variable as participants last name. There are names like:

Jose Cruz Jr

Jose Cruz Jr.

Jose Cruz Jr Jr

Jose Jrare

Also, like these

De la Rosa

Maria De la Rosa

Maria De Rosa

De Rosa

Vorde Rosa

Rosa Deloy

My question is how I can get rid of the Jr or Jr Jr only at the end of the string, but not in the middle as in 'Jose Jrare'? Also, I want to get rid of De in 'De Rosa', but not in 'Vorde Rosa' or 'Rosa Deloy'.

Thank you!

maguiremq · Posted 08-18-2021 03:53 PM

I think this calls for regular expressions. Let me know if you want me to explain what's going on. I may edit and add descriptions.

data have;
infile datalines dsd;
input name :$25.;
datalines;
Jose Cruz Jr
Jose Cruz Jr.
Jose Cruz Jr Jr
Jose Jrare
De la Rosa
Maria De la Rosa
Maria De Rosa
De Rosa
Vorde Rosa
Rosa Deloy
;
run;

data want;
	set have;
		/* Forgot to add the number. If Jr exists do a separate process. */
		if find(name, "Jr") > 0 then do;
			/* s/ starts the substition. It has a reciprocated / at the end to stop this process. */
			/* I group all the Jr's together with parentheses to treat them as one. */
			/* I then add the $, which looks specifically for the Jr. group at the end of the string. */
			/* I then replace it with a blank by specifying nothing after $/ */
			/* Also trimming names so there isn't any trailing blanks. */
			new_name = prxchange("s/(Jr|Jr.|Jr Jr)$//", -1 , trim(name));
		end;
			/* If it isn't Jr. do another process. */
			else do;
				/* Again, prxchange or substituting starts with "/s */
				/* I want to find two patterns: space (\s) De space (\s), or (|) a string that starts (^) with De. */
				/* Replace it with a space / / */
				/* Also trimming names so there isn't any trailing blanks. */
				new_name = prxchange("s/(\s\De\s|^De)/ /", -1, trim(name));
			end;
run;

name	new_name
Jose Cruz Jr	Jose Cruz
Jose Cruz Jr.	Jose Cruz
Jose Cruz Jr Jr	Jose Cruz
Jose Jrare	Jose Jrare
De la Rosa	la Rosa
Maria De la Rosa	Maria la Rosa
Maria De Rosa	Maria Rosa
De Rosa	Rosa
Vorde Rosa	Vorde Rosa
Rosa Deloy	Rosa Deloy

Ran this on my local copy of SAS and had some formatting issues. Added a left/trim to left align and trim blanks.

data want;
	set have;
		/* Forgot to add the number. If Jr exists do a separate process. */
		if find(name, "Jr") > 0 then do;
			/* s/ starts the substition. It has a reciprocated / at the end to stop this process. */
			/* I group all the Jr's together with parentheses to treat them as one. */
			/* I then add the $, which looks specifically for the Jr. group at the end of the string. */
			/* I then replace it with a blank by specifying nothing after $/ */
			/* Also trimming names so there isn't any trailing blanks. */
			new_name = left(trim(prxchange("s/(Jr|Jr.|Jr Jr)$//", -1 , trim(name))));
		end;
			/* If it isn't Jr. do another process. */
			else do;
				/* Again, prxchange or substituting starts with "/s */
				/* I want to find two patterns: space (\s) De space (\s), or (|) a string that starts (^) with De. */
				/* Replace it with a space / / */
				/* Also trimming names so there isn't any trailing blanks. */
				new_name = left(trim(prxchange("s/(\s\De\s|^De)/ /", -1, trim(name))));
			end;
run;

View solution in original post

maguiremq · Posted 08-18-2021 03:53 PM

I think this calls for regular expressions. Let me know if you want me to explain what's going on. I may edit and add descriptions.

data have;
infile datalines dsd;
input name :$25.;
datalines;
Jose Cruz Jr
Jose Cruz Jr.
Jose Cruz Jr Jr
Jose Jrare
De la Rosa
Maria De la Rosa
Maria De Rosa
De Rosa
Vorde Rosa
Rosa Deloy
;
run;

data want;
	set have;
		/* Forgot to add the number. If Jr exists do a separate process. */
		if find(name, "Jr") > 0 then do;
			/* s/ starts the substition. It has a reciprocated / at the end to stop this process. */
			/* I group all the Jr's together with parentheses to treat them as one. */
			/* I then add the $, which looks specifically for the Jr. group at the end of the string. */
			/* I then replace it with a blank by specifying nothing after $/ */
			/* Also trimming names so there isn't any trailing blanks. */
			new_name = prxchange("s/(Jr|Jr.|Jr Jr)$//", -1 , trim(name));
		end;
			/* If it isn't Jr. do another process. */
			else do;
				/* Again, prxchange or substituting starts with "/s */
				/* I want to find two patterns: space (\s) De space (\s), or (|) a string that starts (^) with De. */
				/* Replace it with a space / / */
				/* Also trimming names so there isn't any trailing blanks. */
				new_name = prxchange("s/(\s\De\s|^De)/ /", -1, trim(name));
			end;
run;

name	new_name
Jose Cruz Jr	Jose Cruz
Jose Cruz Jr.	Jose Cruz
Jose Cruz Jr Jr	Jose Cruz
Jose Jrare	Jose Jrare
De la Rosa	la Rosa
Maria De la Rosa	Maria la Rosa
Maria De Rosa	Maria Rosa
De Rosa	Rosa
Vorde Rosa	Vorde Rosa
Rosa Deloy	Rosa Deloy

Ran this on my local copy of SAS and had some formatting issues. Added a left/trim to left align and trim blanks.

data want;
	set have;
		/* Forgot to add the number. If Jr exists do a separate process. */
		if find(name, "Jr") > 0 then do;
			/* s/ starts the substition. It has a reciprocated / at the end to stop this process. */
			/* I group all the Jr's together with parentheses to treat them as one. */
			/* I then add the $, which looks specifically for the Jr. group at the end of the string. */
			/* I then replace it with a blank by specifying nothing after $/ */
			/* Also trimming names so there isn't any trailing blanks. */
			new_name = left(trim(prxchange("s/(Jr|Jr.|Jr Jr)$//", -1 , trim(name))));
		end;
			/* If it isn't Jr. do another process. */
			else do;
				/* Again, prxchange or substituting starts with "/s */
				/* I want to find two patterns: space (\s) De space (\s), or (|) a string that starts (^) with De. */
				/* Replace it with a space / / */
				/* Also trimming names so there isn't any trailing blanks. */
				new_name = left(trim(prxchange("s/(\s\De\s|^De)/ /", -1, trim(name))));
			end;
run;

GingerJJ · Posted 08-18-2021 04:52 PM

Thank you! Please do explain it a bit.

maguiremq · Posted 08-18-2021 06:40 PM

Hey @GingerJJ, I did in the second program that I included in my post.

Regular expressions can be a bit unclear at first, but they’re extremely powerful at parsing strings with odd conditions.

This link may help as well.

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

GingerJJ · Posted 08-18-2021 07:02 PM

Thank you so much!

Just saw your updated response!

Good thing is I found more abnormal patterns in my data and will probably come back with more questions.

standardize strings variable as names

Re: standardize strings variable as names

Re: standardize strings variable as names

Re: standardize strings variable as names

Re: standardize strings variable as names

Re: standardize strings variable as names

standardize strings variable as names

Re: standardize strings variable as names

Re: standardize strings variable as names

Re: standardize strings variable as names

Re: standardize strings variable as names

Re: standardize strings variable as names

Click image to register for webinar

Classroom Training Available!