Solved: Re: Extracting specific words from the middle of a string variable.

Adubhai · Posted 09-09-2023 09:35 PM

I have a variable called "family" which includes details of family members all together.

From that variable I want to extract the name of the spouse in a separate column. The logic needed is to extract all the words after either "Wife:" or "Husband:" till before the "(".

In another column I want to extract the birth year of daughter from "family". The logic will be search for the immediate next "b." after "Daughter:" and then extract the next four characters after "b." (excluding blank spaces) into a separate column.

There could be more than one spouse and more than one daughter in some cases. In those cases I want for each spouse separate columns, i.e. Spouse 1, spouse 2. And for each daughter separate columns, i.e Daughter 1, daughter 2.

Have:

family

[ Wife: XXX XXXX (architect, m. 6-Oct-1924, d. 13-Jan-1949, 2 children) , Daughter: AAA (b. 1925) , Son: BBBB (b. 1928) , Wife: YYY YYYY (architect, m. 4-Oct-1952) ]

[ Wife: XXX1 XXXX1 (div., one son) , Son: AAA1, Wife: YYY1 YYYY1 (actress, m. 1986, sep. 2008, one daughter) , Daughter: BBBB1 ]

[ Father: XXX2 XXXX2 XXXXX2 , Wife: YYYY2 YYYY2(gymnast, one daughter) , Daughter: AAA2 ]

[ Father: XXXX3 XXXXX3 ("XX3", d. 1922) , Mother: ZZZ ZZZZZ (d. 1923) , Wife: YYYY3 YYY3 (m. 28-Jun-1926, three daughters) , Daughter: AAA , Daughter: BBB , Daughter: CCCCC]

[ Wife: YYYY4 (one daughter, one son) , Daughter: AAAA3 (ballerina, b. 1962) , Son: BBB2 (b. 1968) ]

[ Father: XXXX5 (lawyer/administrator) , Wife: (d. 1970, two daughters) ]

Want:

Family	Spousename 1	Spousename 2	Daughteryear 1	Daughteryear 2
[ Wife: XXX XXXX (architect, m. 6-Oct-1924, d. 13-Jan-1949, 2 children) , Daughter: AAA (b. 1925) , Son: BBBB (b. 1928) , Wife: YYY YYYY (architect, m. 4-Oct-1952) ]	XXX XXXX	YYY YYYY	1925
[ Wife: XXX1 XXXX1 (div., one son) , Son: AAA1, Wife: YYY1 YYYY1 (actress, m. 1986, sep. 2008, one daughter) , Daughter: BBBB1 ]	XXX1 XXXX1	YYY1 YYYY1
[ Father: XXX2 XXXX2 XXXXX2 , Wife: YYYY2 YYYY2(gymnast, one daughter) , Daughter: AAA2 ]	YYYY2 YYYY2
[ Father: XXXX3 XXXXX3 ("XX3", d. 1922) , Mother: ZZZ ZZZZZ (d. 1923) , Wife: YYYY3 YYY3 (m. 28-Jun-1926, three daughters) , Daughter: AAA , Daughter: BBB , Daughter: CCCCC]	YYYY3 YYY3
[ Wife: YYYY4 (one daughter, one son) , Daughter: AAAA3 (ballerina, b. 1962) , Son: BBB2 (b. 1968) ]	YYYY4		1962
[ Father: XXXX5 (lawyer/administrator) , Wife: (d. 1970, two daughters) ]

For privacy purpose the names have been replaced with random alphabets.

Can someone help me design the code for this? Thanks.

Ksharp · Posted 09-19-2023 07:36 AM

/*OK.Here you go*/
libname x v9 'c:\temp\a';

data temp;
 set x.have;
 length want $ 200;
 pid=prxparse('/(Husband:|Wife:|Father:|Mother:|Daughter:|Son:)[\s\p\w\.]+([\s\p\w\.\(\)\-]+)?/i');
 s=1;e=length(family);
 call prxnext(pid,s,e,family,p,l);
 do while(p);
   want=substr(family,p,l);output;
   call prxnext(pid,s,e,family,p,l);
 end;
 keep sl want;
 run;
data temp2;
 set temp;
 vname=scan(want,1,': ');
 name=scan(want,2,':()');
 if prxmatch('/Daughter:|Son:/i',want) then year=compress(scan(want,-1,' ,()-','p'),,'kd');
 drop want;
run;
proc sort data=temp2 out=temp3;
by sl  vname;
run;
data temp4;
 set temp3;
 by sl vname;
 if first.vname then x=0;
 x+1;
 varname=cats(vname,x);
 drop  x vname;
run;

proc sort data=temp4 ;
by sl  varname;
run;
proc transpose data=temp4 out=temp5 ;
by sl varname ;
var name year ;
run;
data temp6;
 set temp5;
if lowcase(varname) in: ('father' 'mother') and missing(col1) then delete;
run;
proc transpose data=temp6 out=temp7 delimiter=_ ;
by sl;
id varname _name_;
var col1;
run;
data want;
 merge x.have temp7(drop=_name_);
 by sl;
run;

View solution in original post

Patrick · Posted 09-09-2023 09:49 PM

Please provide representative sample data (=data that reflects all your use cases) created via SAS data step code posted here. Then show us the desired result and explain the logic/patterns how you get from have to want.

Adubhai · Posted 09-09-2023 11:50 PM

Thanks Patrick. I have updated my post with sample have and want tables.

Ksharp · Posted 09-10-2023 05:51 AM

data have;
infile cards truncover;
input family $200.;
cards4;
[ Wife: XXX XXXX (architect, m. 6-Oct-1924, d. 13-Jan-1949, 2 children) , Daughter: AAA (b. 1925) , Son: BBBB (b. 1928) , Wife: YYY YYYY (architect, m. 4-Oct-1952) ]
[ Wife: XXX1 XXXX1 (div., one son) , Son: AAA1, Wife: YYY1 YYYY1 (actress, m. 1986, sep. 2008, one daughter) , Daughter: BBBB1 ]
[ Father: XXX2 XXXX2 XXXXX2 , Wife: YYYY2 YYYY2(gymnast, one daughter) , Daughter: AAA2 ]
[ Father: XXXX3 XXXXX3 ("XX3", d. 1922) , Mother: ZZZ ZZZZZ (d. 1923) , Wife: YYYY3 YYY3 (m. 28-Jun-1926, three daughters) , Daughter: AAA , Daughter: BBB , Daughter: CCCCC]
[ Wife: YYYY4 (one daughter, one son) , Daughter: AAAA3 (ballerina, b. 1962) , Son: BBB2 (b. 1968) ]
[ Father: XXXX5 (lawyer/administrator) , Wife: (d. 1970, two daughters) ]
;;;;

data temp;
 set have;
 n+1;
 pid=prxparse('/(Husband:|Wife:|Father:|Mother:)[\s\w]+|(Daughter:|Son:)[\s\w]+\([\w\s,]+\.\s+\d+\)/i');
 s=1;e=length(family);
 call prxnext(pid,s,e,family,p,l);
 do while(p);
   want=substr(family,p,l);output;
   call prxnext(pid,s,e,family,p,l);
 end;
 drop s e p l pid;
 run;
data temp2;
 set temp;
 vname=scan(want,1,': ');
 value=scan(want,2,':()');
 if findc(want,'()') then year=scan(want,-1,' ,()');
run;
proc sort data=temp2 out=temp3;
by n family vname;
run;
data temp4;
 set temp3;
 by n family vname;
 if first.vname then x=0;
 x+1;
 varname=cats(vname,x);
 drop want x vname;
run;
proc transpose data=temp4 out=temp5;
by n family varname;
var value year;
run;
proc transpose data=temp5 out=want(drop=_NAME_);
by n family ;
id varname _NAME_;
var col1;
run;

Adubhai · Posted 09-17-2023 03:22 AM

Hi Ksharp,
Thanks for the code. However this is giving me an error when trying to transpose temp4 into temp5. The error message reads "ERROR: Data set USER.TEMP4 is not sorted in ascending sequence. The current BY group has varname = Husband9 and the next BY group has varname = Husband10."
I tried sorting temp4 first by n family varname and then running it which actually gives me an output. But after the second transpose from temp5 to want my data is not showing everything the way I needed. No information of daughter is shown. Only father, mother, husband and wife.

Ksharp · Posted 09-17-2023 05:16 AM

The better is posting your real data,so I can replicate your problem.
And YES, you need PROC SORT before PROC TRANSPOSE when you have such ERROR info.
Or try NOTSORTED option of BY statement to avoid this ERROR info.
Like:

proc transpose data=temp4 out=temp5;
by n family varname notsorted ;
var value year;
run;
proc transpose data=temp5 out=want(drop=_NAME_);
by n family notsorted ;
id varname _NAME_;
var col1;
run;

If you still have problem, I suggest you post your real data and LOG.

Adubhai · Posted 09-18-2023 03:23 AM

Thanks Ksharp. With "notsorted" it does not show any error message anymore. But the output does not provide any daughter/son info. I have attached a portion of my actual data here for you.

Ksharp · Posted 09-18-2023 08:06 AM

OK. Try this one. Since YEAR is empty go through your data by my code logic, I drop YEAR not to transpose.

"the output does not provide any daughter/son info."

That is because my prx didn't match your real data.I changed it . And drop YEAR variable. If you need it ,that is another story.

libname x v9 'c:\temp\a';

data temp;
 set x.have;
 length want $ 200;
 pid=prxparse('/(Husband:|Wife:|Father:|Mother:|Daughter:|Son:)[\s\w\.]+/i');
 s=1;e=length(family);
 call prxnext(pid,s,e,family,p,l);
 do while(p);
   want=substr(family,p,l);output;
   call prxnext(pid,s,e,family,p,l);
 end;
 drop s e p l pid;
 run;
data temp2;
 set temp;
 vname=scan(want,1,': ');
 value=scan(want,2,':()');
/* if findc(want,'()') then year=scan(want,-1,' ,()','p');*/
run;
proc sort data=temp2 out=temp3;
by sl family vname;
run;
data temp4;
 set temp3;
 by sl family vname;
 if first.vname then x=0;
 x+1;
 varname=cats(vname,x);
 drop want x vname;
run;

proc sort data=temp4 ;
by sl family varname;
run;
proc transpose data=temp4 out=want let;
by sl family ;
var value ;
id varname;
run;

Adubhai · Posted 09-18-2023 09:44 PM

Thanks. This an improvement for sure. I have the daughter/son names now. But I need the birth years for the daughter and son as well.

Ksharp · Posted 09-19-2023 07:36 AM

/*OK.Here you go*/
libname x v9 'c:\temp\a';

data temp;
 set x.have;
 length want $ 200;
 pid=prxparse('/(Husband:|Wife:|Father:|Mother:|Daughter:|Son:)[\s\p\w\.]+([\s\p\w\.\(\)\-]+)?/i');
 s=1;e=length(family);
 call prxnext(pid,s,e,family,p,l);
 do while(p);
   want=substr(family,p,l);output;
   call prxnext(pid,s,e,family,p,l);
 end;
 keep sl want;
 run;
data temp2;
 set temp;
 vname=scan(want,1,': ');
 name=scan(want,2,':()');
 if prxmatch('/Daughter:|Son:/i',want) then year=compress(scan(want,-1,' ,()-','p'),,'kd');
 drop want;
run;
proc sort data=temp2 out=temp3;
by sl  vname;
run;
data temp4;
 set temp3;
 by sl vname;
 if first.vname then x=0;
 x+1;
 varname=cats(vname,x);
 drop  x vname;
run;

proc sort data=temp4 ;
by sl  varname;
run;
proc transpose data=temp4 out=temp5 ;
by sl varname ;
var name year ;
run;
data temp6;
 set temp5;
if lowcase(varname) in: ('father' 'mother') and missing(col1) then delete;
run;
proc transpose data=temp6 out=temp7 delimiter=_ ;
by sl;
id varname _name_;
var col1;
run;
data want;
 merge x.have temp7(drop=_name_);
 by sl;
run;

Adubhai · Posted 09-20-2023 09:56 PM

Thanks Ksharp,

This actually works in most cases. In some cases when the daughter or son's profession is mentioned before the birth year then the year is not extracted. But nonetheless this will do. Thank you very much.

Can I make one extra request? Is it possible to figure out if the eldest child is a son or daughter? I was assuming that the birth year will help but if we can't extract all the birth years then is it possible to find if the first appearance of the word "daughter" is before the first appearance of the word "son"? If so then a dummy variable of first_child_daughter can be 1.

I have already accepted your previous answer as the solution for the issue I initially asked for.

Ksharp · Posted 09-21-2023 07:44 AM

/*
" is it possible to find if the first appearance of the word "daughter" is before the first appearance of the word "son"? 
If so then a dummy variable of first_child_daughter can be 1. "

Sure.No problem.
*/
libname x v9 'c:\temp\a';

data temp;
 set x.have;
 length want $ 200;
 pid=prxparse('/(Husband:|Wife:|Father:|Mother:|Daughter:|Son:)[\s\p\w\."]+/i');
 s=1;e=length(family);
 call prxnext(pid,s,e,family,p,l);
 do while(p);
   want=substr(family,p,l);output;
   call prxnext(pid,s,e,family,p,l);
 end;
 keep sl want p ;  /*<-- Here p is the position of appearance of son or daughter*/
 run;
data temp2;
 set temp;
 vname=scan(want,1,': ');
 name=scan(want,2,':()');
 drop want;
run;
proc sort data=temp2 out=temp3;
by sl  vname;
run;
data temp4;
 set temp3;
 by sl vname;
 if first.vname then x=0;
 x+1;
 varname=cats(vname,x);
 drop  x vname;
run;
proc transpose data=temp4 out=temp5(drop=_NAME_) ;
by sl  ;
var name  ;
id varname;
run;


/*Get first child is daughter or son*/
proc sql;
create table son_or_dau as
select sl,vname as first_child
 from temp2
  where strip(lowcase(vname)) in ('son' 'daughter') 
   group by sl
    having p=min(p);
quit;
/*******/

data want;
 merge x.have son_or_dau temp5;
 by sl;
run;

Adubhai · Posted 09-22-2023 02:32 AM

Perfect! Thanks a lot.

mkeintz · Posted 09-10-2023 12:08 PM

This is a case where you can take advantage of the power of the "@" pointer in the INPUT statement. Reading your data from a text file would look like this:

data want (drop=i);
  infile cards truncover ;
  input family $200. @1 @;  *Read Family, reset pointer to column 1, and hold it there;

  array wife_name {2} $30;
  input @'Wife:' wife_name1 $30. @'Wife:' wife_name2 $30.  @1 @ ;  *Read up to 2 wife names+ excess characters;

  do i=1 to dim(wife_name);   *Remove excess characters ;
    if wife_name{i}^=' ' then substr(wife_name{i},indexc(wife_name{i},'('))=' ';
  end;

  input @'Daughter: '  @ 'b. ' dyear1 4. @'Daughter: '  @ 'b. ' dyear2 4. @1 @;
cards4;
[ Wife: XXX XXXX (architect, m. 6-Oct-1924, d. 13-Jan-1949, 2 children) , Daughter: AAA (b. 1925) , Son: BBBB (b. 1928) , Wife: YYY YYYY (architect, m. 4-Oct-1952) ]
[ Wife: XXX1 XXXX1 (div., one son) , Son: AAA1, Wife: YYY1 YYYY1 (actress, m. 1986, sep. 2008, one daughter) , Daughter: BBBB1 ]
[ Father: XXX2 XXXX2 XXXXX2 , Wife: YYYY2 YYYY2(gymnast, one daughter) , Daughter: AAA2 ]
[ Father: XXXX3 XXXXX3 ("XX3", d. 1922) , Mother: ZZZ ZZZZZ (d. 1923) , Wife: YYYY3 YYY3 (m. 28-Jun-1926, three daughters) , Daughter: AAA , Daughter: BBB , Daughter: CCCCC]
[ Wife: YYYY4 (one daughter, one son) , Daughter: AAAA3 (ballerina, b. 1962) , Son: BBB2 (b. 1968) ]
[ Father: XXXX5 (lawyer/administrator) , Wife: (d. 1970, two daughters) ]
;;;;

The "@'Wife:' tells SAS to move the input pointer to just after the designated text, at which location it read 30 characters into variables wife_name1 and wife_name2 (if there is one).

Because this will include excess characters, everything from "(" is set to a blank in a subsequent statement. This is one of those cases where the substr function can be used to the left of the equals sign in an assignment statement, as in substr(name,6)=' '; would set every character from column 6 to a blank. Only instead of the number 6, I use the location of the '('.

Because there is more information to be read, the INPUT statement ends with @1 @, which tells SAS to move the pointer back to column 1 (@1) and hold the pointer there for the next input statement (the trailing @).

Then the same technique can be used to read daughters' birth years, where the pointer is first moved to just after 'Daughter:' and then after 'b. ' to read the birth year.

Note the impact of the single trailing '@' will be relinquished at the end of the data step iteration, so the next line of text will be read for the next observation.

Now if you don't have the data in text file form, but already in variable FAMILY in a SAS dataset named HAVE, then:

filename fnames temp;
data _null_;
  file fnames;
  set have;
  put family;
run;

data want (drop=i);
  infile fnames truncover ;
  input family $200. @1 @;  *Read Family, reset pointer to column 1, and hold it there;

  array wife_name {2} $30;
  input @'Wife:' wife_name1 $30. @'Wife:' wife_name2 $30.  @1 @ ;  *Read up to 2 wife names+ excess characters;

  do i=1 to dim(wife_name);   *Remove excess characters ;
    if wife_name{i}^=' ' then substr(wife_name{i},indexc(wife_name{i},'('))=' ';
  end;

  input @'Daughter: '  @ 'b. ' dyear1 4. @'Daughter: '  @ 'b. ' dyear2 4. @;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Adubhai · Posted 09-17-2023 03:23 AM

Hi mkeintz,
Does this code work on data with more than 2 daughtes/wives?

SAS Training: Just a Click Away