## Creating a variable based on subsequent observations

Solved
Frequent Contributor
Posts: 82

# Creating a variable based on subsequent observations

I'm working with a dataset (from the Medical Expenditure Panel Survey, for those who are familiar) in which each observation is an individual.  Families are grouped together by the variable "DwellingID", individuals within a family are assigned a "PersonID."  Concatenating DwellingID and PersonID gives a unique identifier variable called "ID."  The variable "MomID" gives the PersonID of the individual's mother, provided that this mother is in the dataset.  Similarly, the variable "DadID" gives the PersonID of the individual's father, provided that this father is in the dataset.

So, for example, the data from this four-member family indicate that PersonID 101 is the mom of PersonID 103 and 104, while PersonID 102 is the father of PersonID 103 and 104:

1001                  101                 1001_101        N/A                      N/A

1001                  102                 1001_102        N/A                      N/A

1001                  103                 1001_103        101                      102

1001                  104                 1001_104        101                      102

What I'm trying to do is create an indicator that flags person 101 as a mother (e.g. MOTHER = 1) and person 102 as a father (e.g. FATHER = 1). I can look at the value of MomID when PersonID = 103 and deduce that PersonID 101 is the mom, but how do I get SAS to do this, so that I end up with something like this?

DwellingID          PersonID         ID                   MomID                 DadID           MOTHER          FATHER

1001                  101                 1001_101        N/A                      N/A               1                       0

1001                  102                 1001_102        N/A                      N/A               0                       1

1001                  103                 1001_103        101                      102                0                       0

1001                  104                 1001_104        101                      102                0                       0

Thanks!

Accepted Solutions
Solution
‎08-12-2011 11:11 PM
PROC Star
Posts: 8,164

## Re: Creating a variable based on subsequent observations

I think that you only need a fairly simple datastep merge or sql join.  In a datastep, you could probably do it with something like:

data have;

infile cards truncover;

input (DUID PID DUPERSID MomID DadID) (\$);

cards;

1001  101  1001101

1001  102  1001102

1001  103  1001103

1001  104  1001104  1001101  1001102

1001  105  1001105  1001103  1001102

1002  101  1002101

1003  101  1003101

1003  102  1003102

1003  103  1003103  1003102  1003101

1003  104  1003104  1003102  1003101

;

proc sort data=have;

by DUID descending MomID;

run;

data want (drop=Hold;

set have;

by DUID;

if first.DUID then do;

Hold_MomID="";

end;

Mother=0;

Father=0;

if missing(MomID) then do;

if index(Hold_MomID,DUPERSID) gt 0 then Mother=1;

end;

else Hold_MomID=catx(" ",Hold_MomID,MomID);

if index(Hold_DadID,DUPERSID) gt 0 then Father=1;

end;

run;

data mdegrees;

input (DUID PID DUPERSID MOMID DEGREE) (\$) MOTHER HASMOM;

cards;

1001 101 1001101 . 1 1 .

1001 102 1001102 . 2 . .

1001 103 1001103 . 3 1 .

1001 104 1001104 1001101 8 . 1

1001 105 1001105 1001103 8 . 1

1002 101 1002101 . 1 . .

1003 101 1003101 . 1 . .

1003 102 1003102 . 2 . .

1003 103 1003103 1003102 8 . 1

1003 104 1003104 1003102 8 . 1

;

run;

proc sort data=mdegrees (keep=dupersid degree

rename=(degree=mdegree

dupersid=MomID)

)

out=mdegrees;

by MomID mdegree;

run;

proc sort data=want;

by MomID;

run;

data new_want;

merge want (in=a) mdegrees (in=b);

by MomID;

if a;

run;

All Replies
PROC Star
Posts: 8,164

## Creating a variable based on subsequent observations

I don't know if I've captured all contingencies, but the following should at least give you a start about one way of solving such a problem:

data have;

input (DwellingID PersonID ID  MomID DadID) (\$);

cards;

1001                  101                 1001_101        N/A                      N/A

1001                  102                 1001_102        N/A                      N/A

1001                  103                 1001_103        101                      102

1001                  104                 1001_104        101                      102

;

proc sort data=have;

by DwellingID MomID;

run;

data want (drop=Hold;

set have;

by DwellingID;

if first.DwellingID then do;

Hold_MomID="";

end;

Mother=0;

Father=0;

if MomID eq "N/A" then do;

if PersonID eq Hold_MomID then Mother=1;

end;

else if not missing(MomID) then

Hold_MomID=MomID;

if DadID eq "N/A" then do;

if PersonID eq Hold_DadID then Father=1;

end;

run;

Frequent Contributor
Posts: 82

## Creating a variable based on subsequent observations

Thanks, this definitely has pointed me in the right direction.  The problem I'm running into is that in some dwellingid's, there are sometimes multiple mothers.  [Let's forget about dads for a second].  For instance, in this five-member dwelling, person 104's mom is person 101, and person 105's mom is person 102.

DwellingID          PersonID         ID                   MomID

1001                  101                 1001_101        N/A

1001                  102                 1001_102        N/A

1001                  103                 1001_103        N/A

1001                  104                 1001_104        101

1001                  105                 1001_105        102

So with your code, the data above would be sorted by dwellingID and then MomID, looking like this:

DwellingID          PersonID         ID                   MomID

1001                  104                 1001_104        101

1001                  105                 1001_105        102

1001                  101                 1001_101        N/A

1001                  102                 1001_102        N/A

1001                  103                 1001_102        N/A

During the first iteration of the datstep, the code:

if MomID eq "N/A" then do;

if PersonID eq Hold_MomID then Mother=1;

end;

...will not result in changes since Hold_MomID has not been set yet.  The next piece of the code:

else if not missing(MomID) then

Hold_MomID=MomID;

...Will set Hold_MOMID to 101.  The data step iterates, and then the statement:

if MomID eq "N/A" then do;

if PersonID eq Hold_MomID then Mother=1;

end;

...will not result in changes.  The next statement:

else if not missing(MomID) then

Hold_MomID=MomID;

...Will set Hold_MomID to 102.  And then when the data step iterates again to next observation (person 101), that person will NOT have MOTHER = 1 because Hold_Mom = 102 (insted of 101), but in the next iteration, person 102 WILL have MOTHER = 1 because Hold_Mom =  102.

Is there are a way around this that isn't so dependent on the ordering of the data?  Thanks.

Frequent Contributor
Posts: 81

## Creating a variable based on subsequent observations

*Using Data step;

proc sort data=one; by personid; run;

proc sort data=one out=momid(keep=momid rename=(momid=personid)) nodupkey;

by momid;

run;

run;

data final;

by personid;

if a;

if b then mother=1; else mother=0;

if c then father=1; else father=0;

run;

*Using Proc SQL;

proc sql;

create table two as select c.*,father from (select a.*,mother from one as a left join (select distinct momid,1 as

mother from one) as b on a.personid=b.momid) as c left join (select distinct dadid,1 as father from one) as d

quit;

PROC Star
Posts: 8,164

## Creating a variable based on subsequent observations

Again, the following still probably doesn't account for all contingencies, but it at least correctly (I think) accounts for the condition you identified:

data have;

input (DwellingID PersonID ID  MomID DadID) (\$);

cards;

1001                  101                 1001_101        N/A                      N/A

1001                  102                 1001_102        N/A                      N/A

1001                  103                 1001_103        N/A                      N/A

1001                  104                 1001_104        101                      102

1001                  105                 1001_105        103                      102

;

proc sort data=have;

by DwellingID MomID;

run;

data want (drop=Hold;

set have;

by DwellingID;

if first.DwellingID then do;

Hold_MomID="";

end;

Mother=0;

Father=0;

if MomID eq "N/A" then do;

if index(PersonID,Hold_MomID) gt 0 then Mother=1;

end;

else if not missing(MomID) then

Hold_MomID=Hold_MomID||" "||MomID;

if DadID eq "N/A" then do;

if index(PersonID,Hold_DadID) gt 0 then Father=1;

end;

run;

Frequent Contributor
Posts: 82

## Creating a variable based on subsequent observations

Hi - when I try to use the code above with the index(PersonID,Hold_MomID) statement, Mother = 1 for all of my observations.

Perhaps I made an error by trying to simplify my dataset too much for my question.  This is what the dataset actually looks like (note DUID is the "dwelling unit ID" and PID is "person ID" - you were correct in that person ID is duplicated in different families.  DUPERSID is the concatenated DUID + PID and is the unique identifier in the MEPS database).  Here, DUID = 1001 is a five-member dwelling unit with two moms (DUPERSID 1001101 is mom of DUPERSID 1001104, and DUPERSID 1001103 is mom of DUPERSID 001105).  DUID = 1002 is a one-member dwelling unit (DUPERSID 1002101 has no children).  DUID = 1003 is a four-member family, and DUPERSID 1003102 is the mom of DUPERSID 1003103 and 1003104.  MOMID has missing data for individuals whose mothers are not in the study.

DUID                  PID                 DUPERSID         MOMID

1001                  101                 1001101

1001                  102                 1001102

1001                  103                 1001103

1001                  104                 1001104             101

1001                  105                 1001105             103

1002                  101                 1002101

1003                  101                 1003101

1003                  102                 1003102

1003                  103                 1003103             102

1003                  104                 1003104             102

Thanks for your patience and help.

PROC Star
Posts: 8,164

## Re: Creating a variable based on subsequent observations

Please post the full example dataset adding DadID and the N/As for the mothers and fathers.  I can change the code to account for the different variable names, but would rather not change the logic because of variables not being there that will actually exist in the real world.  If the N/As don't really exist, that is okay, as the missing values would work just as well.  But there is no sense developing code that you'll have to change because your data has additional variables and/or factors.

Frequent Contributor
Posts: 82

## Creating a variable based on subsequent observations

OK, let me change the example dataset somewhat...here, the MOMID and DADID variables are going to reflect the unique DUPERSID identifiers of the mother and father instead of their PID.  There really isn't N/A in the dataset for missing values; they're simply missing.

1001                  101                 1001101

1001                  102                 1001102

1001                  103                 1001103

1001                  104                 1001104             1001101                 1001102

1001                  105                 1001105             1001103                 1001102

1002                  101                 1002101

1003                  101                 1003101

1003                  102                 1003102

1003                  103                 1003103             1003102                 1003101

1003                  104                 1003104             1003102                 1003101

Frequent Contributor
Posts: 82

## Creating a variable based on subsequent observations

Thanks for this solution.  I'm close to getting this approach to working, but there's one complication.  In my actual program, I had to construct the variables momid and dadid using the concatenation function CATS (probably too complicated to be worth explainining).  However, the variables momid and dadid have an identical length and informat as the personid variable in the dataset ones.

In the code above, momid and dadid are renamed personid, and then in the final data step, SAS is supposed to be merge the three datasets by personid.  The problem is that the personid variable in the dataset one is apparently different from the personid variable in the datasets momid and dadid, because the merge does not work.  So, despite the variable personid having the same name, length, and informat in all three datasets, the merge isn't working.

Any ideas?  Thanks.

PROC Star
Posts: 8,164

## Re: Creating a variable based on subsequent observations

We would need to see the data and code.  It may be a simple matter of unexpected spacing that could be corrected with, say, the strip function.  But, without seeing data and code, one can only guess and probably offer misleading advice.

E.g., my suggested code was based on the possibility that IDs might be repeated across dwellings.  That was simply a guess on my part, but one that I don't think would hurt.

Valued Guide
Posts: 765

## Re: Creating a variable based on subsequent observations

Hi ... here's another approach that works with your multiple moms. It uses a format as a lookup table

to assign the PERSONIDs as a mother or father ...

data one;

input dwellingid : \$4. personid : \$3. id : \$8. (momid dadid) (: \$3.);

datalines;

1001 101 1001_101 N/A N/A

1001 102 1001_102 N/A N/A

1001 103 1001_103 N/A N/A

1001 104 1001_104 101 102

1001 105 1001_105 103 102

;

run;

proc sql;

create table momdad (where=(start ne 'N/A')) as

select distinct momid as start, '10' as label, '\$momdad' as fmtname from one

union

select distinct dadid as start, '01' as label, '\$momdad' as fmtname from one;

quit;

run;

data two;

set one;

mother = ifc(momid eq 'N/A', char(put(personid,\$momdad.),1), '0');

run;

proc print data=two noobs;

run;

dwellingid    personid       id       momid    dadid    mother    father

1001         101       1001_101     N/A      N/A        1         0

1001         102       1001_102     N/A      N/A        0         1

1001         103       1001_103     N/A      N/A        1         0

1001         104       1001_104     101      102        0         0

1001         105       1001_105     103      102        0         0

Super User
Posts: 10,784

## Re: Creating a variable based on subsequent observations

If your dataset were not too large.

```data one;
input dwellingid : \$4. personid : \$3. id : \$8. (momid dadid) (: \$3.);
datalines;
1001 101 1001_101 N/A N/A
1001 102 1001_102 N/A N/A
1001 103 1001_103 N/A N/A
1001 104 1001_104 101 102
1001 105 1001_105 103 102
;
run;
proc sql noprint;
select distinct quote(momid) into : momid separated by ' '
from one where momid ne 'N/A';
from one where dadid ne 'N/A';
quit;

data want;
set one;
mother=ifn( personid in (&momid),1,0);
run;
```

Ksharp

Frequent Contributor
Posts: 82

## Creating a variable based on subsequent observations

Thanks for this idea, but my dataset's too large to accomodate this solution (The macro variable's maximum length was exceeded).

PROC Star
Posts: 8,164

## Creating a variable based on subsequent observations

Try the following and let us know if it does what you expect.  There were at least two problems with the code I last posted that I believe are corrected below:

data have;

infile cards truncover;

input (DUID PID DUPERSID MomID DadID) (\$);

cards;

1001  101  1001101

1001  102  1001102

1001  103  1001103

1001  104  1001104  1001101  1001102

1001  105  1001105  1001103  1001102

1002  101  1002101

1003  101  1003101

1003  102  1003102

1003  103  1003103  1003102  1003101

1003  104  1003104  1003102  1003101

;

proc sort data=have;

by DUID descending MomID;

run;

data want (drop=Hold;

set have;

by DUID;

if first.DUID then do;

Hold_MomID="";

end;

Mother=0;

Father=0;

if missing(MomID) then do;

if index(Hold_MomID,DUPERSID) gt 0 then Mother=1;

end;

else Hold_MomID=catx(" ",Hold_MomID,MomID);

if index(Hold_DadID,DUPERSID) gt 0 then Father=1;

end;

run;

Frequent Contributor
Posts: 82

## Re: Creating a variable based on subsequent observations

Hi, this code worked, except that I had to assign MOTHER and FATHER into two different datasets, then merge by DUPERSID (the code above sorts by descending momID but then tries to assign MOTHER and FATHER at the same time; it successfully assigned MOTHER but not FATHER, because the data was not sorted by descending DadID).

🔒 This topic is solved and locked.