Help using Base SAS procedures

Creating a variable based on subsequent observations

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 82
Accepted Solution

Creating a variable based on subsequent observations

I'm working with a dataset (from the Medical Expenditure Panel Survey, for those who are familiar) in which each observation is an individual.  Families are grouped together by the variable "DwellingID", individuals within a family are assigned a "PersonID."  Concatenating DwellingID and PersonID gives a unique identifier variable called "ID."  The variable "MomID" gives the PersonID of the individual's mother, provided that this mother is in the dataset.  Similarly, the variable "DadID" gives the PersonID of the individual's father, provided that this father is in the dataset.

So, for example, the data from this four-member family indicate that PersonID 101 is the mom of PersonID 103 and 104, while PersonID 102 is the father of PersonID 103 and 104:

DwellingID          PersonID         ID                   MomID                 DadID

1001                  101                 1001_101        N/A                      N/A

1001                  102                 1001_102        N/A                      N/A

1001                  103                 1001_103        101                      102

1001                  104                 1001_104        101                      102

What I'm trying to do is create an indicator that flags person 101 as a mother (e.g. MOTHER = 1) and person 102 as a father (e.g. FATHER = 1). I can look at the value of MomID when PersonID = 103 and deduce that PersonID 101 is the mom, but how do I get SAS to do this, so that I end up with something like this? 

DwellingID          PersonID         ID                   MomID                 DadID           MOTHER          FATHER

1001                  101                 1001_101        N/A                      N/A               1                       0

1001                  102                 1001_102        N/A                      N/A               0                       1

1001                  103                 1001_103        101                      102                0                       0

1001                  104                 1001_104        101                      102                0                       0

Thanks!


Accepted Solutions
Solution
‎08-12-2011 11:11 PM
PROC Star
Posts: 7,491

Re: Creating a variable based on subsequent observations

I think that you only need a fairly simple datastep merge or sql join.  In a datastep, you could probably do it with something like:

data have;

  infile cards truncover;

  input (DUID PID DUPERSID MomID DadID) ($);

  cards;

1001  101  1001101          

1001  102  1001102            

1001  103  1001103          

1001  104  1001104  1001101  1001102

1001  105  1001105  1001103  1001102

1002  101  1002101

1003  101  1003101

1003  102  1003102

1003  103  1003103  1003102  1003101                    

1003  104  1003104  1003102  1003101

;

proc sort data=have;

  by DUID descending MomID;

run;

data want (drop=HoldSmiley Happy;

  set have;

  by DUID;

  retain Hold_MomID Hold_DadID;

  length Hold_MomID Hold_DadID $100;

  if first.DUID then do;

    Hold_MomID="";

    Hold_DadID="";

  end;

  Mother=0;

  Father=0;

  if missing(MomID) then do;

    if index(Hold_MomID,DUPERSID) gt 0 then Mother=1;

  end;

  else Hold_MomID=catx(" ",Hold_MomID,MomID);

  if missing(DadID) then do;

    if index(Hold_DadID,DUPERSID) gt 0 then Father=1;

  end;

  else Hold_DadID=catx(" ",Hold_DadID,DadID);

run;

 

data mdegrees;

  input (DUID PID DUPERSID MOMID DEGREE) ($) MOTHER HASMOM;

  cards;

1001 101 1001101 . 1 1 .

1001 102 1001102 . 2 . .

1001 103 1001103 . 3 1 .

1001 104 1001104 1001101 8 . 1

1001 105 1001105 1001103 8 . 1

1002 101 1002101 . 1 . .

1003 101 1003101 . 1 . .

1003 102 1003102 . 2 . .

1003 103 1003103 1003102 8 . 1

1003 104 1003104 1003102 8 . 1

;

run;

 

proc sort data=mdegrees (keep=dupersid degree

                         rename=(degree=mdegree

                                 dupersid=MomID)

                         )

                         out=mdegrees;

  by MomID mdegree;

run;

 

proc sort data=want;

  by MomID;

run;

 

data new_want;

  merge want (in=a) mdegrees (in=b);

  by MomID;

  if a;

run;

View solution in original post


All Replies
PROC Star
Posts: 7,491

Creating a variable based on subsequent observations

I don't know if I've captured all contingencies, but the following should at least give you a start about one way of solving such a problem:

data have;

  input (DwellingID PersonID ID  MomID DadID) ($);

  cards;

1001                  101                 1001_101        N/A                      N/A

1001                  102                 1001_102        N/A                      N/A

1001                  103                 1001_103        101                      102

1001                  104                 1001_104        101                      102

;

proc sort data=have;

  by DwellingID MomID;

run;

data want (drop=HoldSmiley Happy;

  set have;

  by DwellingID;

  retain Hold_MomID Hold_DadID;

  length Hold_MomID Hold_DadID $3;

  if first.DwellingID then do;

    Hold_MomID="";

    Hold_DadID="";

  end;

  Mother=0;

  Father=0;

  if MomID eq "N/A" then do;

    if PersonID eq Hold_MomID then Mother=1;

  end;

  else if not missing(MomID) then

    Hold_MomID=MomID;

  if DadID eq "N/A" then do;

    if PersonID eq Hold_DadID then Father=1;

  end;

  else if not missing(DadID) then

    Hold_DadID=DadID;

run;

Frequent Contributor
Posts: 82

Creating a variable based on subsequent observations

Thanks, this definitely has pointed me in the right direction.  The problem I'm running into is that in some dwellingid's, there are sometimes multiple mothers.  [Let's forget about dads for a second].  For instance, in this five-member dwelling, person 104's mom is person 101, and person 105's mom is person 102. 

DwellingID          PersonID         ID                   MomID                 

1001                  101                 1001_101        N/A                     

1001                  102                 1001_102        N/A                      

1001                  103                 1001_103        N/A                     

1001                  104                 1001_104        101                     

1001                  105                 1001_105        102                      

So with your code, the data above would be sorted by dwellingID and then MomID, looking like this:

DwellingID          PersonID         ID                   MomID                 

1001                  104                 1001_104        101                     

1001                  105                 1001_105        102        

1001                  101                 1001_101        N/A                     

1001                  102                 1001_102        N/A                      

1001                  103                 1001_102        N/A                     

During the first iteration of the datstep, the code:

if MomID eq "N/A" then do;

    if PersonID eq Hold_MomID then Mother=1;

end;

...will not result in changes since Hold_MomID has not been set yet.  The next piece of the code:

else if not missing(MomID) then

    Hold_MomID=MomID;

...Will set Hold_MOMID to 101.  The data step iterates, and then the statement:

if MomID eq "N/A" then do;

    if PersonID eq Hold_MomID then Mother=1;

end;

...will not result in changes.  The next statement:

else if not missing(MomID) then

    Hold_MomID=MomID;

...Will set Hold_MomID to 102.  And then when the data step iterates again to next observation (person 101), that person will NOT have MOTHER = 1 because Hold_Mom = 102 (insted of 101), but in the next iteration, person 102 WILL have MOTHER = 1 because Hold_Mom =  102. 

Is there are a way around this that isn't so dependent on the ordering of the data?  Thanks.

Frequent Contributor
Posts: 81

Creating a variable based on subsequent observations

*Using Data step;

proc sort data=one; by personid; run;

proc sort data=one out=momid(keep=momid rename=(momid=personid)) nodupkey;

          by momid;

run;

proc sort data=one out=dadid(keep=dadid rename=(dadid=personid)) nodupkey;

          by dadid;

run;

data final;

          merge one(in=a) momid(in=b) dadid(in=c);

          by personid;

          if a;

          if b then mother=1; else mother=0;

          if c then father=1; else father=0;

run;

*Using Proc SQL;

proc sql;

          create table two as select c.*,father from (select a.*,mother from one as a left join (select distinct momid,1 as

          mother from one) as b on a.personid=b.momid) as c left join (select distinct dadid,1 as father from one) as d

          on c.personid=d.dadid;

quit;

PROC Star
Posts: 7,491

Creating a variable based on subsequent observations

Again, the following still probably doesn't account for all contingencies, but it at least correctly (I think) accounts for the condition you identified:

data have;

  input (DwellingID PersonID ID  MomID DadID) ($);

  cards;

1001                  101                 1001_101        N/A                      N/A

1001                  102                 1001_102        N/A                      N/A

1001                  103                 1001_103        N/A                      N/A

1001                  104                 1001_104        101                      102

1001                  105                 1001_105        103                      102

;

proc sort data=have;

  by DwellingID MomID;

run;

data want (drop=HoldSmiley Happy;

  set have;

  by DwellingID;

  retain Hold_MomID Hold_DadID;

  length Hold_MomID Hold_DadID $3;

  if first.DwellingID then do;

    Hold_MomID="";

    Hold_DadID="";

  end;

  Mother=0;

  Father=0;

  if MomID eq "N/A" then do;

    if index(PersonID,Hold_MomID) gt 0 then Mother=1;

  end;

  else if not missing(MomID) then

    Hold_MomID=Hold_MomID||" "||MomID;

  if DadID eq "N/A" then do;

    if index(PersonID,Hold_DadID) gt 0 then Father=1;

  end;

  else if not missing(DadID) then

    Hold_DadID=Hold_DadID||" "||DadID;

run;

Frequent Contributor
Posts: 82

Creating a variable based on subsequent observations

Hi - when I try to use the code above with the index(PersonID,Hold_MomID) statement, Mother = 1 for all of my observations. 

Perhaps I made an error by trying to simplify my dataset too much for my question.  This is what the dataset actually looks like (note DUID is the "dwelling unit ID" and PID is "person ID" - you were correct in that person ID is duplicated in different families.  DUPERSID is the concatenated DUID + PID and is the unique identifier in the MEPS database).  Here, DUID = 1001 is a five-member dwelling unit with two moms (DUPERSID 1001101 is mom of DUPERSID 1001104, and DUPERSID 1001103 is mom of DUPERSID 001105).  DUID = 1002 is a one-member dwelling unit (DUPERSID 1002101 has no children).  DUID = 1003 is a four-member family, and DUPERSID 1003102 is the mom of DUPERSID 1003103 and 1003104.  MOMID has missing data for individuals whose mothers are not in the study.  

DUID                  PID                 DUPERSID         MOMID               

1001                  101                 1001101                                     

1001                  102                 1001102                                       

1001                  103                 1001103                                     

1001                  104                 1001104             101                    

1001                  105                 1001105             103                     

1002                  101                 1002101

1003                  101                 1003101            

1003                  102                 1003102

1003                  103                 1003103             102                      

1003                  104                 1003104             102                      

Thanks for your patience and help. 

PROC Star
Posts: 7,491

Re: Creating a variable based on subsequent observations

Please post the full example dataset adding DadID and the N/As for the mothers and fathers.  I can change the code to account for the different variable names, but would rather not change the logic because of variables not being there that will actually exist in the real world.  If the N/As don't really exist, that is okay, as the missing values would work just as well.  But there is no sense developing code that you'll have to change because your data has additional variables and/or factors.

Frequent Contributor
Posts: 82

Creating a variable based on subsequent observations

OK, let me change the example dataset somewhat...here, the MOMID and DADID variables are going to reflect the unique DUPERSID identifiers of the mother and father instead of their PID.  There really isn't N/A in the dataset for missing values; they're simply missing.

DUID                  PID                 DUPERSID         MOMID                 DADID              

1001                  101                 1001101                                     

1001                  102                 1001102                                       

1001                  103                 1001103                                     

1001                  104                 1001104             1001101                 1001102

1001                  105                 1001105             1001103                 1001102

1002                  101                 1002101

1003                  101                 1003101            

1003                  102                 1003102

1003                  103                 1003103             1003102                 1003101                     

1003                  104                 1003104             1003102                 1003101

Frequent Contributor
Posts: 82

Creating a variable based on subsequent observations

Thanks for this solution.  I'm close to getting this approach to working, but there's one complication.  In my actual program, I had to construct the variables momid and dadid using the concatenation function CATS (probably too complicated to be worth explainining).  However, the variables momid and dadid have an identical length and informat as the personid variable in the dataset ones.

In the code above, momid and dadid are renamed personid, and then in the final data step, SAS is supposed to be merge the three datasets by personid.  The problem is that the personid variable in the dataset one is apparently different from the personid variable in the datasets momid and dadid, because the merge does not work.  So, despite the variable personid having the same name, length, and informat in all three datasets, the merge isn't working. 

Any ideas?  Thanks.

PROC Star
Posts: 7,491

Re: Creating a variable based on subsequent observations

We would need to see the data and code.  It may be a simple matter of unexpected spacing that could be corrected with, say, the strip function.  But, without seeing data and code, one can only guess and probably offer misleading advice.

E.g., my suggested code was based on the possibility that IDs might be repeated across dwellings.  That was simply a guess on my part, but one that I don't think would hurt.

Valued Guide
Posts: 765

Re: Creating a variable based on subsequent observations

Hi ... here's another approach that works with your multiple moms. It uses a format as a lookup table

to assign the PERSONIDs as a mother or father ...

data one;

input dwellingid : $4. personid : $3. id : $8. (momid dadid) (: $3.);

datalines;

1001 101 1001_101 N/A N/A

1001 102 1001_102 N/A N/A

1001 103 1001_103 N/A N/A

1001 104 1001_104 101 102

1001 105 1001_105 103 102

;

run;

proc sql;

create table momdad (where=(start ne 'N/A')) as

select distinct momid as start, '10' as label, '$momdad' as fmtname from one

union

select distinct dadid as start, '01' as label, '$momdad' as fmtname from one;

quit;

proc format cntlin=momdad;

run;

data two;

set one;

mother = ifc(momid eq 'N/A', char(put(personid,$momdad.),1), '0');

father = ifc(dadid eq 'N/A', char(put(personid,$momdad.),2), '0');

run;

proc print data=two noobs;

run;


dwellingid    personid       id       momid    dadid    mother    father

   1001         101       1001_101     N/A      N/A        1         0

   1001         102       1001_102     N/A      N/A        0         1

   1001         103       1001_103     N/A      N/A        1         0

   1001         104       1001_104     101      102        0         0

   1001         105       1001_105     103      102        0         0

Super User
Posts: 10,044

Re: Creating a variable based on subsequent observations

If your dataset were not too large.

data one;
input dwellingid : $4. personid : $3. id : $8. (momid dadid) (: $3.);
datalines;
1001 101 1001_101 N/A N/A
1001 102 1001_102 N/A N/A
1001 103 1001_103 N/A N/A
1001 104 1001_104 101 102
1001 105 1001_105 103 102
;
run;
%global momid dadid;
proc sql noprint;
 select distinct quote(momid) into : momid separated by ' '
  from one where momid ne 'N/A';
 select distinct quote(dadid) into : dadid separated by ' '
  from one where dadid ne 'N/A';
quit;

data want;
 set one;
 mother=ifn( personid in (&momid),1,0);
 father=ifn ( personid in (&dadid),1,0);
run;

Ksharp

Frequent Contributor
Posts: 82

Creating a variable based on subsequent observations

Thanks for this idea, but my dataset's too large to accomodate this solution (The macro variable's maximum length was exceeded).

PROC Star
Posts: 7,491

Creating a variable based on subsequent observations

Try the following and let us know if it does what you expect.  There were at least two problems with the code I last posted that I believe are corrected below:

data have;

  infile cards truncover;

  input (DUID PID DUPERSID MomID DadID) ($);

  cards;

1001  101  1001101          

1001  102  1001102            

1001  103  1001103          

1001  104  1001104  1001101  1001102

1001  105  1001105  1001103  1001102

1002  101  1002101

1003  101  1003101

1003  102  1003102

1003  103  1003103  1003102  1003101                    

1003  104  1003104  1003102  1003101

;

proc sort data=have;

  by DUID descending MomID;

run;

data want (drop=HoldSmiley Happy;

  set have;

  by DUID;

  retain Hold_MomID Hold_DadID;

  length Hold_MomID Hold_DadID $100;

  if first.DUID then do;

    Hold_MomID="";

    Hold_DadID="";

  end;

  Mother=0;

  Father=0;

  if missing(MomID) then do;

    if index(Hold_MomID,DUPERSID) gt 0 then Mother=1;

  end;

  else Hold_MomID=catx(" ",Hold_MomID,MomID);

  if missing(DadID) then do;

    if index(Hold_DadID,DUPERSID) gt 0 then Father=1;

  end;

  else Hold_DadID=catx(" ",Hold_DadID,DadID);

run;

Frequent Contributor
Posts: 82

Re: Creating a variable based on subsequent observations

Hi, this code worked, except that I had to assign MOTHER and FATHER into two different datasets, then merge by DUPERSID (the code above sorts by descending momID but then tries to assign MOTHER and FATHER at the same time; it successfully assigned MOTHER but not FATHER, because the data was not sorted by descending DadID).

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 20 replies
  • 265 views
  • 3 likes
  • 6 in conversation