I'm working with a dataset (from the Medical Expenditure Panel Survey, for those who are familiar) in which each observation is an individual. Families are grouped together by the variable "DwellingID", individuals within a family are assigned a "PersonID." Concatenating DwellingID and PersonID gives a unique identifier variable called "ID." The variable "MomID" gives the PersonID of the individual's mother, provided that this mother is in the dataset. Similarly, the variable "DadID" gives the PersonID of the individual's father, provided that this father is in the dataset.
So, for example, the data from this four-member family indicate that PersonID 101 is the mom of PersonID 103 and 104, while PersonID 102 is the father of PersonID 103 and 104:
DwellingID PersonID ID MomID DadID
1001 101 1001_101 N/A N/A
1001 102 1001_102 N/A N/A
1001 103 1001_103 101 102
1001 104 1001_104 101 102
What I'm trying to do is create an indicator that flags person 101 as a mother (e.g. MOTHER = 1) and person 102 as a father (e.g. FATHER = 1). I can look at the value of MomID when PersonID = 103 and deduce that PersonID 101 is the mom, but how do I get SAS to do this, so that I end up with something like this?
DwellingID PersonID ID MomID DadID MOTHER FATHER
1001 101 1001_101 N/A N/A 1 0
1001 102 1001_102 N/A N/A 0 1
1001 103 1001_103 101 102 0 0
1001 104 1001_104 101 102 0 0
Thanks!
I think that you only need a fairly simple datastep merge or sql join. In a datastep, you could probably do it with something like:
data have;
infile cards truncover;
input (DUID PID DUPERSID MomID DadID) ($);
cards;
1001 101 1001101
1001 102 1001102
1001 103 1001103
1001 104 1001104 1001101 1001102
1001 105 1001105 1001103 1001102
1002 101 1002101
1003 101 1003101
1003 102 1003102
1003 103 1003103 1003102 1003101
1003 104 1003104 1003102 1003101
;
proc sort data=have;
by DUID descending MomID;
run;
data want (drop=Hold:);
set have;
by DUID;
retain Hold_MomID Hold_DadID;
length Hold_MomID Hold_DadID $100;
if first.DUID then do;
Hold_MomID="";
Hold_DadID="";
end;
Mother=0;
Father=0;
if missing(MomID) then do;
if index(Hold_MomID,DUPERSID) gt 0 then Mother=1;
end;
else Hold_MomID=catx(" ",Hold_MomID,MomID);
if missing(DadID) then do;
if index(Hold_DadID,DUPERSID) gt 0 then Father=1;
end;
else Hold_DadID=catx(" ",Hold_DadID,DadID);
run;
data mdegrees;
input (DUID PID DUPERSID MOMID DEGREE) ($) MOTHER HASMOM;
cards;
1001 101 1001101 . 1 1 .
1001 102 1001102 . 2 . .
1001 103 1001103 . 3 1 .
1001 104 1001104 1001101 8 . 1
1001 105 1001105 1001103 8 . 1
1002 101 1002101 . 1 . .
1003 101 1003101 . 1 . .
1003 102 1003102 . 2 . .
1003 103 1003103 1003102 8 . 1
1003 104 1003104 1003102 8 . 1
;
run;
proc sort data=mdegrees (keep=dupersid degree
rename=(degree=mdegree
dupersid=MomID)
)
out=mdegrees;
by MomID mdegree;
run;
proc sort data=want;
by MomID;
run;
data new_want;
merge want (in=a) mdegrees (in=b);
by MomID;
if a;
run;
I don't know if I've captured all contingencies, but the following should at least give you a start about one way of solving such a problem:
data have;
input (DwellingID PersonID ID MomID DadID) ($);
cards;
1001 101 1001_101 N/A N/A
1001 102 1001_102 N/A N/A
1001 103 1001_103 101 102
1001 104 1001_104 101 102
;
proc sort data=have;
by DwellingID MomID;
run;
data want (drop=Hold:);
set have;
by DwellingID;
retain Hold_MomID Hold_DadID;
length Hold_MomID Hold_DadID $3;
if first.DwellingID then do;
Hold_MomID="";
Hold_DadID="";
end;
Mother=0;
Father=0;
if MomID eq "N/A" then do;
if PersonID eq Hold_MomID then Mother=1;
end;
else if not missing(MomID) then
Hold_MomID=MomID;
if DadID eq "N/A" then do;
if PersonID eq Hold_DadID then Father=1;
end;
else if not missing(DadID) then
Hold_DadID=DadID;
run;
Thanks, this definitely has pointed me in the right direction. The problem I'm running into is that in some dwellingid's, there are sometimes multiple mothers. [Let's forget about dads for a second]. For instance, in this five-member dwelling, person 104's mom is person 101, and person 105's mom is person 102.
DwellingID PersonID ID MomID
1001 101 1001_101 N/A
1001 102 1001_102 N/A
1001 103 1001_103 N/A
1001 104 1001_104 101
1001 105 1001_105 102
So with your code, the data above would be sorted by dwellingID and then MomID, looking like this:
DwellingID PersonID ID MomID
1001 104 1001_104 101
1001 105 1001_105 102
1001 101 1001_101 N/A
1001 102 1001_102 N/A
1001 103 1001_102 N/A
During the first iteration of the datstep, the code:
if MomID eq "N/A" then do;
if PersonID eq Hold_MomID then Mother=1;
end;
...will not result in changes since Hold_MomID has not been set yet. The next piece of the code:
else if not missing(MomID) then
Hold_MomID=MomID;
...Will set Hold_MOMID to 101. The data step iterates, and then the statement:
if MomID eq "N/A" then do;
if PersonID eq Hold_MomID then Mother=1;
end;
...will not result in changes. The next statement:
else if not missing(MomID) then
Hold_MomID=MomID;
...Will set Hold_MomID to 102. And then when the data step iterates again to next observation (person 101), that person will NOT have MOTHER = 1 because Hold_Mom = 102 (insted of 101), but in the next iteration, person 102 WILL have MOTHER = 1 because Hold_Mom = 102.
Is there are a way around this that isn't so dependent on the ordering of the data? Thanks.
*Using Data step;
proc sort data=one; by personid; run;
proc sort data=one out=momid(keep=momid rename=(momid=personid)) nodupkey;
by momid;
run;
proc sort data=one out=dadid(keep=dadid rename=(dadid=personid)) nodupkey;
by dadid;
run;
data final;
merge one(in=a) momid(in=b) dadid(in=c);
by personid;
if a;
if b then mother=1; else mother=0;
if c then father=1; else father=0;
run;
*Using Proc SQL;
proc sql;
create table two as select c.*,father from (select a.*,mother from one as a left join (select distinct momid,1 as
mother from one) as b on a.personid=b.momid) as c left join (select distinct dadid,1 as father from one) as d
on c.personid=d.dadid;
quit;
Again, the following still probably doesn't account for all contingencies, but it at least correctly (I think) accounts for the condition you identified:
data have;
input (DwellingID PersonID ID MomID DadID) ($);
cards;
1001 101 1001_101 N/A N/A
1001 102 1001_102 N/A N/A
1001 103 1001_103 N/A N/A
1001 104 1001_104 101 102
1001 105 1001_105 103 102
;
proc sort data=have;
by DwellingID MomID;
run;
data want (drop=Hold:);
set have;
by DwellingID;
retain Hold_MomID Hold_DadID;
length Hold_MomID Hold_DadID $3;
if first.DwellingID then do;
Hold_MomID="";
Hold_DadID="";
end;
Mother=0;
Father=0;
if MomID eq "N/A" then do;
if index(PersonID,Hold_MomID) gt 0 then Mother=1;
end;
else if not missing(MomID) then
Hold_MomID=Hold_MomID||" "||MomID;
if DadID eq "N/A" then do;
if index(PersonID,Hold_DadID) gt 0 then Father=1;
end;
else if not missing(DadID) then
Hold_DadID=Hold_DadID||" "||DadID;
run;
Hi - when I try to use the code above with the index(PersonID,Hold_MomID) statement, Mother = 1 for all of my observations.
Perhaps I made an error by trying to simplify my dataset too much for my question. This is what the dataset actually looks like (note DUID is the "dwelling unit ID" and PID is "person ID" - you were correct in that person ID is duplicated in different families. DUPERSID is the concatenated DUID + PID and is the unique identifier in the MEPS database). Here, DUID = 1001 is a five-member dwelling unit with two moms (DUPERSID 1001101 is mom of DUPERSID 1001104, and DUPERSID 1001103 is mom of DUPERSID 001105). DUID = 1002 is a one-member dwelling unit (DUPERSID 1002101 has no children). DUID = 1003 is a four-member family, and DUPERSID 1003102 is the mom of DUPERSID 1003103 and 1003104. MOMID has missing data for individuals whose mothers are not in the study.
DUID PID DUPERSID MOMID
1001 101 1001101
1001 102 1001102
1001 103 1001103
1001 104 1001104 101
1001 105 1001105 103
1002 101 1002101
1003 101 1003101
1003 102 1003102
1003 103 1003103 102
1003 104 1003104 102
Thanks for your patience and help.
Please post the full example dataset adding DadID and the N/As for the mothers and fathers. I can change the code to account for the different variable names, but would rather not change the logic because of variables not being there that will actually exist in the real world. If the N/As don't really exist, that is okay, as the missing values would work just as well. But there is no sense developing code that you'll have to change because your data has additional variables and/or factors.
OK, let me change the example dataset somewhat...here, the MOMID and DADID variables are going to reflect the unique DUPERSID identifiers of the mother and father instead of their PID. There really isn't N/A in the dataset for missing values; they're simply missing.
DUID PID DUPERSID MOMID DADID
1001 101 1001101
1001 102 1001102
1001 103 1001103
1001 104 1001104 1001101 1001102
1001 105 1001105 1001103 1001102
1002 101 1002101
1003 101 1003101
1003 102 1003102
1003 103 1003103 1003102 1003101
1003 104 1003104 1003102 1003101
Thanks for this solution. I'm close to getting this approach to working, but there's one complication. In my actual program, I had to construct the variables momid and dadid using the concatenation function CATS (probably too complicated to be worth explainining). However, the variables momid and dadid have an identical length and informat as the personid variable in the dataset ones.
In the code above, momid and dadid are renamed personid, and then in the final data step, SAS is supposed to be merge the three datasets by personid. The problem is that the personid variable in the dataset one is apparently different from the personid variable in the datasets momid and dadid, because the merge does not work. So, despite the variable personid having the same name, length, and informat in all three datasets, the merge isn't working.
Any ideas? Thanks.
We would need to see the data and code. It may be a simple matter of unexpected spacing that could be corrected with, say, the strip function. But, without seeing data and code, one can only guess and probably offer misleading advice.
E.g., my suggested code was based on the possibility that IDs might be repeated across dwellings. That was simply a guess on my part, but one that I don't think would hurt.
Hi ... here's another approach that works with your multiple moms. It uses a format as a lookup table
to assign the PERSONIDs as a mother or father ...
data one;
input dwellingid : $4. personid : $3. id : $8. (momid dadid) (: $3.);
datalines;
1001 101 1001_101 N/A N/A
1001 102 1001_102 N/A N/A
1001 103 1001_103 N/A N/A
1001 104 1001_104 101 102
1001 105 1001_105 103 102
;
run;
proc sql;
create table momdad (where=(start ne 'N/A')) as
select distinct momid as start, '10' as label, '$momdad' as fmtname from one
union
select distinct dadid as start, '01' as label, '$momdad' as fmtname from one;
quit;
proc format cntlin=momdad;
run;
data two;
set one;
mother = ifc(momid eq 'N/A', char(put(personid,$momdad.),1), '0');
father = ifc(dadid eq 'N/A', char(put(personid,$momdad.),2), '0');
run;
proc print data=two noobs;
run;
dwellingid personid id momid dadid mother father
1001 101 1001_101 N/A N/A 1 0
1001 102 1001_102 N/A N/A 0 1
1001 103 1001_103 N/A N/A 1 0
1001 104 1001_104 101 102 0 0
1001 105 1001_105 103 102 0 0
If your dataset were not too large.
data one; input dwellingid : $4. personid : $3. id : $8. (momid dadid) (: $3.); datalines; 1001 101 1001_101 N/A N/A 1001 102 1001_102 N/A N/A 1001 103 1001_103 N/A N/A 1001 104 1001_104 101 102 1001 105 1001_105 103 102 ; run; %global momid dadid; proc sql noprint; select distinct quote(momid) into : momid separated by ' ' from one where momid ne 'N/A'; select distinct quote(dadid) into : dadid separated by ' ' from one where dadid ne 'N/A'; quit; data want; set one; mother=ifn( personid in (&momid),1,0); father=ifn ( personid in (&dadid),1,0); run;
Ksharp
Thanks for this idea, but my dataset's too large to accomodate this solution (The macro variable's maximum length was exceeded).
Try the following and let us know if it does what you expect. There were at least two problems with the code I last posted that I believe are corrected below:
data have;
infile cards truncover;
input (DUID PID DUPERSID MomID DadID) ($);
cards;
1001 101 1001101
1001 102 1001102
1001 103 1001103
1001 104 1001104 1001101 1001102
1001 105 1001105 1001103 1001102
1002 101 1002101
1003 101 1003101
1003 102 1003102
1003 103 1003103 1003102 1003101
1003 104 1003104 1003102 1003101
;
proc sort data=have;
by DUID descending MomID;
run;
data want (drop=Hold:);
set have;
by DUID;
retain Hold_MomID Hold_DadID;
length Hold_MomID Hold_DadID $100;
if first.DUID then do;
Hold_MomID="";
Hold_DadID="";
end;
Mother=0;
Father=0;
if missing(MomID) then do;
if index(Hold_MomID,DUPERSID) gt 0 then Mother=1;
end;
else Hold_MomID=catx(" ",Hold_MomID,MomID);
if missing(DadID) then do;
if index(Hold_DadID,DUPERSID) gt 0 then Father=1;
end;
else Hold_DadID=catx(" ",Hold_DadID,DadID);
run;
Hi, this code worked, except that I had to assign MOTHER and FATHER into two different datasets, then merge by DUPERSID (the code above sorts by descending momID but then tries to assign MOTHER and FATHER at the same time; it successfully assigned MOTHER but not FATHER, because the data was not sorted by descending DadID).
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.