Hi SAS Experts,
I am having a project to mange a dataset with affiliated subjects, which means there is a lady who is a mother from one family (subject=1) and her daughter (subject=1.1) and her son (subject=1.2) etc.
I need to create some variables describing every mother-kid situation, such as whether they are all female, whether they all have fiber everyday, and what is the absolute difference of age.
The dataset I have is this:
family | ID | female | Has fiber every day | age | Mother | Kids |
4 | 4 | 1 | 1 | 25 | 1 | 0 |
4 | 4.1 | 0 | 1 | 29 | 0 | 1 |
4 | 4.2 | 1 | 0 | 34 | 0 | 1 |
4 | 4.3 | 1 | 1 | 45 | 0 | 1 |
5 | 5 | 1 | 1 | 42 | 1 | 0 |
5 | 5.1 | 1 | 1 | 53 | 0 | 1 |
6 | 6 | 1 | 0 | 23 | 1 | 0 |
6 | 6.1 | 0 | 0 | 29 | 0 | 1 |
6 | 6.2 | 0 | 1 | 36 | 0 | 1 |
7 | 7 | 1 | 1 | 48 | 1 | 0 |
7 | 7.1 | 1 | 0 | 23 | 0 | 1 |
8 | 8 | 1 | 0 | 53 | 1 | 0 |
8 | 8.1 | 0 | 0 | 54 | 0 | 1 |
8 | 8.2 | 1 | 1 | 37 | 0 | 1 |
8 | 8.3 | 1 | 1 | 41 | 0 | 1 |
8 | 8.4 | 0 | 1 | 26 | 0 | 1 |
I need to calculate: (I highlight the want part in yellow)
1. Both the mother and her daughter or son have fiber everyday: yes or no
2. Both they are female, which mean both mother and her daughter participated in the study: yes or no
3. Count of both have fiber by mother: for example family 4 has 2
4. Count of both are female by mother: for example family 8 has 2
5. Total count of both have fiber in the study: there are 3 mother-kid cases having fiber everyday
6. Total count of both female in the study: it is 6
7. Absolute difference of age between the mother and her daughter or son
8. Number of kids participated in the study by mother
9. Mean of absolute difference of age across all mother-kid cases
The calculation shouldn't be very difficult, but the display of the current dataset stop me from doing those calculation. Can SAS do some calculation by family number?
This is very challenging for me to use SAS.
Thank you very much for any help!
with regards,
"This is very challenging for me to use SAS"
It's not at all as difficult as you may think - it's nothing but an acquired skill ;).
data have ;
input family id female hfed age mother kids ;
cards ;
4 4 1 1 25 1 0
4 4.1 0 1 29 0 1
4 4.2 1 0 34 0 1
4 4.3 1 1 45 0 1
5 5 1 1 42 1 0
5 5.1 1 1 53 0 1
6 6 1 0 23 1 0
6 6.1 0 0 29 0 1
6 6.2 0 1 36 0 1
7 7 1 1 48 1 0
7 7.1 1 0 23 0 1
8 8 1 0 53 1 0
8 8.1 0 0 54 0 1
8 8.2 1 1 37 0 1
8 8.3 1 1 41 0 1
8 8.4 0 1 26 0 1
;
run ;
data detail (drop = cnt: tot: mean: _:)
group (keep = family cnt:)
total (keep = family tot: mean:)
;
do _n_ = 1 by 1 until (last.family) ;
set have end = z ;
by family ;
if _n_ = 1 then do ;
_m_hfed = hfed ;
_m_age = age ;
end ;
else do ;
both_hfed = _m_hfed and hfed ;
both_fem = female ;
abs_age_diff = abs (age - _m_age) ;
cnt_both_hfed + both_hfed ;
cnt_both_fem + both_fem ;
cnt_kids + kids ;
_aads + abs_age_diff ;
_n_aads + 1 ;
end ;
output detail ;
end ;
output group ;
tot_both_hfed + cnt_both_hfed ;
tot_both_fem + cnt_both_fem ;
call missing (of cnt:) ;
if z ;
mean_abs_age_diff = divide (_aads, _n_aads) ;
output total ;
run ;
data want ;
merge detail group ;
by family ;
if _n_ = 1 then set total ;
run ;
Kind regards
Paul D.
Hi Paul,
I really appreciate your prompt response a lot! I can't believe you provide me with the whole syntax for my question. It is awesome! Thank you very much again!
Right now, I just realized that I don't have family variable yet. I use the following code to create them but it looks it doesn't work, that is the family variable is not generated by using the following code, could you please help me correct it?
data want;
set have;
family=substr(id,1,1);
run;
Is there something wrong?
Thank you very much!
Kind regards,
If your ID is a numeric variable (which I suspect it is), using SUBSTR causes an implicit type conversion of ID to a character variable using the BEST12. format, right-justifying the result from which you are trying to extract the first byte using SUBSTR (ID,1,1). But because of the right justification, that byte is blank, hence the problem. Just use:
family = int (ID) ;
and you will be all right.
Kind regards
Paul D.
Thank you Paul again!
I still have some questions.
1. Both female and hfed are categorical variables, why you only created _m_hfed for fhed?
2. what does _aads mean? Does it mean the total of age absolute difference? I can use any name for it, right? It is not SAS default name, right?
3. What does _n_aads mean? Why there is +1. Is it SAS default name? So, for other continuous variables, I just need to change the variables name based on my dataset, right?
4. Why both_hfed =_m_hfed and hfed, but both_fem=female?
both_hfed = _m_hfed and hfed ;
both_fem = female ;
Thank you so much for your great instruction!
Kind regards,
Cynthia
Kind regards
Paul D.
1. Add the lines:
neither_hfed = not (_m_hfed or hfed) ;
...
cnt_neither_hfed + neither_hfed ;
...
tot_neither_hfed + cnt_neither_hfed ;
2. You can add a WHERE statement either right after the SET statement or as a HAVE data set option WHERE=. The Having statement is not available in the DATA step - it's part of the SQL syntax.
3. Make the SUM statements in question conditional. E.g., schematically:
if not missing (<some variable>) then <aggregate variable> + <expression> ;
Kind regards
Paul D.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.