Hello -,
I have this data:
birthdate Age
6/22/99 18.9
7/6/99 18.9
. .
. .
. .
9/15/05 12.6
run;
This are my question.
I want to create data sets where the different in age from the youngest and oldest is not more that 3.5 years using the birthdate. You can use the first birthday as first date
Thank you,
DA
Here is a way to find the largest set of birthdates with a span <= 6 months
data have;
input BirthDate :mmddyy. Age;
id = _n_; /* You must have some kind of person id */
format birthdate yymmdd10.;
datalines;
6/22/99 18.9
7/6/99 18.8
6/30/99 18.8
7/8/99 18.8
7/8/99 18.8
6/21/99 18.9
7/15/99 18.8
7/12/99 18.8
4/25/99 19.0
8/6/99 18.7
7/9/99 18.8
8/20/99 18.7
7/22/99 18.8
3/2/99 19.2
7/11/99 18.8
8/2/99 18.7
7/9/99 18.8
8/22/99 18.7
6/16/99 18.9
7/29/99 18.8
8/17/99 18.7
8/17/99 18.7
1/27/99 19.3
5/29/99 18.9
7/16/99 18.8
8/23/99 18.7
9/2/99 18.7
2/10/99 19.2
8/7/99 18.7
1/19/99 19.3
9/24/99 18.6
7/10/99 18.8
5/31/99 18.9
9/25/99 18.6
6/3/99 18.9
5/20/99 19.0
6/26/99 18.9
8/28/99 18.7
9/27/99 18.6
9/9/99 18.6
7/18/99 18.8
8/6/99 18.7
10/17/99 18.5
4/22/99 19.0
1/29/99 19.3
10/15/99 18.5
10/18/99 18.5
7/15/99 18.8
10/21/99 18.5
7/29/99 18.8
8/20/99 18.7
8/20/99 18.7
5/25/99 18.9
9/15/99 18.6
8/14/99 18.7
9/28/99 18.6
10/28/99 18.5
5/9/99 19.0
10/24/99 18.5
8/11/99 18.7
5/4/99 19.0
10/2/99 18.6
11/26/99 18.4
11/9/99 18.5
4/17/99 19.0
;
proc sql;
create table groups as
select
a.id,
a.birthdate as startDate,
intnx("month",a.birthdate,6,"same") as endDate,
count(*) as grpSize
from
have as a inner join
have as b
on b.birthdate between a.birthdate and intnx("month",a.birthdate,6,"same")
group by a.id, startDate, endDate
order by grpSize desc;
quit;
data want;
if _n_ = 1 then set groups;
set have;
if startDate <= birthdate <= endDate;
keep id birthdate age;
run;
I don't understand your second request.
I don't get this. What if the birth dates are spread out so that there are like 20 years from the first to the last one? Then two data sets will not cut it if there are not supposed to be more than 3,5 years difference between them?
Please be more specific and post some sample data and what you want the desired outcome to look like.
Below is the birthdate and age . I want to create a data such that the difference in birthday between the youngest and oldest person in that data should not be more than 3.5 years. This is just a partial data, the entire data has age difference between the youngest and oldest at 18 years
Birth Date | Age |
6/22/99 | 18.9 |
7/6/99 | 18.8 |
6/30/99 | 18.8 |
7/8/99 | 18.8 |
7/8/99 | 18.8 |
6/21/99 | 18.9 |
7/15/99 | 18.8 |
7/12/99 | 18.8 |
4/25/99 | 19.0 |
8/6/99 | 18.7 |
7/9/99 | 18.8 |
8/20/99 | 18.7 |
7/22/99 | 18.8 |
3/2/99 | 19.2 |
7/11/99 | 18.8 |
8/2/99 | 18.7 |
7/9/99 | 18.8 |
8/22/99 | 18.7 |
6/16/99 | 18.9 |
7/29/99 | 18.8 |
8/17/99 | 18.7 |
8/17/99 | 18.7 |
1/27/99 | 19.3 |
5/29/99 | 18.9 |
7/16/99 | 18.8 |
8/23/99 | 18.7 |
9/2/99 | 18.7 |
2/10/99 | 19.2 |
8/7/99 | 18.7 |
1/19/99 | 19.3 |
9/24/99 | 18.6 |
7/10/99 | 18.8 |
5/31/99 | 18.9 |
9/25/99 | 18.6 |
6/3/99 | 18.9 |
5/20/99 | 19.0 |
6/26/99 | 18.9 |
8/28/99 | 18.7 |
9/27/99 | 18.6 |
9/9/99 | 18.6 |
7/18/99 | 18.8 |
8/6/99 | 18.7 |
10/17/99 | 18.5 |
4/22/99 | 19.0 |
1/29/99 | 19.3 |
10/15/99 | 18.5 |
10/18/99 | 18.5 |
7/15/99 | 18.8 |
10/21/99 | 18.5 |
7/29/99 | 18.8 |
8/20/99 | 18.7 |
8/20/99 | 18.7 |
5/25/99 | 18.9 |
9/15/99 | 18.6 |
8/14/99 | 18.7 |
9/28/99 | 18.6 |
10/28/99 | 18.5 |
5/9/99 | 19.0 |
10/24/99 | 18.5 |
8/11/99 | 18.7 |
5/4/99 | 19.0 |
10/2/99 | 18.6 |
11/26/99 | 18.4 |
11/9/99 | 18.5 |
4/17/99 | 19.0 |
What is the actual smallest value of age in your data? The largest?
Are all of you values rounded to one decimal place?
Generally almost any process splitting data on values of a single variable are likely to be more difficult.
Suppose your data has the minimum age of 13.4 this time. Then the "first" data set, which you have not given any naming rule for would contain ages of 13.4 to 16.9.
Now suppose you are repeating this process with another data set later where the youngest age is 9.8. Then the "first" data set would have ages 9.8 to 13.3. Which has nothing in common with the first data set from the first process. So every bit of logic would have to change to arrange results differently, select the "correct" data set for any specific task.
What is the actual purpose behind splitting the data? Why is 3.5 critical?
Let make it easy, let us not split that data. Let just create a data set where the difference between the oldest person and the youngest person is not more than 3.5 years
Here is a way to find the largest set of birthdates with a span <= 6 months
data have;
input BirthDate :mmddyy. Age;
id = _n_; /* You must have some kind of person id */
format birthdate yymmdd10.;
datalines;
6/22/99 18.9
7/6/99 18.8
6/30/99 18.8
7/8/99 18.8
7/8/99 18.8
6/21/99 18.9
7/15/99 18.8
7/12/99 18.8
4/25/99 19.0
8/6/99 18.7
7/9/99 18.8
8/20/99 18.7
7/22/99 18.8
3/2/99 19.2
7/11/99 18.8
8/2/99 18.7
7/9/99 18.8
8/22/99 18.7
6/16/99 18.9
7/29/99 18.8
8/17/99 18.7
8/17/99 18.7
1/27/99 19.3
5/29/99 18.9
7/16/99 18.8
8/23/99 18.7
9/2/99 18.7
2/10/99 19.2
8/7/99 18.7
1/19/99 19.3
9/24/99 18.6
7/10/99 18.8
5/31/99 18.9
9/25/99 18.6
6/3/99 18.9
5/20/99 19.0
6/26/99 18.9
8/28/99 18.7
9/27/99 18.6
9/9/99 18.6
7/18/99 18.8
8/6/99 18.7
10/17/99 18.5
4/22/99 19.0
1/29/99 19.3
10/15/99 18.5
10/18/99 18.5
7/15/99 18.8
10/21/99 18.5
7/29/99 18.8
8/20/99 18.7
8/20/99 18.7
5/25/99 18.9
9/15/99 18.6
8/14/99 18.7
9/28/99 18.6
10/28/99 18.5
5/9/99 19.0
10/24/99 18.5
8/11/99 18.7
5/4/99 19.0
10/2/99 18.6
11/26/99 18.4
11/9/99 18.5
4/17/99 19.0
;
proc sql;
create table groups as
select
a.id,
a.birthdate as startDate,
intnx("month",a.birthdate,6,"same") as endDate,
count(*) as grpSize
from
have as a inner join
have as b
on b.birthdate between a.birthdate and intnx("month",a.birthdate,6,"same")
group by a.id, startDate, endDate
order by grpSize desc;
quit;
data want;
if _n_ = 1 then set groups;
set have;
if startDate <= birthdate <= endDate;
keep id birthdate age;
run;
I don't understand your second request.
Here is a faster way to do the same, without SQL:
proc sort data=have out=sortedDates(keep=birthDate); by birthdate; run;
data grps;
set sortedDates(rename=birthDate=startDate);
endDate = intnx("month",startDate,6,"same");
do while(birthDate <= endDate and not done);
set sortedDates end=done;
last + 1;
end;
grpSize = last - _n_ + 1;
output;
if done then stop;
keep startDate endDate grpSize;
run;
proc means data=grps noprint;
output out=maxGrp maxid(grpSize(startDate endDate))=startDate endDate;
run;
data want;
if _n_ = 1 then set maxGrp;
set have;
if startDate <= birthdate <= endDate;
keep id birthdate age;
run;
I try to use the faster approach with sql but had issue;
proc sort data=DFW out=sortedDates; by birthdate; run;
data grps;
set sortedDates(rename=birthDate=startDate);
endDate = intnx("month",startDate,42,"same");
do while(birthDate <= endDate and not done);
set sortedDates end=done;
last + 1;
end;
grpSize = last - _n_ + 1;
output;
if done then stop;
*keep startDate endDate grpSize;
run;
data grps (rename=(PtMRN=id));
run;
proc means data=grps noprint;
output out=maxGrp maxid(grpSize(startDate endDate))=startDate endDate;
run;
data want;
if _n_ = 1 then set maxGrp;
set have;
if startDate <= birthdate <= endDate;
*keep PtMRN birthdate age;
run;
129 data grps (rename=(PtMRN=id));
130 run;
WARNING: The variable PtMRN in the DROP, KEEP, or RENAME list has never been referenced.
NOTE: The data set WORK.GRPS has 1 observations and 0 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
131
132 proc means data=grps noprint;
133 output out=maxGrp maxid(grpSize(startDate endDate))=startDate endDate;
ERROR: Variable GRPSIZE not found.
ERROR: Variable STARTDATE not found.
ERROR: Variable ENDDATE not found.
134 run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MAXGRP may be incomplete. When this step was stopped there were 0
observations and 0 variables.
WARNING: Data set WORK.MAXGRP was not replaced because this step was stopped.
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
You are not missing by much. Try it this way:
proc sort data=DFW out=sortedDates(keep=birthDate); by birthdate; run;
data grps;
set sortedDates(rename=birthDate=startDate);
endDate = intnx("month",startDate,42,"same");
do while(birthDate <= endDate and not done);
set sortedDates end=done;
last + 1;
end;
grpSize = last - _n_ + 1;
output;
if done then stop;
keep startDate endDate grpSize;
run;
proc means data=grps noprint;
output out=maxGrp maxid(grpSize(startDate endDate))=startDate endDate;
run;
data want;
if _n_ = 1 then set maxGrp;
set DFW;
if startDate <= birthdate <= endDate;
keep PtMRN birthdate age;
run;
Thank you, works well!!!!
So if I want the largest set of birthday with 3.5 years, i will replace 6 by 42 in this statements:
intnx("month",a.birthdate,6,"same") as endDate
and
on b.birthdate between a.birthdate and intnx("month",a.birthdate,6,"same")
????
Yes. I used 6 months as an example, since your sample data didn't cover 3.5 years.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.