Solved: Re: Working with SAS date to create different data

desireatem · Posted 05-08-2018 12:35 PM

Hello -,

I have this data:

birthdate Age

6/22/99 18.9

7/6/99 18.9

. .

9/15/05 12.6

run;

This are my question.

I want to create data sets where the different in age from the youngest and oldest is not more that 3.5 years using the birthdate. You can use the first birthday as first date

Thank you,

DA

PGStats · Posted 05-08-2018 03:42 PM

Here is a way to find the largest set of birthdates with a span <= 6 months

data have;
input BirthDate :mmddyy. Age;
id = _n_; /* You must have some kind of person id */
format birthdate yymmdd10.;
datalines; 
 6/22/99 18.9 
 7/6/99 18.8 
 6/30/99 18.8 
 7/8/99 18.8 
 7/8/99 18.8 
 6/21/99 18.9 
 7/15/99 18.8 
 7/12/99 18.8 
 4/25/99 19.0 
 8/6/99 18.7 
 7/9/99 18.8 
 8/20/99 18.7 
 7/22/99 18.8 
 3/2/99 19.2 
 7/11/99 18.8 
 8/2/99 18.7 
 7/9/99 18.8 
 8/22/99 18.7 
 6/16/99 18.9 
 7/29/99 18.8 
 8/17/99 18.7 
 8/17/99 18.7 
 1/27/99 19.3 
 5/29/99 18.9 
 7/16/99 18.8 
 8/23/99 18.7 
 9/2/99 18.7 
 2/10/99 19.2 
 8/7/99 18.7 
 1/19/99 19.3 
 9/24/99 18.6 
 7/10/99 18.8 
 5/31/99 18.9 
 9/25/99 18.6 
 6/3/99 18.9 
 5/20/99 19.0 
 6/26/99 18.9 
 8/28/99 18.7 
 9/27/99 18.6 
 9/9/99 18.6 
 7/18/99 18.8 
 8/6/99 18.7 
 10/17/99 18.5 
 4/22/99 19.0 
 1/29/99 19.3 
 10/15/99 18.5 
 10/18/99 18.5 
 7/15/99 18.8 
 10/21/99 18.5 
 7/29/99 18.8 
 8/20/99 18.7 
 8/20/99 18.7 
 5/25/99 18.9 
 9/15/99 18.6 
 8/14/99 18.7 
 9/28/99 18.6 
 10/28/99 18.5 
 5/9/99 19.0 
 10/24/99 18.5 
 8/11/99 18.7 
 5/4/99 19.0 
 10/2/99 18.6 
 11/26/99 18.4 
 11/9/99 18.5 
 4/17/99 19.0 
;

proc sql;
create table groups as
select
    a.id, 
    a.birthdate as startDate, 
    intnx("month",a.birthdate,6,"same") as endDate,
    count(*) as grpSize
from
    have as a inner join
    have as b 
        on b.birthdate between a.birthdate and intnx("month",a.birthdate,6,"same")
group by a.id, startDate, endDate
order by grpSize desc;
quit;

data want;
if _n_ = 1 then set groups;
set have;
if startDate <= birthdate <= endDate;
keep id birthdate age;
run;

I don't understand your second request.

PG

View solution in original post

PeterClemmensen · Posted 05-08-2018 01:05 PM

I don't get this. What if the birth dates are spread out so that there are like 20 years from the first to the last one? Then two data sets will not cut it if there are not supposed to be more than 3,5 years difference between them?

Please be more specific and post some sample data and what you want the desired outcome to look like.

The DATA to DATA Step Macro
Blog: SASnrd

desireatem · Posted 05-08-2018 02:18 PM

Below is the birthdate and age . I want to create a data such that the difference in birthday between the youngest and oldest person in that data should not be more than 3.5 years. This is just a partial data, the entire data has age difference between the youngest and oldest at 18 years

Birth Date	Age
6/22/99	18.9
7/6/99	18.8
6/30/99	18.8
7/8/99	18.8
7/8/99	18.8
6/21/99	18.9
7/15/99	18.8
7/12/99	18.8
4/25/99	19.0
8/6/99	18.7
7/9/99	18.8
8/20/99	18.7
7/22/99	18.8
3/2/99	19.2
7/11/99	18.8
8/2/99	18.7
7/9/99	18.8
8/22/99	18.7
6/16/99	18.9
7/29/99	18.8
8/17/99	18.7
8/17/99	18.7
1/27/99	19.3
5/29/99	18.9
7/16/99	18.8
8/23/99	18.7
9/2/99	18.7
2/10/99	19.2
8/7/99	18.7
1/19/99	19.3
9/24/99	18.6
7/10/99	18.8
5/31/99	18.9
9/25/99	18.6
6/3/99	18.9
5/20/99	19.0
6/26/99	18.9
8/28/99	18.7
9/27/99	18.6
9/9/99	18.6
7/18/99	18.8
8/6/99	18.7
10/17/99	18.5
4/22/99	19.0
1/29/99	19.3
10/15/99	18.5
10/18/99	18.5
7/15/99	18.8
10/21/99	18.5
7/29/99	18.8
8/20/99	18.7
8/20/99	18.7
5/25/99	18.9
9/15/99	18.6
8/14/99	18.7
9/28/99	18.6
10/28/99	18.5
5/9/99	19.0
10/24/99	18.5
8/11/99	18.7
5/4/99	19.0
10/2/99	18.6
11/26/99	18.4
11/9/99	18.5
4/17/99	19.0

ballardw · Posted 05-08-2018 02:51 PM

What is the actual smallest value of age in your data? The largest?

Are all of you values rounded to one decimal place?

Generally almost any process splitting data on values of a single variable are likely to be more difficult.

Suppose your data has the minimum age of 13.4 this time. Then the "first" data set, which you have not given any naming rule for would contain ages of 13.4 to 16.9.

Now suppose you are repeating this process with another data set later where the youngest age is 9.8. Then the "first" data set would have ages 9.8 to 13.3. Which has nothing in common with the first data set from the first process. So every bit of logic would have to change to arrange results differently, select the "correct" data set for any specific task.

What is the actual purpose behind splitting the data? Why is 3.5 critical?

desireatem · Posted 05-08-2018 03:14 PM

Let make it easy, let us not split that data. Let just create a data set where the difference between the oldest person and the youngest person is not more than 3.5 years

PGStats · Posted 05-08-2018 03:42 PM

Here is a way to find the largest set of birthdates with a span <= 6 months

data have;
input BirthDate :mmddyy. Age;
id = _n_; /* You must have some kind of person id */
format birthdate yymmdd10.;
datalines; 
 6/22/99 18.9 
 7/6/99 18.8 
 6/30/99 18.8 
 7/8/99 18.8 
 7/8/99 18.8 
 6/21/99 18.9 
 7/15/99 18.8 
 7/12/99 18.8 
 4/25/99 19.0 
 8/6/99 18.7 
 7/9/99 18.8 
 8/20/99 18.7 
 7/22/99 18.8 
 3/2/99 19.2 
 7/11/99 18.8 
 8/2/99 18.7 
 7/9/99 18.8 
 8/22/99 18.7 
 6/16/99 18.9 
 7/29/99 18.8 
 8/17/99 18.7 
 8/17/99 18.7 
 1/27/99 19.3 
 5/29/99 18.9 
 7/16/99 18.8 
 8/23/99 18.7 
 9/2/99 18.7 
 2/10/99 19.2 
 8/7/99 18.7 
 1/19/99 19.3 
 9/24/99 18.6 
 7/10/99 18.8 
 5/31/99 18.9 
 9/25/99 18.6 
 6/3/99 18.9 
 5/20/99 19.0 
 6/26/99 18.9 
 8/28/99 18.7 
 9/27/99 18.6 
 9/9/99 18.6 
 7/18/99 18.8 
 8/6/99 18.7 
 10/17/99 18.5 
 4/22/99 19.0 
 1/29/99 19.3 
 10/15/99 18.5 
 10/18/99 18.5 
 7/15/99 18.8 
 10/21/99 18.5 
 7/29/99 18.8 
 8/20/99 18.7 
 8/20/99 18.7 
 5/25/99 18.9 
 9/15/99 18.6 
 8/14/99 18.7 
 9/28/99 18.6 
 10/28/99 18.5 
 5/9/99 19.0 
 10/24/99 18.5 
 8/11/99 18.7 
 5/4/99 19.0 
 10/2/99 18.6 
 11/26/99 18.4 
 11/9/99 18.5 
 4/17/99 19.0 
;

proc sql;
create table groups as
select
    a.id, 
    a.birthdate as startDate, 
    intnx("month",a.birthdate,6,"same") as endDate,
    count(*) as grpSize
from
    have as a inner join
    have as b 
        on b.birthdate between a.birthdate and intnx("month",a.birthdate,6,"same")
group by a.id, startDate, endDate
order by grpSize desc;
quit;

data want;
if _n_ = 1 then set groups;
set have;
if startDate <= birthdate <= endDate;
keep id birthdate age;
run;

I don't understand your second request.

PG

PGStats · Posted 05-08-2018 04:52 PM

Here is a faster way to do the same, without SQL:

proc sort data=have out=sortedDates(keep=birthDate); by birthdate; run;

data grps;
set sortedDates(rename=birthDate=startDate);
endDate = intnx("month",startDate,6,"same");
do while(birthDate <= endDate and not done);
    set sortedDates end=done;
    last + 1;
    end;
grpSize = last - _n_ + 1;
output;
if done then stop;
keep startDate endDate grpSize;
run;

proc means data=grps noprint;
output out=maxGrp maxid(grpSize(startDate endDate))=startDate endDate;
run;

data want;
if _n_ = 1 then set maxGrp;
set have;
if startDate <= birthdate <= endDate;
keep id birthdate age;
run;

PG

desireatem · Posted 05-08-2018 10:20 PM

I try to use the faster approach with sql but had issue;

proc sort data=DFW out=sortedDates; by birthdate; run;

data grps;
set sortedDates(rename=birthDate=startDate);
endDate = intnx("month",startDate,42,"same");
do while(birthDate <= endDate and not done);
set sortedDates end=done;
last + 1;
end;
grpSize = last - _n_ + 1;
output;
if done then stop;
*keep startDate endDate grpSize;
run;
data grps (rename=(PtMRN=id));
run;

proc means data=grps noprint;
output out=maxGrp maxid(grpSize(startDate endDate))=startDate endDate;
run;

data want;
if _n_ = 1 then set maxGrp;
set have;
if startDate <= birthdate <= endDate;
*keep PtMRN birthdate age;
run;

129 data grps (rename=(PtMRN=id));
130 run;

WARNING: The variable PtMRN in the DROP, KEEP, or RENAME list has never been referenced.
NOTE: The data set WORK.GRPS has 1 observations and 0 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

131
132 proc means data=grps noprint;
133 output out=maxGrp maxid(grpSize(startDate endDate))=startDate endDate;
ERROR: Variable GRPSIZE not found.
ERROR: Variable STARTDATE not found.
ERROR: Variable ENDDATE not found.
134 run;

NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.MAXGRP may be incomplete. When this step was stopped there were 0
observations and 0 variables.
WARNING: Data set WORK.MAXGRP was not replaced because this step was stopped.
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

PGStats · Posted 05-09-2018 12:32 AM

You are not missing by much. Try it this way:

proc sort data=DFW out=sortedDates(keep=birthDate); by birthdate; run;

data grps;
set sortedDates(rename=birthDate=startDate);
endDate = intnx("month",startDate,42,"same");
do while(birthDate <= endDate and not done);
 set sortedDates end=done;
 last + 1;
 end;
grpSize = last - _n_ + 1;
output;
if done then stop;
keep startDate endDate grpSize;
run;

proc means data=grps noprint;
output out=maxGrp maxid(grpSize(startDate endDate))=startDate endDate;
run;

data want;
if _n_ = 1 then set maxGrp;
set DFW;
if startDate <= birthdate <= endDate;
keep PtMRN birthdate age;
run;

PG

desireatem · Posted 05-09-2018 12:54 AM

Thank you, works well!!!!

desireatem · Posted 05-08-2018 05:01 PM

So if I want the largest set of birthday with 3.5 years, i will replace 6 by 42 in this statements:

intnx("month",a.birthdate,6,"same") as endDate

and

on b.birthdate between a.birthdate and intnx("month",a.birthdate,6,"same")

????

PGStats · Posted 05-08-2018 05:05 PM

Yes. I used 6 months as an example, since your sample data didn't cover 3.5 years.

PG

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away