Hi,
I have two datasets. Dataset 1 is at the person level and has ID, a date variable, and a series of variables named for year/month combinations. A value of 0 for a year/month variable means that the person was absent in that year/month; a value of 1 means that they were present:
ID Date1 Jan2010 Feb2010....etc through Mar2015
1 1/4/12 0 1 0
2 5/9/12 1 0 0
3 4/12/13 1 1 1
Dataset 2 is at the person/date level; that is, each row is one person on one date. Each row also has a number associated with it (number of events):
ID Date2 Num_events
1 2/3/2014 5
1 10/12/2014 1
1 11/5/2014 2
2 1/1/2012 2
2 5/7/2013 3
I have joined the datasets on ID, and I need two different things:
1. I need every person's total num_events by month/year combo--however, this needs to be cross-referenced with the year/month variables, because if they have a value of 1 for that year/month but no events, they should have a value of 0 rather than missing.
For example, ID 1 was present in Feb 2010 because they have Feb2010=1. In Dataset 2, they have no events in Feb 2010 (because they have no record for Feb 2010). Therefore, ID 1's total events for Feb 2010 should=0. However, they were absent in Jan 2010 (Jan2010 var=0), so their event total for Jan 2010 will be null, not 0.
2. Even more complicated--ultimately what I want is to have everyone's event total relative to their own Date1 (the date variable in Dataset 1). That is, the event total for ID 1 in Dec 2011 would be -1 month from their Date1 var, but for ID 2 it would be -5 months from their Date1 var.
Basically, I can't figure out how to reconcile all of these elements. I was able to create a dataset of total events by relative month by calculating a new variable which counted the number of months between Date1 and Date2 (mth_diff), then doing
proc sql;
create table mthly_sums as
select ID, mth_diff, sum(num_events) as tot
from have
group by ID, mth_diff;
quit;
But that doesn't incorporate the mth/year vars and the fact that people who were present but had no events should be 0 and not missing.
This gives the values you are after.
data T1;
input ID DATE1 mmddyy10.
PRESENT_201001 PRESENT_201002 PRESENT201003 PRESENT201004 PRESENT201005 PRESENT201006
PRESENT_201007 PRESENT_201008 PRESENT201009 PRESENT201010 PRESENT201011 PRESENT201012;
DATE1=intnx('month',DATE1,0);
format DATE1 monyy.;
cards ;
1 01/07/2010 1 1 1 1 1 1 1 1 1 1 1 1
2 05/09/2010 0 0 0 0 1 1 1 1 1 1 1 1
3 04/12/2010 1 1 1 1 1 1 1 1 1 1 1 1
4 03/19/2010 1 1 1 1 1 1 1 1 1 0 0 0
5 12/07/2010 1 1 1 1 1 1 1 1 1 1 1 1
run;
data T1A (keep=ID DATE1 DATE2 PRESENT); %* Prepare data for merge;
set T1;
array DATES [12] PRESENT:;
do I=1to12;
DATE2=intnx('month','01jan2010'd,I-1);
PRESENT=DATES;
output;
end;
format DATE2 monyy.;
run;
data T2;
input ID DATE2 : mmddyy. EVENT_NUM ;
DATE2=intnx('month',DATE2,0);
format DATE2 monyy.;
cards;
1 1/9/2010 7
1 1/21/2010 23
1 2/4/2010 2
1 2/7/2010 2
1 4/5/2010 4
1 4/10/2010 1
1 4/21/2010 9
1 5/4/2010 1
1 5/9/2010 7
1 6/12/2010 33
2 5/4/2010 6
2 5/12/2010 74
2 5/30/2010 80
2 6/4/2010 1
2 6/20/2010 52
2 10/12/2010 63
2 10/29/2010 78
2 10/31/2010 33
3 1/24/2010 12
3 1/27/2010 10
3 2/3/2010 3
3 2/17/2010 19
3 7/9/2010 2
3 7/14/2010 55
4 1/23/2010 29
4 1/27/2010 44
4 1/29/2010 35
4 3/5/2010 4
4 3/14/2010 92
4 7/2/2010 4
4 8/9/2010 3
5 1/5/2010 2
5 1/23/2010 49
5 1/24/2010 14
5 9/5/2010 0
5 9/7/2010 11
5 9/12/2010 1
5 12/1/2010 4
5 12/31/2010 14
run;
proc summarydata=T2 nway; %* Prepare data for merge;
class ID DATE2;
var EVENT_NUM;
outputout=T2A(keep=ID DATE2 EVENT_NUM) sum=;
run;
data T3; %* Merge;
merge T1A T2A;
by ID DATE2;
TOTAL= ifn(PRESENT=1, max(0, EVENT_NUM ), .) ;
run;
Give us a small set of data to play with:
data set have1
data set have2
data set want
To give you an answer here I have to try and test and I don't want to do all the work. So please provide proper data steps creating sample data and then show us how the desired result needs to look like.
Here is the code to create Dataset1 (I used just one year of data):
data dataset1;
input id date1 mmddyy10. present_201001 present_201002 present201003 present201004 present201005 present201006
present_201007 present_201008 present201009 present201010 present201011 present201012;
format date1 mmddyy10.;
cards ;
1 01/07/2010 1 1 1 1 1 1 1 1 1 1 1 1
2 05/09/2010 0 0 0 0 1 1 1 1 1 1 1 1
3 04/12/2010 1 1 1 1 1 1 1 1 1 1 1 1
4 03/19/2010 1 1 1 1 1 1 1 1 1 0 0 0
5 12/07/2010 1 1 1 1 1 1 1 1 1 1 1 1
;
run;
This is the output dataset when you run that code:
id | date1 | present_201001 | present_201002 | present_201003 | present_201004 | present_201005 | present_201006 | present_201007 | present_201008 | present_201009 | present_201010 | present_201011 | present_201012 |
1 | 1/7/2010 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
2 | 5/9/2010 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
3 | 4/12/2010 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
4 | 3/19/2010 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
5 | 12/7/2010 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
Here is the code for Dataset2:
data dataset2;
input id date2 mmddyy10. event_num;
format date2 mmddyy10.;
cards ;
1 1/9/2010 17
1 1/21/2010 23
1 2/4/2010 32
1 2/7/2010 12
1 4/5/2010 14
1 4/10/2010 1
1 4/21/2010 9
1 5/4/2010 91
1 5/9/2010 17
1 6/12/2010 33
2 5/4/2010 56
2 5/12/2010 74
2 5/30/2010 80
2 6/4/2010 11
2 6/20/2010 52
2 10/12/2010 63
2 10/29/2010 78
2 10/31/2010 33
3 1/24/2010 12
3 1/27/2010 10
3 2/3/2010 8
3 2/7/2010 9
3 2/17/2010 19
3 7/9/2010 22
3 7/14/2010 55
4 1/23/2010 29
4 1/27/2010 44
4 1/29/2010 35
4 3/5/2010 74
4 3/14/2010 92
4 7/2/2010 2
4 7/5/2010 13
4 8/9/2010 53
5 1/5/2010 72
5 1/23/2010 49
5 1/24/2010 14
5 9/5/2010 20
5 9/7/2010 11
5 9/12/2010 1
5 12/1/2010 4
5 12/31/2010 14
;
run;
The output dataset:
id | date2 | event_num |
1 | 1/9/2010 | 7 |
1 | 1/21/2010 | 23 |
1 | 2/4/2010 | 2 |
1 | 2/7/2010 | 2 |
1 | 4/5/2010 | 4 |
1 | 4/10/2010 | 1 |
1 | 4/21/2010 | 9 |
1 | 5/4/2010 | 1 |
1 | 5/9/2010 | 7 |
1 | 6/12/2010 | 33 |
2 | 5/4/2010 | 6 |
2 | 5/12/2010 | 74 |
2 | 5/30/2010 | 80 |
2 | 6/4/2010 | 1 |
2 | 6/20/2010 | 52 |
2 | 10/12/2010 | 63 |
2 | 10/29/2010 | 78 |
2 | 10/31/2010 | 33 |
3 | 1/24/2010 | 12 |
3 | 1/27/2010 | 10 |
3 | 2/3/2010 | 3 |
3 | 2/17/2010 | 19 |
3 | 7/9/2010 | 2 |
3 | 7/14/2010 | 55 |
4 | 1/23/2010 | 29 |
4 | 1/27/2010 | 44 |
4 | 1/29/2010 | 35 |
4 | 3/5/2010 | 4 |
4 | 3/14/2010 | 92 |
4 | 7/2/2010 | 4 |
4 | 8/9/2010 | 3 |
5 | 1/5/2010 | 2 |
5 | 1/23/2010 | 49 |
5 | 1/24/2010 | 14 |
5 | 9/5/2010 | 0 |
5 | 9/7/2010 | 11 |
5 | 9/12/2010 | 1 |
5 | 12/1/2010 | 4 |
5 | 12/31/2010 | 14 |
The format of the dataset I want doesn't matter--the sums could be in columns or in rows, but the basic idea is:
1. Sums of event_num in each month relative to each person's date 1 (ie, -1 month, etc)
2. If a person does not have any events in a given month, they should have a missing value for the sum variable for that month UNLESS the present_201001 etc. vars from Dataset1 indicate that they were present in that month (=1), in which case, they should have a value of 0.
So one possible final dataset would be like this:
ID | Pre_date1_12_mths | Pre_date1_11_mths | Pre_date1_10_mths | Pre_date1_9_mths | Pre_date1_8_mths | Pre_date1_7_mths | Pre_date1_6_mths | Pre_date1_5_mths | Pre_date1_4_mths | Pre_date1_3_mths | Pre_date1_2_mths | Pre_date1_1_mths | Same_mth_as_Date1 | Post_date1_1_mths | Post_date1_2_mths | Post_date1_3_mths | Post_date1_4_mths | Post_date1_5_mths | Post_date1_6_mths | Post_date1_7_mths | Post_date1_8_mths | Post_date1_9_mths | Post_date1_10_mths | Post_date1_11_mths | Post_date1_12_mths |
1 | . | . | . | . | . | . | . | . | . | . | . | . | 30 | 4 | 0 | 14 | 8 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | . |
2 | . | . | . | . | . | . | . | . | . | . | . | . | 160 | 53 | 0 | 0 | 0 | 174 | 0 | 0 | . | . | . | . | . |
3 | . | . | . | . | . | . | . | . | . | 22 | 22 | 0 | 0 | 0 | 0 | 57 | 0 | 0 | 0 | 0 | 0 | . | . | . | . |
4 | . | . | . | . | . | . | . | . | . | . | 108 | 0 | 96 | 0 | 0 | 0 | 4 | 3 | 0 | . | . | . | . | . | . |
5 | 65 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12 | 0 | 0 | 18 | . | . | . | . | . | . | . | . | . | . | . | . |
But it could also be the same numbers structured in a different way, like this:
ID Relative_mth total
1 -12 .
etc.
This gives the values you are after.
data T1;
input ID DATE1 mmddyy10.
PRESENT_201001 PRESENT_201002 PRESENT201003 PRESENT201004 PRESENT201005 PRESENT201006
PRESENT_201007 PRESENT_201008 PRESENT201009 PRESENT201010 PRESENT201011 PRESENT201012;
DATE1=intnx('month',DATE1,0);
format DATE1 monyy.;
cards ;
1 01/07/2010 1 1 1 1 1 1 1 1 1 1 1 1
2 05/09/2010 0 0 0 0 1 1 1 1 1 1 1 1
3 04/12/2010 1 1 1 1 1 1 1 1 1 1 1 1
4 03/19/2010 1 1 1 1 1 1 1 1 1 0 0 0
5 12/07/2010 1 1 1 1 1 1 1 1 1 1 1 1
run;
data T1A (keep=ID DATE1 DATE2 PRESENT); %* Prepare data for merge;
set T1;
array DATES [12] PRESENT:;
do I=1to12;
DATE2=intnx('month','01jan2010'd,I-1);
PRESENT=DATES;
output;
end;
format DATE2 monyy.;
run;
data T2;
input ID DATE2 : mmddyy. EVENT_NUM ;
DATE2=intnx('month',DATE2,0);
format DATE2 monyy.;
cards;
1 1/9/2010 7
1 1/21/2010 23
1 2/4/2010 2
1 2/7/2010 2
1 4/5/2010 4
1 4/10/2010 1
1 4/21/2010 9
1 5/4/2010 1
1 5/9/2010 7
1 6/12/2010 33
2 5/4/2010 6
2 5/12/2010 74
2 5/30/2010 80
2 6/4/2010 1
2 6/20/2010 52
2 10/12/2010 63
2 10/29/2010 78
2 10/31/2010 33
3 1/24/2010 12
3 1/27/2010 10
3 2/3/2010 3
3 2/17/2010 19
3 7/9/2010 2
3 7/14/2010 55
4 1/23/2010 29
4 1/27/2010 44
4 1/29/2010 35
4 3/5/2010 4
4 3/14/2010 92
4 7/2/2010 4
4 8/9/2010 3
5 1/5/2010 2
5 1/23/2010 49
5 1/24/2010 14
5 9/5/2010 0
5 9/7/2010 11
5 9/12/2010 1
5 12/1/2010 4
5 12/31/2010 14
run;
proc summarydata=T2 nway; %* Prepare data for merge;
class ID DATE2;
var EVENT_NUM;
outputout=T2A(keep=ID DATE2 EVENT_NUM) sum=;
run;
data T3; %* Merge;
merge T1A T2A;
by ID DATE2;
TOTAL= ifn(PRESENT=1, max(0, EVENT_NUM ), .) ;
run;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.