Hello all,
I have code that allows me to reorganize my data (companies observed each year 2006-2015) based on a variable. The variable is an event-based variable that takes on values from 0-1 (1 if it happened). When (and/or when) the variable takes on a 1, it collapses the rows and sum the values on other relevant variables. It also creates different indicators where: (1) one variable indicates when time starts (2) one variable that indicates when the time ends, and (3) a variable that indicates 0/1 whether an event took place at all. Essentially, each observation of a case becomes a start-stop as to when the event-based variable was observed.
I Illustrate below what I have:
Have:
Time | ID | County | FocalEvent | X1mva | X2police |
1 | 222 | ABC | 1 | 10 | 15 |
2 | 222 | ABC | 0 | 5 | 10 |
3 | 222 | ABC | 0 | 15 | 5 |
4 | 222 | ABC | 0 | 2 | 4 |
5 | 222 | ABC | 1 | 5 | 10 |
6 | 222 | ABC | 0 | 10 | 5 |
7 | 222 | ABC | 0 | 5 | 10 |
8 | 222 | ABC | 0 | 4 | 4 |
9 | 222 | ABC | 0 | 5 | 5 |
10 | 222 | ABC | 1 | 10 | 10 |
1 | 333 | BBB | 0 | 3 | 4 |
2 | 333 | BBB | 0 | 2 | 1 |
3 | 333 | BBB | 0 | 5 | 5 |
4 | 333 | BBB | 1 | 10 | 5 |
5 | 333 | BBB | 0 | 15 | 15 |
6 | 333 | BBB | 0 | 10 | 10 |
7 | 333 | BBB | 0 | 10 | 10 |
8 | 333 | BBB | 1 | 5 | 5 |
9 | 333 | BBB | 0 | 10 | 10 |
10 | 333 | BBB | 0 | 15 | 15 |
1 | 444 | CCC | 0 | 1 | 1 |
2 | 444 | CCC | 0 | 5 | 5 |
3 | 444 | CCC | 0 | 4 | 4 |
4 | 444 | CCC | 0 | 5 | 5 |
5 | 444 | CCC | 0 | 1 | 1 |
6 | 444 | CCC | 1 | 5 | 5 |
7 | 444 | CCC | 0 | 10 | 10 |
8 | 444 | CCC | 0 | 5 | 5 |
9 | 444 | CCC | 0 | 5 | 5 |
10 | 444 | CCC | 0 | 5 | 5 |
1 | 555 | DDD | 0 | 1 | 1 |
2 | 555 | DDD | 0 | 1 | 1 |
3 | 555 | DDD | 0 | 1 | 1 |
Want:
ID | County | X1mva | X2police | Time1 | Time2 | Event |
222 | ABC | 10 | 15 | 1 | 1 | 1 |
222 | ABC | 27 | 29 | 1 | 5 | 1 |
222 | ABC | 34 | 34 | 5 | 10 | 1 |
333 | BBB | 20 | 15 | 1 | 4 | 1 |
333 | BBB | 40 | 40 | 4 | 8 | 1 |
333 | BBB | 25 | 25 | 8 | 10 | 0 |
444 | CCC | 21 | 21 | 1 | 6 | 1 |
444 | CCC | 25 | 25 | 6 | 10 | 0 |
555 | DDD | 3 | 3 | 1 | 3 | 0 |
Code:
data have;
input Time ID County :$ FocalEvent X1mva X2police;
cards;
1 222 ABC 1 10 15
2 222 ABC 0 5 10
3 222 ABC 0 15 5
4 222 ABC 0 2 4
5 222 ABC 1 5 10
6 222 ABC 0 10 5
7 222 ABC 0 5 10
8 222 ABC 0 4 4
9 222 ABC 0 5 5
10 222 ABC 1 10 10
1 333 BBB 0 3 4
2 333 BBB 0 2 1
3 333 BBB 0 5 5
4 333 BBB 1 10 5
5 333 BBB 0 15 15
6 333 BBB 0 10 10
7 333 BBB 0 10 10
8 333 BBB 1 5 5
9 333 BBB 0 10 10
10 333 BBB 0 15 15
1 444 CCC 0 1 1
2 444 CCC 0 5 5
3 444 CCC 0 4 4
4 444 CCC 0 5 5
5 444 CCC 0 1 1
6 444 CCC 1 5 5
7 444 CCC 0 10 10
8 444 CCC 0 5 5
9 444 CCC 0 5 5
10 444 CCC 0 5 5
1 555 DDD 0 1 1
2 555 DDD 0 1 1
3 555 DDD 0 1 1
;
run;
proc sort data = have;
by ID County Time;
run;
data want1;
set have;
by ID;
category + first.ID;
output;
category + FocalEvent;
run;
proc sql;
create table want2 as
select
ID, County,
sum(X1mva) as X1mva,
sum(X2police) as X2police,
max(min(Time)-1,1) as Time1,
max(Time) as Time2,
max(FocalEvent) as Event
from want1
group by
ID, County, category
order by
ID, County, Time1, Time2
;
quit;
proc print;
run;
In practice, the variables (e.g., X1MVA, X2POLICE) that I sum have a lot of zeros on them and the time based events are spread out and more sparse. Is there a way I can add to my code such that I can count the time since the latest occurrence of the X1MVA or X2POLICE variables that I sum. For instance:
Have:
Time | ID | County | FocalEvent | X1mva | X2police |
1 | 222 | ABC | 1 | 0 | 1 |
2 | 222 | ABC | 0 | 1 | 0 |
3 | 222 | ABC | 0 | 0 | 1 |
4 | 222 | ABC | 0 | 0 | 0 |
5 | 222 | ABC | 1 | 0 | 0 |
6 | 222 | ABC | 0 | 0 | 1 |
7 | 222 | ABC | 0 | 0 | 0 |
8 | 222 | ABC | 0 | 0 | 0 |
9 | 222 | ABC | 0 | 1 | 0 |
10 | 222 | ABC | 1 | 1 | 0 |
1 | 333 | BBB | 0 | 0 | 1 |
2 | 333 | BBB | 0 | 0 | 1 |
3 | 333 | BBB | 0 | 1 | 0 |
4 | 333 | BBB | 1 | 0 | 0 |
5 | 333 | BBB | 0 | 1 | 0 |
6 | 333 | BBB | 0 | 0 | 0 |
7 | 333 | BBB | 0 | 0 | 0 |
8 | 333 | BBB | 1 | 0 | 1 |
9 | 333 | BBB | 0 | 0 | 1 |
10 | 333 | BBB | 0 | 1 | 0 |
Want:
ID | County | X1mva | X2police | timesinceX1 | timesinceX2 | Time1 | Time2 | Event |
222 | ABC | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
222 | ABC | 1 | 1 | 3 | 2 | 1 | 5 | 1 |
222 | ABC | 2 | 1 | 0 | 4 | 5 | 10 | 1 |
333 | BBB | 1 | 2 | 1 | 2 | 1 | 4 | 1 |
333 | BBB | 1 | 1 | 3 | 0 | 4 | 8 | 1 |
333 | BBB | 1 | 1 | 0 | 1 | 8 | 10 | 0 |
Thanks in advance and I greatly appreciate you taking a look.
Your first want dataset can be easily achieved in a single data step:
data want;
set have (rename=(X1mva=_X1mva X2police=_X2police));
by id;
retain
X1mva
X2police
time1
;
if first.id
then do;
X1mva = 0;
X2police = 0;
time1 = time;
end;
X1mva + _X1mva;
X2police + _X2police;
if focalevent = 1 or last.id
then do;
time2 = time;
event = focalevent;
output;
X1mva = 0;
X2police = 0;
time1 = time;
end;
drop _: time focalevent;
run;
For the second part, something like this:
if X1mva = 1
then timesinceX1 = 0;
else timesinceX1 + 1;
if X2police = 1
then timesinceX2 = 0;
else timesinceX2 + 1;
Please post your example data in a working data step with datalines, so it's easy to recreate your dataset for testing.
Thank you for your response. Greatly appreciated. I set up the code like so:
data test.agmagtest;
set test.agmag2 (rename=(X1mva=_X1mva X2police=_X2police));
by id;
retain
X1mva
X2police
time1
;
if first.id
then do;
X1mva = 0;
X2police = 0;
time1 = time;
end;
X1mva + _X1mva;
X2police + _X2police;
if focalevent = 1 or last.id
then do;
time2 = time;
event = focalevent;
output;
X1mva = 0;
X2police = 0;
time1 = time;
end;
if X1mva = 1
then timesinceX1 = 0;
else timesinceX1 + 1;
if X2police = 1
then timesinceX2 = 0;
else timesinceX2 + 1;
drop _: time focalevent;
run;
A couple issues I am having:
(1) I don't get sums, mins. and maxes for the x1 and x2 variables
(2) The timesinceX1 and timesinceX2 variables are producing impossible values. I only want it to tell me the time since the X1 or the time since the last X2 for that particular ID.
Could you please advise? Thanks a lot.
I missed to use underlines, and you positioned the additional code in the wrong place:
data have;
input Time ID $ County $ FocalEvent X1mva X2police;
datalines;
1 222 ABC 1 0 1
2 222 ABC 0 1 0
3 222 ABC 0 0 1
4 222 ABC 0 0 0
5 222 ABC 1 0 0
6 222 ABC 0 0 1
7 222 ABC 0 0 0
8 222 ABC 0 0 0
9 222 ABC 0 1 0
10 222 ABC 1 1 0
;
data want;
set have (rename=(X1mva=_X1mva X2police=_X2police));
by id;
retain
X1mva
X2police
timesinceX1
timesinceX2
time1
;
if first.id
then do;
X1mva = 0;
X2police = 0;
time1 = time;
end;
X1mva + _X1mva;
X2police + _X2police;
if _X1mva = 1 or first.id
then timesinceX1 = 0;
else timesinceX1 + 1;
if _X2police = 1 or first.id
then timesinceX2 = 0;
else timesinceX2 + 1;
if focalevent = 1 or last.id
then do;
time2 = time;
event = focalevent;
output;
X1mva = 0;
X2police = 0;
time1 = time;
end;
drop _: time focalevent;
run;
Kurt, thanks for your help on this. I am still having issues getting it just right. Specifically, the problem is that 'the want' spits out time1 and time2 values that aren't centered at zero. For example,
time1 time2
8 26
26 42
instead of
time1 time2
0 18
18 34
I wrote some code that spits out what I want better, but I still don't know how to insert in the timesince variables that I want. Could you advise?
proc sort data=have out=have_sorted;
by ID County Time FocalEvent;
run;
data have_group;
do i=1 until (last.FocalEvent);
set have_sorted;
by ID County FocalEvent notsorted;
if lag(FocalEvent)=1 then group+1;
if first.ID then group=1;
end;
run;
data have_group2;
set have_group;
by ID County group notsorted;
if first.County then nobs=1;
else nobs+1;
run;
proc means data=have_group2 noprint;
var FocalEvent X1mva X2police nobs;
class ID County group;
ways 3;
output out=have_sum sum= min= max= / autoname;
run;
data want;
set have_sum (keep=ID County FocalEvent_Sum X1mva_Sum X2police_Sum nobs_Min nobs_Max);
if nobs_Min ne 1 then Time1 = nobs_Min-1;
else Time1 = nobs_Min;
rename FocalEvent_Sum = Event X1mva_Sum = X1mva X2police_Sum = X2police nobs_Max = Time2;
run;
To "center on zero", save the first value of a group to a retained variable, and subtract that from all the values you calculate later.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.