BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
user95
Fluorite | Level 6

Hello everyone, 

 

I am a relatively new SAS user and working on a project for which I need to do the following:

For each company ID, I have a dummy variable "event occurrence" that is either 1 or 0, for each year from 1992 to 2017.

 

I want to be able to count the number of observations between every two consecutive events ( every time event occurrence =1)

Then, if the difference is less than three years then the variable (want_var) will equal 1 for those years

if the difference=3 then the variable (want_var) will equal 0 for those years.

if the difference is >3 years, then, the variable (want_var) will equal 1 for all those years up to three years before the next event.

 

What I am looking to achieve is the following

 

IDyearEvent occurrencewant_var
1004199200
1004199300
1004199400
1004199511
1004199611
1004199700
1004199800
1004199900
1004200011
1004200111
1004200201
1004200311
1004200401
1004200501
1004200601
1004200701
1004200801
1004200901
1004201001
1004201100
1004201200
1004201300
1004201411
1004201501
1004201601
    

 

Any advice is welcome!

 

Thank you 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Okay @user95   Again as you review your data thoroughly it appears the missings goes as missings and your sample contains a lot of missing values. However all these changes are very minor IF-THENs if you follow the code to make changes. While you asked for one step solution which is fine, I really wonder whether you are able to follow and understand the code to edit/update should you consider maintaining in a production environment. Of course I hope you do. Nonetheless, it's easier to have it split into steps to modify/edit/update if you do not understand the code I am afraid.  I leave that up to you.

 

data want;
do until(last.id);
 set have;
 by id Event_occurrence notsorted;
 array t(999999) _temporary_;
 if first.Event_occurrence then _n=0;
 if Event_occurrence then do; _f=1;y=year;end;
 if _f then  _n+(Event_occurrence=0);
 if last.Event_occurrence and Event_occurrence=0 and _f then do;
 _i+1;
 t(_i)=_n;
 end;
end;
call missing(of _:);
if not missing(y) then _y=year ne y;
do until(last.id);
 set have;
 by id Event_occurrence notsorted;
 want_var=Event_occurrence;
 if Event_occurrence then _f=1;
 if first.Event_occurrence then _n=0;
 if first.Event_occurrence and Event_occurrence=0 and _f then _i+1;
 if _f and Event_occurrence>. then do; _n+(Event_occurrence=0);want_var=1;end;
 if Event_occurrence=0 and _f then do;
  if  t(_i)<3 then want_var=1;
  else if t(_i)=3 then want_var=0;
  else if t(_i)>3 then want_var=_n<=t(_i)-3;
 end;
 if _y and Event_occurrence>.  then want_var=ifn(year>=y,1,want_var);
 output;
end;
call missing(of t(*),of _:);
drop y _:;
run;

View solution in original post

20 REPLIES 20
PaigeMiller
Diamond | Level 26

if the difference is >3 years, then, the variable (want_var) will equal 1 for all those years up to three years before the next event.

 

Is there an example of this in the data? If so, which year(s) show what you mean by this?

--
Paige Miller
user95
Fluorite | Level 6

Greetings, 

 

Overall, looking at the table I attacked, the logic is as follow:

 

Before the first event for each firm, (before 1995) the variable want_var would equal 0

 

The first event is in 1995, the next event is in 1996, so since the difference between those two events is only one year, the variable want_var will equal 1 for both those years (1995, and 1996)

 

The second event and the third event are three years apart ( 3 non event years apart) (1996 to 2000), so the variable want_var will equal 1 for the event year 1996 and 2000 and zero during the those three non event years.

 

Once again, the next event is only one year apart (in 2001) so want_var=1, 

Between the 2001 event and 2003 event there is only one non event year, so want_var=1 for all those years. (2001,2002,2003)

 

Between the 2003 event year and 2014 event year , there are 10 non event years, 

so what I want to achieve in the case of difference>3 of non event years years is having the new variable want_var equal 1 from 2004 until 3 years before the next event which is 2014, so it would equal 1 on 2003 and 2014 ,  from 2004 until 2010, and then equal 0 in 2011,2012,2013.

 

Then lastly, if the last event year (2014) is not the last year recorded for that firm (2016), want_var =1 for all consecutive years until the end of data of that specific firm.

 

i hope what I want to achieve is more clear

 

Thank you

 

PaigeMiller
Diamond | Level 26

This is a good explanation of your problem. Thank you.

 

I spent about 15 minutes on trying to solve this last night, and didn't succeed. I suspect this might be a good example where the hash objects in SAS data steps might help. So I hereby summon our hash experts @novinosrin and @hashman to see if they can come up with a solution, if they have the time and interest.

--
Paige Miller
novinosrin
Tourmaline | Level 20

Thank you Sir @PaigeMiller  for the mention, however I am just  Expert @hashman  wannabe .  Hopefully I become like him one day sooner than later. He knows I am striving hard for it. 🙂

 

Hi @user95   Since your year is continuous, the requirement appears pretty straight forward(unless I missed something very important that I am known for in lacking ATD aka Attention to detail) with a simple Double DOW i.e 1.  to Flag and Count 2. To look up. So here you go-


data have;
infile cards expandtabs truncover;
input ID	year	Event_occurrence;*	want_var;
cards;
1004	1992	0	0
1004	1993	0	0
1004	1994	0	0
1004	1995	1	1
1004	1996	1	1
1004	1997	0	0
1004	1998	0	0
1004	1999	0	0
1004	2000	1	1
1004	2001	1	1
1004	2002	0	1
1004	2003	1	1
1004	2004	0	1
1004	2005	0	1
1004	2006	0	1
1004	2007	0	1
1004	2008	0	1
1004	2009	0	1
1004	2010	0	1
1004	2011	0	0
1004	2012	0	0
1004	2013	0	0
1004	2014	1	1
1004	2015	0	1
1004	2016	0	1
;
/*RULES*/
/*Then, if the difference is less than three years then the variable (want_var) will equal 1 for those years*/
/**/
/*if the difference=3 then the variable (want_var) will equal 0 for those years.*/
/**/
/*if the difference is >3 years, then, the variable (want_var) will equal 1 for all those years up to three years before the next event.*/
data want; 
 do _n=1 by 1 until(last.Event_occurrence);
  set have;
  by id Event_occurrence notsorted;
  retain _f want_var;
  if first.id then do;_f=0;want_var=0;end;
  if Event_occurrence then _f=1;
 end;
 do _n_=1 by 1 until(last.Event_occurrence);
  set have;
  by id Event_occurrence notsorted;
  if _f then want_var=1;
  if Event_occurrence=0 and _f then do;
  if  _n<3 then want_var=1;
  else if _n=3 then want_var=0;
  else if _n>3 then want_var=_n_<=_n-3;
  end;
 output;
end;
drop _:;
run;

HTH & Best Regards!

 

PaigeMiller
Diamond | Level 26

Nice solution, @novinosrin. That method did not occur to me.

--
Paige Miller
novinosrin
Tourmaline | Level 20

A wit into performance by not having another "by"  in the 2nd pass as to read the records from cache

 

data want; 
 do _n=1 by 1 until(last.Event_occurrence);
  set have;
  by id Event_occurrence notsorted;
  retain _f want_var;
  if first.id then do;_f=0;want_var=0;end;
  if Event_occurrence then _f=1;
 end;
 do _n_=1 to _n;
  set have;
  if _f then want_var=1;
  if Event_occurrence=0 and _f then do;
  if  _n<3 then want_var=1;
  else if _n=3 then want_var=0;
  else if _n>3 then want_var=_n_<=_n-3;
  end;
 output;
end;
drop _:;
run;
user95
Fluorite | Level 6

Greetings, 

 

Thank you for taking the time to answer my question, 

 

There is one detail not incorporated into your code and it is as I mentioned before as follow :

"Then lastly, if the last event year (2014) is not the last year recorded for that firm (2016), want_var =1 for all consecutive years until the end of data of that specific firm."

But following your code, it gives me want_var=1 for all remaining year except the last 3 years of the data for that firm.

 

if the event_occurrence=1 is in 2014 but the last year recorded for that firm is 2020, then want_var for all consecutive years should be=1 until 2020, but in your code want_var=1 from 2014 until 3 years before the end of the data in 2020.

 

Is there any way to fix it?

 

Thank you once again!

novinosrin
Tourmaline | Level 20

Hi @user95   In your expected output (you wrote-"what i am looking to achieve") doesn't seem to comply with the below condition right?

"Then lastly, if the last event year (2014) is not the last year recorded for that firm (2016), want_var =1 for all consecutive years until the end of data of that specific firm."

 

from - until? I am not quite understanding FROM 

user95
Fluorite | Level 6

Greetings, 

 

Yes, I want to make the want_var=1 from the last event occurrence until the end of the data for that firm. So in my example it would be from 2014 until 2016

 

How can I achieve that?

 

Thank you once more

novinosrin
Tourmaline | Level 20

Hello @user95   To Make it syntactically less verbose and easier to follow logic, I am doing the final in another step(proc sql). Once you deem we are correct, we shall jump to one step solution if you want. 

 

Test and let me know

 

data have;
infile cards expandtabs truncover;
input ID	year	Event_occurrence;*	want_var;
cards;
1004	1992	0	0
1004	1993	0	0
1004	1994	0	0
1004	1995	1	1
1004	1996	1	1
1004	1997	0	0
1004	1998	0	0
1004	1999	0	0
1004	2000	1	1
1004	2001	1	1
1004	2002	0	1
1004	2003	1	1
1004	2004	0	1
1004	2005	0	1
1004	2006	0	1
1004	2007	0	1
1004	2008	0	1
1004	2009	0	1
1004	2010	0	1
1004	2011	0	0
1004	2012	0	0
1004	2013	0	0
1004	2014	1	1
1004	2015	0	1
1004	2016	0	1
;

data temp; 
 do _n=1 by 1 until(last.Event_occurrence);
  set have;
  by id Event_occurrence notsorted;
  retain _f want_var;
  if first.id then do;_f=0;want_var=0;end;
  if Event_occurrence then _f=1;
 end;
 do _n_=1 to _n;
  set have;
  if _f then want_var=1;
  if Event_occurrence=0 and _f then do;
  if  _n<3 then want_var=1;
  else if _n=3 then want_var=0;
  else if _n>3 then want_var=_n_<=_n-3;
  end;
 output;
end;
drop _:;
run;

proc sql;
create table want(drop=t:) as
select ID,year,Event_occurrence,max((Event_occurrence=1)*year) as t, max(year) as t1,
ifn(calculated t ne calculated t1 and year>=calculated t,1,want_var) as want_var
from temp
group by id
order by id, year;
quit;
user95
Fluorite | Level 6

Greetings, 

 

Yes, the last step works as intended, as you mentioned a one step solution would be better, so that I dont end up with a separate dataset that I would need to merge back.

 

How would it be possible to do it all in one step?

 

Thank you

novinosrin
Tourmaline | Level 20

Yes sure, just a slight adjustment to not have nested DOW, instead park the count in an array on the first and then look up for the reason max of the year and year of last event_occurrence=1 is fetched at group ID level

 


data have;
infile cards expandtabs truncover;
input ID	year	Event_occurrence;*	want_var;
cards;
1004	1992	0	0
1004	1993	0	0
1004	1994	0	0
1004	1995	1	1
1004	1996	1	1
1004	1997	0	0
1004	1998	0	0
1004	1999	0	0
1004	2000	1	1
1004	2001	1	1
1004	2002	0	1
1004	2003	1	1
1004	2004	0	1
1004	2005	0	1
1004	2006	0	1
1004	2007	0	1
1004	2008	0	1
1004	2009	0	1
1004	2010	0	1
1004	2011	0	0
1004	2012	0	0
1004	2013	0	0
1004	2014	1	1
1004	2015	0	1
1004	2016	0	1
;

data want;
if 0 then set have;
want_var=0;
do until(last.id);
 set have;
 by id Event_occurrence notsorted;
 array t(999999) _temporary_;
 if first.Event_occurrence then _n=0;
 if Event_occurrence then do; _f=1;y=year;end;
 if _f then  _n+(Event_occurrence=0);
 if last.Event_occurrence and Event_occurrence=0 and _f then do;
 _i+1;
 t(_i)=_n;
 end;
end;
call missing(of _:);
_y=year ne y;
do until(last.id);
 set have;
 by id Event_occurrence notsorted;
 if Event_occurrence then _f=1;
 if first.Event_occurrence then _n=0;
 if first.Event_occurrence and Event_occurrence=0 and _f then _i+1;
 if _f then do; _n+(Event_occurrence=0);want_var=1;end;
 if Event_occurrence=0 and _f then do;
  if  t(_i)<3 then want_var=1;
  else if t(_i)=3 then want_var=0;
  else if t(_i)>3 then want_var=_n<=t(_i)-3;
 end;
 if _y then want_var=ifn(year>=y,1,want_var);
 output;
end;
call missing(of t(*),of _:);
drop y _:;
run;

 

 

user95
Fluorite | Level 6

Greetings

 

Thank you for your reply, 

 

Looking more carefully at my data , I realized that sometimes I had  missing values for the event_occurrence variable throughout the years.

 

Meaning that for the some firm A from 1992 until 2016, there might be a few years of missing values for the event_occurence variable, which your code doesnt account for, and sometimes arbitrarily results in either 0 or 1 for the want_var.

 

How can I make sure that missing values for event_occurence result in missing values in the want_var variable?

 

I thought about deleting the missing values before running your code , but somehow it doesnt seem like a good idea.

 

How do you think I should handle it?

 

Thank you!

novinosrin
Tourmaline | Level 20

can you plz post a more representative sample of your HAVE and WANT?

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 20 replies
  • 1744 views
  • 1 like
  • 3 in conversation