10-02-2017 02:13 PM
Here comes a question I cant figure out. To be classified as retired, an individual needs to have atleast 80% of the annual income from pension - two years in a row.
I have a longitudinal dataset (short format) with the percentages every year, variables: 2005, 2006, 2007.... 2014. Lets use Frank and Anna as an example:
Frank: 2005 (82%), 2006 (76%), 2007 (81%), 2008 (98%), 2009 (100%), 2010 (100%)... which could be witten as [1 0 1 1 1 1]. In other words: Frank should be classified as retired 2007 since both 2007 and 2008 are above 80%.
Anna: 2005 (0%), 2006 (0%), 2007 (0%), 2008 (0%), 2009 (100%), 2010 (100%) [0 0 0 0 1 1]. Anna should be classified as retired 2009.
How do I handle this? The problem is the "two years in a row" thing. Thanks!
10-02-2017 03:38 PM
You haven't exactly described your data in a way that makes sense as a SAS data set. I'll illustrate how to go about this, but on some pretty stiff assumptions. If your data doesn't match this, you may need to adjust the program considerably.
First, assume you have three fields for each observation: name, year, and percent.
Second, assume your percent variable is on a scale of 0 to 100, not 0 to 1. So a value of 50 = 50%.
Third, assume your observations are in sorted order BY NAME YEAR.
Finally, assume you want to add a fourth variable to each observation, with a value of Y or N to indicate retirement year.
set have (firstobs=2 keep=percent rename=(percent=next_percent))
have (keep=percent rename=(percent=next_percent));
if first.name then ever_retired='N';
if ever_retired='N' and percent >= 80 and next_percent >= 80 and last.name=0 then do;
drop next_percent ever_retired;
It's untested code, but looks about right.
10-04-2017 05:45 AM
Astounding's solution is "about right", but I think it drops the final observation.
Here is another suggestion;
data want; set have; by name; if first.name then retired='N'; retain retired; if percent>80 then do; if retired='N' and not last.name then do; _N_=_N_+1; set have (keep=percent rename=(percent=next_percent)) point=_N_; if next_percent>80 then retired='Y'; end; end; else retired='N'; drop next_percent; run;
This solution is setting the RETIRED variable to 'Y' for all years that are in a series of years with more than 80% of their income coming from pensions. (meaning that if somebody is retired, and then has a year of less than 80% of income coming from pensions, that person switches back to RETIRED='N'). Is that what you want?
Or are you just looking for the first year in such a series for each person, meaning that retirement is a once in a lifetime event? then something like this may work:
data want; set have; by name; if percent>80 and lag(percent)>80 and not first.name then do; retirement_year=year-1; output; end; keep name retirement_year; run;
Like the previous example, it assumes that the data are sorted by NAME and YEAR, and that the years are consecutive.
10-04-2017 07:15 AM - edited 10-04-2017 07:17 AM
Thank you s_lassen for your help. Very appreciated!!!
Since I am about to do survival analysis, I am interesting in duration (time to retirement) and event (retired y/n). So it is as you wrote in last sentence - I am only interested in one-life event.
I am using several datasets and merge them into one, currently in short-format-style (see attached illustrative Excel). Since my goal is to only have 1 variable with retirement year, for example. ID 4 become retired 2004, retirement_year=2004. Then I may drop the other information and only use the exact retirement date as in the Excel-example.
If the Excel doesn't work: https://imgur.com/a/31naK
10-04-2017 11:31 AM - edited 10-04-2017 11:32 AM
While my solution is untested, I think it should be fine. It won't drop the final observation because the second SET statement mentions HAVE twice. So the first SET statement determines when the DATA step ends. At any rate, all of that assumes there is nothing else I overlooked (such as failing to set retired to "N" which would be a possible desired outcome instead of blank vs. "Y").
@s_lassen you might want to take a look at what happens if you have three observations in a row with percent > 80. I'm not sure your logic is complete for that. Looks like a viable approach, though.
10-06-2017 06:12 AM - edited 10-06-2017 06:16 AM
Thank you both for helping me! I've tried both of your codes and the only I get to work is the last proposed by @s_lassen.
And it works! it only select years with >80% with the next year also >80%. BUT!
It do not select the FIRST year so in cases were individuals have many years according to the criteria it can look as this below:
Can I add some "Select minimum year" for every ID into the program below?
This also happens in cases were people 2003 have 81%, then 79% in 2004 and then 82% in 2005 and 2006. Were there becomes a gap (but this is not a problem)
data LIB1.Test2; set LIB1.Test1; by ID; if percent>80 and lag(percent)>80 and not first.ID then do; retirement_event=Retirement_year-1; output; end; keep ID retirement_event ; run;
10-06-2017 11:09 PM
Any year that is part of consecutive percent >=80 is retire='Y', otherwise retire='N'.
data want (drop=nxt_pct); set have (keep=name); by name; merge have have (firstobs=2 keep=percent rename=(percent=nxt_pct)); retire= ifc(min(lag(percent),percent)>=80 and first.name=0,'Y','N'); if min(percent,nxt_pct)>=80 then retire='Y'; run;