Re: Meeting a criteria "two in a row"

Weberian · Posted 10-02-2017 02:13 PM

Here comes a question I cant figure out. To be classified as retired, an individual needs to have atleast 80% of the annual income from pension - two years in a row.

I have a longitudinal dataset (short format) with the percentages every year, variables: 2005, 2006, 2007.... 2014. Lets use Frank and Anna as an example:

Frank: 2005 (82%), 2006 (76%), 2007 (81%), 2008 (98%), 2009 (100%), 2010 (100%)... which could be witten as [1 0 1 1 1 1]. In other words: Frank should be classified as retired 2007 since both 2007 and 2008 are above 80%.

Anna: 2005 (0%), 2006 (0%), 2007 (0%), 2008 (0%), 2009 (100%), 2010 (100%) [0 0 0 0 1 1]. Anna should be classified as retired 2009.

How do I handle this? The problem is the "two years in a row" thing. Thanks!

Astounding · Posted 10-02-2017 03:38 PM

You haven't exactly described your data in a way that makes sense as a SAS data set. I'll illustrate how to go about this, but on some pretty stiff assumptions. If your data doesn't match this, you may need to adjust the program considerably.

First, assume you have three fields for each observation: name, year, and percent.

Second, assume your percent variable is on a scale of 0 to 100, not 0 to 1. So a value of 50 = 50%.

Third, assume your observations are in sorted order BY NAME YEAR.

Finally, assume you want to add a fourth variable to each observation, with a value of Y or N to indicate retirement year.

data want;

set have;

by name;

set have (firstobs=2 keep=percent rename=(percent=next_percent))

have (keep=percent rename=(percent=next_percent));

if first.name then ever_retired='N';

retain ever_retired;

if ever_retired='N' and percent >= 80 and next_percent >= 80 and last.name=0 then do;

ever_retired='Y';

retired='Y';

end;

drop next_percent ever_retired;

run;

It's untested code, but looks about right.

Weberian · Posted 10-02-2017 04:04 PM

Thank you Astounding. Let me try and I will come back ASAP with more details if it works or not!

Thansk again!

s_lassen · Posted 10-04-2017 05:45 AM

@Weberian:

Astounding's solution is "about right", but I think it drops the final observation.

Here is another suggestion;

data want;
  set have;
  by name;
  if first.name then
    retired='N';
  retain retired;
  if percent>80 then do;
    if retired='N' and not last.name then do;
      _N_=_N_+1;
      set have (keep=percent rename=(percent=next_percent)) point=_N_;
      if next_percent>80 then
        retired='Y';
      end;
    end;
  else 
    retired='N';
  drop next_percent;
run;

This solution is setting the RETIRED variable to 'Y' for all years that are in a series of years with more than 80% of their income coming from pensions. (meaning that if somebody is retired, and then has a year of less than 80% of income coming from pensions, that person switches back to RETIRED='N'). Is that what you want?

Or are you just looking for the first year in such a series for each person, meaning that retirement is a once in a lifetime event? then something like this may work:

data want;
  set have;
  by name;
  if percent>80 and lag(percent)>80 and not first.name then do;
    retirement_year=year-1;
    output;
    end;
  keep name retirement_year;
run;

Like the previous example, it assumes that the data are sorted by NAME and YEAR, and that the years are consecutive.

Weberian · Posted 10-04-2017 07:15 AM

Thank you s_lassen for your help. Very appreciated!!!

Since I am about to do survival analysis, I am interesting in duration (time to retirement) and event (retired y/n). So it is as you wrote in last sentence - I am only interested in one-life event.

I am using several datasets and merge them into one, currently in short-format-style (see attached illustrative Excel). Since my goal is to only have 1 variable with retirement year, for example. ID 4 become retired 2004, retirement_year=2004. Then I may drop the other information and only use the exact retirement date as in the Excel-example.

Since I only have one variable for retirement year, it doesnt matter wether I use long or short format?
But maybe I have do make it LONG-format to be able to sort by year (percentage) and use the program you sent me?

Thanks!

If the Excel doesn't work: https://imgur.com/a/31naK

Astounding · Posted 10-04-2017 11:31 AM

While my solution is untested, I think it should be fine. It won't drop the final observation because the second SET statement mentions HAVE twice. So the first SET statement determines when the DATA step ends. At any rate, all of that assumes there is nothing else I overlooked (such as failing to set retired to "N" which would be a possible desired outcome instead of blank vs. "Y").

@s_lassen you might want to take a look at what happens if you have three observations in a row with percent > 80. I'm not sure your logic is complete for that. Looks like a viable approach, though.

Weberian · Posted 10-06-2017 06:12 AM

Thank you both for helping me! I've tried both of your codes and the only I get to work is the last proposed by @s_lassen.

And it works! it only select years with >80% with the next year also >80%. BUT!

It do not select the FIRST year so in cases were individuals have many years according to the criteria it can look as this below:

ID	Reirement_event
12278	2005
12278	2006
12294	2002
12294	2003
12294	2004
12317	2002
12317	2003
12317	2004
12317	2005
12317	2006

Can I add some "Select minimum year" for every ID into the program below?

This also happens in cases were people 2003 have 81%, then 79% in 2004 and then 82% in 2005 and 2006. Were there becomes a gap (but this is not a problem)

ID	Reirement_event
12322	2003
12322	2005
12322	2006

data LIB1.Test2;
set LIB1.Test1;
by ID;
if percent>80 and lag(percent)>80 and not first.ID then do;
retirement_event=Retirement_year-1;
output;
end;
keep ID retirement_event ;
run;

mkeintz · Posted 10-06-2017 11:09 PM

Any year that is part of consecutive percent >=80 is retire='Y', otherwise retire='N'.

data want (drop=nxt_pct);
  set have (keep=name);
  by name;

  merge have
        have (firstobs=2 keep=percent rename=(percent=nxt_pct));

  retire=  ifc(min(lag(percent),percent)>=80 and first.name=0,'Y','N');
  if min(percent,nxt_pct)>=80 then retire='Y';

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Weberian · Posted 10-09-2017 12:35 PM

Thank you all for helping me. It works now.

@Astounding @s_lassen @mkeintz

Classroom Training Available!