Looking for exceptions

Noelle125 · Posted 02-17-2012 04:16 PM

I am struggling with what I believe is a fairly simple logic string.

My data looks like this:

Incident ID	Year	Discscore	Emp No.
1	2005	2	123
2	2006	5	123
3	2007	0.5	123
4	2008	1	123
5	2009	1.5	123
6	2001	2.5	456
7	2002	3.5	456
8	2003	5	456
9	2009	4	789
10	2010	5	789
11	2006	5	1011

I am looking for employees (EMP NO.) where at some point they had a DISCSCORE of 5 but then in subsequent years, they received any other score. So, drawing the example from my dataset above, the syntax would return Emp No. 123 because they had a discscore of 5 in 2006 and then received a 0.5 in 2007. The syntax would NOT return Emp No.'s 456, 789 or 1011 because the last Discscore they received were 5s.

Thanks in advance.

Haikuo · Posted 02-17-2012 04:31 PM

What do you plan to do the cases with 5 then 0.5 then get back to 5 again?

Noelle125 · Posted 02-17-2012 04:40 PM

Good question. That situation does exist in the dataset. Ideally, those would be flagged as well but I believe those are truly rare and so if they, too, were returned, I would deal with them manually. Obviously, it is easy to find all the 5s. I'm just trying to not have to manually look through thousands of records to find those cases where an employee received a 5 and then received some other score later on.

art297 · Posted 02-17-2012 04:45 PM

Here is a way to do it with a datastep:

proc sort data=have;

by Emp_No Year;

run;

data want (keep=Emp_No);

set have;

by Emp_No;

retain have5 want;

if first.Emp_No then do;

have5=0;

want=0;

end;

if have5 and Discscore ne 5 then want=1;

if Discscore eq 5 then have5=1;

if last.Emp_no and have5 and want

then output;

run;

Noelle125 · Posted 02-17-2012 04:51 PM

Thank you! You have given me another thing to try. I really appreciate this. For whatever reason, I was really stumped by this seemingly simple issue.

Haikuo · Posted 02-17-2012 04:47 PM

Well, this will only take care of 1. there were incidence of '5'. 2.The last of group is not '5'.

data have;

infile cards;

input IncidentID Year Discscore Empno;

cards;

1 2005 2 123

2 2006 5 123

3 2007 0.5 123

4 2008 1 123

5 2009 1.5 123

6 2001 2.5 456

7 2002 3.5 456

8 2003 5 456

9 2009 4 789

10 2010 5 789

11 2006 5 1011

;

data want (keep=empno);

do until (last.empno);

set have;

by empno year;

if first.empno then _f=0;

if discscore=5 then _f=1;

if last.empno then _f1=(discscore<5);

end;

if _f*_f1=1 then output;

run;

proc print;run;

Will look into more general approach. I hope this will get you started.

Regards,

Haikuo

Noelle125 · Posted 02-17-2012 04:53 PM

Thank you! This will definitely get me started. Again, I truly appreciate all of the help.

Haikuo · Posted 02-17-2012 04:58 PM

This will take care of those fluctuated cases as well, it will first check if there is '5', and if yes, then check if there is '<5' after within the same group.

data want (keep=empno);

do until (last.empno);

set have;

by empno year;

if discscore=5 then _f=1;

if _f=1 and discscore < 5 then _f1=1;

end;

if _f*_f1 then output;

run;

HTH,

Haikuo

Noelle125 · Posted 02-17-2012 05:15 PM

Wow! Thank you!

PGStats · Posted 02-17-2012 04:37 PM

A self join should do:

data have;

input incidentId year discScore EmpNo;

datalines;

1 2005 2 123

2 2006 5 123

3 2007 0.5 123

4 2008 1 123

5 2009 1.5 123

6 2001 2.5 456

7 2002 3.5 456

8 2003 5 456

9 2009 4 789

10 2010 5 789

11 2006 5 1011

;

proc sql;

create table want as

select unique a.empNo

from have as a inner join have as b

on a.empNo=b.empNo and a.year>b.year

where b.discScore>=5;

PG

Noelle125 · Posted 02-17-2012 04:45 PM

Thank you. This is very helpful as I know next to nothing about SQL. I will try this right now.

Linlin · Posted 02-17-2012 07:36 PM

data have;

input incidentId year discScore EmpNo;

datalines;

1 2005 2 123

2 2006 5 123

3 2007 0.5 123

4 2008 1 123

5 2009 1.5 123

6 2001 2.5 456

7 2002 3.5 456

8 2003 5 456

9 2009 4 789

10 2010 5 789

11 2006 5 1011

;

proc sql;

create table want as

select distinct a.empno

from have a, have b

where (a.empNo=b.empNo) and a.discscore>=5 and (a.year<b.year);

quit;

proc print;run;

Haikuo · Posted 02-17-2012 08:03 PM

PGstats and Linlin,

SQL works for OP's as-is data, however, multiple adjacent '5's within one group will also be pulled. Check this out:

data have;

input incidentId year discScore EmpNo;

datalines;

1 2005 2 123

2 2006 5 123

3 2007 0.5 123

4 2008 1 123

5 2009 1.5 123

6 2001 2.5 456

7 2002 3.5 456

8 2003 5 456

8.1 2004 5 456

9 2009 4 789

10 2010 5 789

11 2006 5 1011

;

in this case, '456' will also be pulled, even there is no '<5' existing after '5'.

Kindly Regards,

Haikuo

PGStats · Posted 02-17-2012 10:27 PM

You are perfectly right Hai.kuo, there was a clause missing in my SQL query and the equivalent one from Linlin. As we fix this, we can also fix the problem of a sequence hitting discScore=5 many times, as illustrated below. I think the query would be most useful if it read :

data have;
input incidentId year discScore EmpNo;
datalines;
1 2005 2 123
2 2006 5 123
3 2007 0.5 123
4 2008 1 123
5 2009 1.5 123
6 2001 2.5 456
7 2002 3.5 456
8 2003 5 456
8.1 2004 5 456
9 2009 4 789
10 2010 5 789
10.1 2011 3 789
10.2 2012 5 789
10.3 2013 0.5 789
11 2006 5 1011
;

proc sql;
create table want as
select unique b.incidentId, b.year, b.empNo
from have as a inner join have as b
on a.empNo = b.empNo and
a.year > b.year and
a.discScore ne b.discScore
where b.discScore >= 5;

proc print; run;

Then every case where a series of scores hits 5 followed later by some other score would be pulled out. Given the problem statement, it is not clear if the extra condition should read a.discScore ne b.discScore (other score, as stated) or a.discScore < b.discScore (lower score, as was possibly meant). Of course, if the maximum score is 5, then both conditions are equivalent.

PG

Ksharp · Posted 02-20-2012 01:56 AM

If I understood what you mean.

data have;
input incidentId year discScore EmpNo;
datalines;
1 2005 2 123
2 2006 5 123
3 2007 0.5 123
4 2008 1 123
5 2009 1.5 123
6 2001 2.5 456
7 2002 3.5 456
8 2003 5 456
9 2004 5 456
10 2009 4 789
11 2010 5 789
12 2006 5 1011
;
run;
proc sql;
 create table want as
  select distinct empno
   from have as a
    where discscore=5
     having (select count(*) from have where empno=a.empno and year gt a.year) gt 0;
quit;

Ksharp

Registration is open

SAS Training: Just a Click Away