SAS Programming

SM1

Hello,

I am working with personnel records for 80,000+ individuals across about 15 years. A minimum of 1 record per month is generated for each employee, but multiple additional records can also be generated during any month. The records include a pay grade variable (coded 0 to 10, with 0=most junior and 10=most senior). I need to accurately identify records that document demotions. As with most administrative records, the data are mostly clean - except when they are not.

I've included some sample datalines below that reflect the 3 most common data patterns:

1. a single pay grade change that reflects a demotion (see id=23456)

2. a single pay grade change that appears to reflect a random bad record, but not a demotion (see id=12345)

3. a flurry of records that bounce between 2 pay grades, but finally begin to reflect the same pay grade consistently, which appears to reflect a demotion (see id=34567)

I am using a lag variable to identify records where paygrade < lag_paygrade, and the resulting new variable "demotion" currently acts like a general flag for records that require further QA checks.

If anyone has a recommendation for how to better differentiate between the 3 patterns of pay grade values, I'd really appreciate the help.

Thanks!

data have;
input id date paygrade; format date date9.;
datalines;

23456	01JAN2021	7
23456	01FEB2021	7
23456	01MAR2021	7
23456	01APR2021	7
23456	05APR2021	7
23456	10APR2021	7
23456	15APR2021	6
23456	01MAY2021	6
23456	01JUN2021	6

12345	01JAN2021	3
12345	01FEB2021	3
12345	01MAR2021	3
12345	01APR2021	3
12345	05APR2021	3
12345	10APR2021	2
12345	15APR2021	3
12345	01MAY2021	3
12345	01JUN2021	3

34567	01JAN2021	5
34567	01FEB2021	5
34567	01MAR2021	5
34567	01APR2021	4
34567	05APR2021	5
34567	10APR2021	5
34567	15APR2021	4
34567	01MAY2021	4
34567	01JUN2021	4

run;

proc sort data=have; by id date; run;

data have1; set have;
by id date; 
lag_paygrade=lag(paygrade); if first.ssn then lag_paygrade=.;
if paygrade<lag_paygrade then demotion='yes';
run;

ballardw

Can you clearly state the research objective? That would give us possible ideas of how to differentiate.

If anyone has a recommendation for how to better differentiate between the 3 patterns of pay grade values, I'd really appreciate the help.

It appears that some sort of count of +/- changes might be in order. That could be done with the DIF function followed by proc freq.

Your data step input statement needs to be adjusted to read the date correctly otherwise it throws many errors:

input id date :date9. paygrade; format date date9.;

will read the date with the date9. informat.

data want;
   set have;
   by id notsorted;
   paydif = dif(paygrade);
   if first.id then paydif=.;
run;
proc freq data=want;
   table id*paydif/list;
run;

Will show total number of + or - changes. If there is only one then you get 0 and the change (1 means a one level increase).

If you get more than one -1 then you have multiple decreases. A series of promotions over time would be shown but multiple +1 but no -1.

KachiM

A _temporary_ array is used to store paygrades per ID. It is sized to 9999999 to handle large data, but your sample data requires 9. We then compare one cell with the next cell to find the demotion and stop the comparison once we find it. If necessary, the comparison can be continued.

data have;
input id date date9. paygrade; 
format date date9.;
datalines;
23456 01JAN2021   7
23456 01FEB2021   7
23456 01MAR2021   7
23456 01APR2021   7
23456 05APR2021   7
23456 10APR2021   7
23456 15APR2021   6
23456 01MAY2021   6
23456 01JUN2021   6
12345 01JAN2021   3
12345 01FEB2021   3
12345 01MAR2021   3
12345 01APR2021   3
12345 05APR2021   3
12345 10APR2021   2
12345 15APR2021   3
12345 01MAY2021   3
12345 01JUN2021   3
34567 01JAN2021   5
34567 01FEB2021   5
34567 01MAR2021   5
34567 01APR2021   4
34567 05APR2021   5
34567 10APR2021   5
34567 15APR2021   4
34567 01MAY2021   4
34567 01JUN2021   4
;
run;

data want;
   array k[9999999] _temporary_; 
   do i = 1 by 1 until(last.id);
      set have end = eof;
      by id;
      k[i] = paygrade;
   end;

   do j = 1 to i;
      if k[j] > k[j+1] then do;
         demotion = 1;
         leave;
      end;
   end;
   call missing(of k[*]);
drop i j date;
run;

proc print data = want;
run;

SM1

@ballardw - thanks for your reply.

The ultimate objective is to correctly identify records that accurately document demotions across comprehensive employee administrative records as indicated by (1) the change in the pay grade variable (compared to the previous record) and (2) the continuation of the new pay grade variable in subsequent records. Some employees have 2+ demotions across their tenure and some records extend for 15+ years.

I previously tried using syntax (see below) that used a character version of the pay grade variable ('pg') to concatenate the value across each encounter and then compress the consecutive occurrences of the same code to a single value, but I couldn't distinguish between employees with good data (like id=23456) and 1-off bad records (like id=12345).

proc sort data=have; by id date; run;
data want;
	retain id; 
	length comb $110;
		do until(last.id);
  		set have;
  		by id pg notsorted;
		if first.pg then comb=cats(comb, char(pg,1)); 
		end;
run;

Ksharp

The following my code is suited for only TWO levels of paygrade as you showed in your data.

If you need to apply to more pattern ,that is another story.

data have;
infile cards expandtabs;
input id date :date9. paygrade;
format date date9.;
datalines;
23456	01JAN2021	7
23456	01FEB2021	7
23456	01MAR2021	7
23456	01APR2021	7
23456	05APR2021	7
23456	10APR2021	7
23456	15APR2021	6
23456	01MAY2021	6
23456	01JUN2021	6
12345	01JAN2021	3
12345	01FEB2021	3
12345	01MAR2021	3
12345	01APR2021	3
12345	05APR2021	3
12345	10APR2021	2
12345	15APR2021	3
12345	01MAY2021	3
12345	01JUN2021	3
34567	01JAN2021	5
34567	01FEB2021	5
34567	01MAR2021	5
34567	01APR2021	4
34567	05APR2021	5
34567	10APR2021	5
34567	15APR2021	4
34567	01MAY2021	4
34567	01JUN2021	4
;

proc freq data=have noprint order=data;
table id*paygrade/out=level;
run;
data level;
 set level;
 by id notsorted;
 if first.id then level=-1;
 level+1;
 keep id paygrade level;
run;
data temp;
if _n_=1 then do;
 if 0 then set level;
 declare hash h(dataset:'level');
 h.definekey('id','paygrade');
 h.definedata('level');
 h.definedone();
end;
set have;
call missing(level);
rc=h.find();
drop rc;
run;
data temp2;
 set temp;
 by id level notsorted;
 if first.level;
 keep id level;
run;
data want;
do until(last.id);
 set temp2;
 by id notsorted;
 length pattern $ 80;
 pattern=cats(pattern,level);
end;
keep id pattern;
run;

SAS Programming

Differentiate between potential errors versus good values in a long table?

Re: Differentiate between potential errors versus good values in a long table?

Re: Differentiate between potential errors versus good values in a long table?

Re: Differentiate between potential errors versus good values in a long table?

Re: Differentiate between potential errors versus good values in a long table?

Unleashing Potential – SAS Hackathon

Global Surveillance System for Potential Bioterrorism Attacks

Unleashing Potential: Join the SAS Hackathon 2024

Hackin’SAS (the good way)

%SASDIFF, a SAS macro for differential file comparison

Follow Us

What is...

SAS Programming

Our biggest data and AI event of the year.

SAS Training: Just a Click Away

Follow Us

What is...