Re: How do I reduce duplicate rows/observations by keeping records of ...

Ronnyclay32 · Posted 12-15-2018 06:30 PM

I have a dataset containing the start date (date_initial) and the end date (date_final) that an employee has assumed a function (effective or temporary) in a company. I would like to reduce the number of rows, by type of function assumed (effective or temporary) in the company, since the start date of a most recently assumed function was obtained the day after the completion of another previous function.

For example, the employee Douglas assumed the oldest effective function from 03/15/2013 until 04/19/2016. Subsequently, he uninterruptedly assumed other functions in an effective manner, and the latter ended on 06/15/2018. In this case, the employee would only keep an observation (row) containing the initial date on 03/15/2013 when he assumed the first function and the end date on 06/15/2018, when he assumed the last effective function.

The following is part of the dataset:

obs	name	function_type	date_initial	date_final
1	Douglas	effective	15/03/2013	19/04/2015
2	Douglas	effective	20/04/2015	07/08/2016
3	Douglas	effective	08/08/2016	15/06/2018
4	Douglas	temporary	16/06/2018	16/12/2018
5	Emily	effective	15/02/2015	16/09/2017
6	Emily	effective	17/09/2017	15/12/2018
7	Olivia	effective	03/01/2014	25/04/2016
8	Olivia	effective	17/07/2016	20/11/2017
9	Olivia	temporary	21/11/2017	03/03/2018

And here is the result that I hope to obtain:

obs	name	function_type	date_initial	date_final
1	Douglas	effective	15/03/2013	15/06/2018
2	Douglas	temporary	16/06/2018	16/12/2018
3	Emily	effective	15/02/2015	15/12/2018
4	Olivia	effective	03/01/2014	25/04/2016
5	Olivia	effective	17/07/2016	20/11/2017
6	Olivia	temporary	21/11/2017	03/03/2018

I would appreciate it if you could help me solve this problem.

Jagadishkatam · Posted 12-16-2018 06:53 PM

data have;
input 	name:$	function_type:$9.	date_initial:ddmmyy10.	date_final:ddmmyy10.;
lag_date=lag(date_final);
format date_initial date_final lag_date date9.;
cards;
Douglas effective 15/03/2013 19/04/2015
Douglas effective 20/04/2015 07/08/2016
Douglas effective 08/08/2016 15/06/2018
Douglas temporary 16/06/2018 16/12/2018
Emily effective 15/02/2015 16/09/2017
Emily effective 17/09/2017 15/12/2018
Olivia effective 03/01/2014 25/04/2016
Olivia effective 17/07/2016 20/11/2017
Olivia temporary 21/11/2017 03/03/2018
;

data want;
set have;
by name function_type date_initial;
if first.function_type then lag_date=.;
dif=date_initial-lag_date;
if first.date_initial and (dif eq . or dif >1); 
drop lag_date dif;
run;

Thanks,
Jag

learsaas · Posted 12-17-2018 03:34 AM

data a;
	input 	name:$	function_type:$9.	date_initial:ddmmyy10.	date_final:ddmmyy10.;
	format date_initial date_final date9.;
	cards;
Douglas effective 15/03/2013 19/04/2015
Douglas effective 20/04/2015 07/08/2016
Douglas effective 08/08/2016 15/06/2018
Douglas temporary 16/06/2018 16/12/2018
Emily effective 15/02/2015 16/09/2017
Emily effective 17/09/2017 15/12/2018
Olivia effective 03/01/2014 25/04/2016
Olivia effective 17/07/2016 20/11/2017
Olivia temporary 21/11/2017 03/03/2018
;
run;
data result;
	set a;
	by name function_type date_initial;
	retain nstart nend;
	select;
	   when(first.function_type) do;
			nstart=date_initial;nend=date_final;
	   end;
	   when(date_initial=nend+1) do;
			 nend=date_final;
	   end;
	   otherwise do;
	  		 nstart=date_initial;nend=date_final;
	   end;
	end;
run;
data result;
	set result;
	by name function_type nstart nend;
	if last.nstart then do;
		date_initial=nstart;
		output;
	end;
	drop nstart nend;
run;

Ksharp · Posted 12-17-2018 08:02 AM

data a;
	input 	name:$	function_type:$9.	date_initial:ddmmyy10.	date_final:ddmmyy10.;
	format date_initial date_final  ddmmyy10.;
	cards;
Douglas effective 15/03/2013 19/04/2015
Douglas effective 20/04/2015 07/08/2016
Douglas effective 08/08/2016 15/06/2018
Douglas temporary 16/06/2018 16/12/2018
Emily effective 15/02/2015 16/09/2017
Emily effective 17/09/2017 15/12/2018
Olivia effective 03/01/2014 25/04/2016
Olivia effective 17/07/2016 20/11/2017
Olivia temporary 21/11/2017 03/03/2018
;
run;
data b;
 set a;
 by name function_type;
 if first.function_type then group=0;
 if date_initial-lag(date_final) > 1 then group+1;
run;
data want;
 set b(rename=(date_initial=_date_initial));
 by name function_type group;
 retain date_initial;
 if first.group then date_initial=_date_initial;
 if last.group;
 drop group _date_initial;
 format date_initial ddmmyy10.;
run;

mkeintz · Posted 12-17-2018 05:36 PM

data a;
	input 	name:$	function_type:$9.	date_initial:ddmmyy10.	date_final:ddmmyy10.;
	format date_initial date_final date9.;
	cards;
Douglas effective 15/03/2013 19/04/2015  S
Douglas effective 20/04/2015 07/08/2016  -
Douglas effective 08/08/2016 15/06/2018  F
Douglas temporary 16/06/2018 16/12/2018  SF
Emily effective 15/02/2015 16/09/2017    S
Emily effective 17/09/2017 15/12/2018    F
Olivia effective 03/01/2014 25/04/2016   SF
Olivia effective 17/07/2016 20/11/2017   SF
Olivia temporary 21/11/2017 03/03/2018   SF
;
run;
data want (drop=_:);
  set a (keep=name function_type);
  by name function_type;
  merge a 
        a (firstobs=2 keep=date_initial rename=(date_initial=_nxt_initial));

  retain _save_initial ;
 
  if first.function_type or date_initial-1 ^=lag(date_final) 
    then _save_initial=date_initial;

  if last.function_type or _nxt_initial^=date_final+1;
  date_initial=_save_initial;
run;

How it works:

The SET and BY statements are there merely to generate the automatic dummy vars first.function_type and last.function_type. This lets you know when the observation in hand is the beginning (or end) of a function_type group.
The merge of A with itself provides a way to look ahead one observation (the firstobs=2 option). The lookahead record keeps only one variable - date_initial - which is renamed to _nxt_initial to keep it distinct from date_initial in the current record.
retain _save_initial statement is needed because if a time span covers more the one observation, the corresponding beginning date_initial value has to be retained across observations.
The "if first.function_type or ... " detects the start of any time-span and saves the date_initial value.
The "if last.function_type or ..." is a subsetting if, keeping only observations that must represent the end of a time-span.
Having passed the subsetting if filter, the date_initial value is recovered from the earlier saved value.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How do I reduce duplicate rows/observations by keeping records of different observations?

Re: How do I reduce duplicate rows/observations by keeping records of different observations?

Re: How do I reduce duplicate rows/observations by keeping records of different observations?

Re: How do I reduce duplicate rows/observations by keeping records of different observations?

Re: How do I reduce duplicate rows/observations by keeping records of different observations?

How do I reduce duplicate rows/observations by keeping records of different observations?

Re: How do I reduce duplicate rows/observations by keeping records of different observations?

Re: How do I reduce duplicate rows/observations by keeping records of different observations?

Re: How do I reduce duplicate rows/observations by keeping records of different observations?

Re: How do I reduce duplicate rows/observations by keeping records of different observations?

Registration is open