Solved: How to delete row(s) for and ID if conditions are met?

d0816 · Posted 04-22-2019 10:36 AM

Hi,

I have following sample dataset, where I am trying to look up all the rows for an ID and delete the row or rows if End Date and End Reason are missing for an ID but if all the other variables have the same data values considering all the rows for an ID.

For e.g. There are two rows for an ID 1, I would like to delete the second row.

There are three rows for ID 19, all three rows are a keep.

Would appreciate any suggestions to solve this using Proc SQL and/or data steps.

Thank you.

Start Month	Start Date	End Date	Class	End Reason	ID	College	Location	Location Number	Session	College Name
2015-03	2015-03-04	2015-05-04	A	SC	1	U	M	40	TM	U
2015-03	2015-03-04		A		1	U	M	40	TM	U
2016-07	2016-07-28	2016-09-21	A	SC	2	U	M	40	TM	U
2016-09	2016-09-22	2016-09-30	A	SC	2	U	M	40	TM	U
2017-06	2017-06-18	2017-07-31	A	SC	3	EM	M	40	TM	EM
2017-08	2017-08-14	2017-09-15	A	UNSC/INT	3	EM	M	40	TM	EM
2016-10	2016-10-06	2016-11-30	A	UNSC/INT	4	EM	M	40	TM	EM
2017-03	2017-03-27	2017-06-30	A	UNSC/INT	4	EM-MR	M	40	MR	EM-MR
2018-01	2018-01-29	2018-03-29	A	SC	5	U	M	40	TM	U
2018-03	2018-03-30	2018-04-10	A	SC	5	U	M	40	TM	U
2017-04	2017-04-06	2017-12-26	A	SC	6	NWC	ASH	02	T	NWC
2017-12	2017-12-28	2018-02-26	A	SC	6	NWC	ASH	02	T	NWC
2014-09	2014-09-21	2014-09-21	A	FTP-NGC	7	U	M	40	TM	U
2014-10	2014-10-24	2014-10-30	A	UNSC/INT	7	U	M	40	TM	U
2015-06	2015-06-15	2015-08-13	A	SC	8	U	M	40	TM	U
2015-08	2015-08-14	2015-08-30	A	SC	8	U	M	40	TM	U
2016-10	2016-10-31	2016-10-31	A	SC	9	U	M	40	TM	U
2016-11	2016-11-01	2016-12-31	A	SC	9	U	M	40	TM	U
2016-04	2016-04-11	2016-06-30	A	UNSC/INT	10	EM	M	40	TM	EM
2016-04	2016-04-11		A		10	EM	M	40	TM	EM
2014-08	2014-08-11	2014-09-30	A	UNSC/INT	11	U	M	40	TM	U
2014-11	2014-11-13	2015-07-07	A	SC	11	U	M	40	TM	U
2015-10	2015-10-19	2015-10-30	A	UNSC/INT	12	U	M	40	TM	U
2015-10	2015-10-19		A		12	U	M	40	TM	U
2015-06	2015-06-30	2015-08-28	A	SC	13	U	M	40	TM	U
2015-08	2015-08-29	2015-08-30	A	SC	13	U	M	40	TM	U
2018-03	2018-03-21	2019-02-27	A	RWC	14	NWC	TAY	60	T	NWC
2019-02	2019-02-28	2019-02-28	A	FTP-NGC	14	NWC	TAY	60	T	NWC
2016-05	2016-05-17	2016-05-30	A	UNSC/INT	15	U	M	40	TM	U
2016-05	2016-05-17		A		15	U	M	40	TM	U
2015-06	2015-06-19	2015-06-19	A	SC	16	U	M	40	TM	U
2015-06	2015-06-20	2015-08-30	A	SC	16	U	M	40	TM	U
2018-07	2018-07-30	2018-09-27	A	SC	17	U	M	40	TM	U
2018-09	2018-09-28	2018-10-30	A	SC	17	U	M	40	TM	U
2016-01	2016-01-12	2016-07-11	A	SC	18	U	M	40	TM	U
2016-01	2016-01-12		A		18	U	M	40	TM	U
2019-01	2019-01-08	2019-02-27	A	RWC	19	NWC	ASH	02	T	NWC
2019-02	2019-02-28		A		19	NWC	ASH	02	T	NWC
2019-02	2019-02-28		A		19	NWC	PRI	50	T	NWC

Astounding · Posted 04-22-2019 03:35 PM

Good catch. Let's modify slightly:

proc sort data=have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
run;

data want;
   set have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
   if last.college_name and first.college_name then return;
   if last.college_name=0 and (end_date=. and end_reason=' ') then delete;
run;

View solution in original post

Astounding · Posted 04-22-2019 11:13 AM

First sort your data if it is not already in sorted order:

proc sort data=have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
run;

Then remove the records you don't want:

data want;
   set have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
   if last.college_name=0 and (end_date=. or end_reason=' ') then delete;
run;

d0816 · Posted 04-22-2019 12:03 PM

HI, Thanks for your reply.

I tried the code you suggested but this is not working the way I want. This keeps one record per ID but I want to delete records only if value for two variables are missing AND all the other remaining variables have same data values for the same ID with multiple rows.

novinosrin · Posted 04-22-2019 12:24 PM

Hi @d0816 Would it be possible to post your sample as datastep or replace the blanks with . , so that respondents can copy paste, test plz?

d0816 · Posted 04-22-2019 01:56 PM

@novinosrin

I replaced the blanks with a . , Please see below.

Start_Month	Start_Date	End_Date	Class	End_Reason	ID	College	Location	Location_Number	Session	College_Name
2015-03	3/4/2015	5/4/2015	A	SC	1	U	M	40	TM	U
2015-03	3/4/2015	.	A	.	1	U	M	40	TM	U
2016-07	7/28/2016	9/21/2016	A	SC	2	U	M	40	TM	U
2016-09	9/22/2016	9/30/2016	A	SC	2	U	M	40	TM	U
2017-06	6/18/2017	7/31/2017	A	SC	3	EM	M	40	TM	EM
2017-08	8/14/2017	9/15/2017	A	UNSC/INT	3	EM	M	40	TM	EM
2016-10	10/6/2016	11/30/2016	A	UNSC/INT	4	EM	M	40	TM	EM
2017-03	3/27/2017	6/30/2017	A	UNSC/INT	4	EM-MR	M	40	MR	EM-MR
2018-01	1/29/2018	3/29/2018	A	SC	5	U	M	40	TM	U
2018-03	3/30/2018	4/10/2018	A	SC	5	U	M	40	TM	U
2017-04	4/6/2017	12/26/2017	A	SC	6	NWC	ASH	2	T	NWC
2017-12	12/28/2017	2/26/2018	A	SC	6	NWC	ASH	2	T	NWC
2014-09	9/21/2014	9/21/2014	A	FTP-NGC	7	U	M	40	TM	U
2014-10	10/24/2014	10/30/2014	A	UNSC/INT	7	U	M	40	TM	U
2015-06	6/15/2015	8/13/2015	A	SC	8	U	M	40	TM	U
2015-08	8/14/2015	8/30/2015	A	SC	8	U	M	40	TM	U
2016-10	10/31/2016	10/31/2016	A	SC	9	U	M	40	TM	U
2016-11	11/1/2016	12/31/2016	A	SC	9	U	M	40	TM	U
2016-04	4/11/2016	6/30/2016	A	UNSC/INT	10	EM	M	40	TM	EM
2016-04	4/11/2016	.	A	.	10	EM	M	40	TM	EM
2014-08	8/11/2014	9/30/2014	A	UNSC/INT	11	U	M	40	TM	U
2014-11	11/13/2014	7/7/2015	A	SC	11	U	M	40	TM	U
2015-10	10/19/2015	10/30/2015	A	UNSC/INT	12	U	M	40	TM	U
2015-10	10/19/2015	.	A	.	12	U	M	40	TM	U
2015-06	6/30/2015	8/28/2015	A	SC	13	U	M	40	TM	U
2015-08	8/29/2015	8/30/2015	A	SC	13	U	M	40	TM	U
2018-03	3/21/2018	2/27/2019	A	RWC	14	NWC	TAY	60	T	NWC
2019-02	2/28/2019	2/28/2019	A	FTP-NGC	14	NWC	TAY	60	T	NWC
2016-05	5/17/2016	5/30/2016	A	UNSC/INT	15	U	M	40	TM	U
2016-05	5/17/2016	.	A	.	15	U	M	40	TM	U
2015-06	6/19/2015	6/19/2015	A	SC	16	U	M	40	TM	U
2015-06	6/20/2015	8/30/2015	A	SC	16	U	M	40	TM	U
2018-07	7/30/2018	9/27/2018	A	SC	17	U	M	40	TM	U
2018-09	9/28/2018	10/30/2018	A	SC	17	U	M	40	TM	U
2016-01	1/12/2016	7/11/2016	A	SC	18	U	M	40	TM	U
2016-01	1/12/2016	.	A	.	18	U	M	40	TM	U
2019-01	1/8/2019	2/27/2019	A	RWC	19	NWC	ASH	2	T	NWC
2019-02	2/28/2019	.	A	.	19	NWC	ASH	2	T	NWC
2019-02	2/28/2019	.	A	.	19	NWC	PRI	50	T	NWC

Astounding · Posted 04-22-2019 12:25 PM

OK, let's start by changing OR to AND:

(end_date=. and end_reason=' ')

d0816 · Posted 04-22-2019 03:11 PM

I think this will still delete the rows I want to keep for an ID. For. eg. for ID 19, using this code will delete both 2nd and third row.

Astounding · Posted 04-22-2019 03:35 PM

Good catch. Let's modify slightly:

proc sort data=have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
run;

data want;
   set have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
   if last.college_name and first.college_name then return;
   if last.college_name=0 and (end_date=. and end_reason=' ') then delete;
run;

d0816 · Posted 04-22-2019 05:40 PM

@Astounding This worked to delete the duplicate rows that meet the conditions.

Can you please tell me why these two line of codes (in green) were required and what they are actually doing?

if last.college_name and first.college_name then return;

if last.college_name=0 and (end_date=. and end_reason=' ') then delete;

Thanks so much for your help.

Astounding · Posted 04-23-2019 10:51 PM

The first line in green applies when there is only one observation for the group of variables ending with COLLEGE_NAME. In that case, nothing should be deleted, so RETURN avoids the statement that follows.

The second line in green deletes observations that contain a missing END_DATE and END_REASON, but avoids deleting if it is the last observation for the group of variables. Because of the sorted order, if the last observation has a missing value for END_DATE and END_REASON, all the observations must have a missing value for END_DATE and END_REASON. So that line keeps one observation from the group when END_DATE and END_REASON are always missing.

d0816 · Posted 04-26-2019 12:24 PM

@Astounding

Thank you so much for the explanation. It really helped understand.

Tom · Posted 04-22-2019 03:25 PM

How is what you want different than PROC SORT ?

proc sort data=have out=want nodupkey;
  by _all_;
run;

What about if you first used an UPDATE statement to carry forward values over missing END_DATE/END_REASON and then sorted to eliminate duplicate rows?

data fixed ;
  update have(keep=id end_date end_reason obs=0) have(keep=id end_date end_reason);
   by id;
   set have (drop=id end_date end_reason);
  output;
run;
proc sort data=fixed out=want nodupkey;
  by _all_;
run;

d0816 · Posted 04-23-2019 09:45 AM

@Tom I tried the code with UPDATE and then SORT. I got the error message that By variables are not properly sorted on the dataset.

novinosrin · Posted 04-23-2019 10:17 AM




data have;
input (Start_Month Start_Date	End_Date	Class	End_Reason) ($)	ID	(College	Location	Location_Number	Session	College_Name) ($);
cards;
2015-03	3/4/2015	5/4/2015	A	SC	1	U	M	40	TM	U
2015-03	3/4/2015	.	A	.	1	U	M	40	TM	U
2016-07	7/28/2016	9/21/2016	A	SC	2	U	M	40	TM	U
2016-09	9/22/2016	9/30/2016	A	SC	2	U	M	40	TM	U
2017-06	6/18/2017	7/31/2017	A	SC	3	EM	M	40	TM	EM
2017-08	8/14/2017	9/15/2017	A	UNSC/INT	3	EM	M	40	TM	EM
2016-10	10/6/2016	11/30/2016	A	UNSC/INT	4	EM	M	40	TM	EM
2017-03	3/27/2017	6/30/2017	A	UNSC/INT	4	EM-MR	M	40	MR	EM-MR
2018-01	1/29/2018	3/29/2018	A	SC	5	U	M	40	TM	U
2018-03	3/30/2018	4/10/2018	A	SC	5	U	M	40	TM	U
2017-04	4/6/2017	12/26/2017	A	SC	6	NWC	ASH	2	T	NWC
2017-12	12/28/2017	2/26/2018	A	SC	6	NWC	ASH	2	T	NWC
2014-09	9/21/2014	9/21/2014	A	FTP-NGC	7	U	M	40	TM	U
2014-10	10/24/2014	10/30/2014	A	UNSC/INT	7	U	M	40	TM	U
2015-06	6/15/2015	8/13/2015	A	SC	8	U	M	40	TM	U
2015-08	8/14/2015	8/30/2015	A	SC	8	U	M	40	TM	U
2016-10	10/31/2016	10/31/2016	A	SC	9	U	M	40	TM	U
2016-11	11/1/2016	12/31/2016	A	SC	9	U	M	40	TM	U
2016-04	4/11/2016	6/30/2016	A	UNSC/INT	10	EM	M	40	TM	EM
2016-04	4/11/2016	.	A	.	10	EM	M	40	TM	EM
2014-08	8/11/2014	9/30/2014	A	UNSC/INT	11	U	M	40	TM	U
2014-11	11/13/2014	7/7/2015	A	SC	11	U	M	40	TM	U
2015-10	10/19/2015	10/30/2015	A	UNSC/INT	12	U	M	40	TM	U
2015-10	10/19/2015	.	A	.	12	U	M	40	TM	U
2015-06	6/30/2015	8/28/2015	A	SC	13	U	M	40	TM	U
2015-08	8/29/2015	8/30/2015	A	SC	13	U	M	40	TM	U
2018-03	3/21/2018	2/27/2019	A	RWC	14	NWC	TAY	60	T	NWC
2019-02	2/28/2019	2/28/2019	A	FTP-NGC	14	NWC	TAY	60	T	NWC
2016-05	5/17/2016	5/30/2016	A	UNSC/INT	15	U	M	40	TM	U
2016-05	5/17/2016	.	A	.	15	U	M	40	TM	U
2015-06	6/19/2015	6/19/2015	A	SC	16	U	M	40	TM	U
2015-06	6/20/2015	8/30/2015	A	SC	16	U	M	40	TM	U
2018-07	7/30/2018	9/27/2018	A	SC	17	U	M	40	TM	U
2018-09	9/28/2018	10/30/2018	A	SC	17	U	M	40	TM	U
2016-01	1/12/2016	7/11/2016	A	SC	18	U	M	40	TM	U
2016-01	1/12/2016	.	A	.	18	U	M	40	TM	U
2019-01	1/8/2019	2/27/2019	A	RWC	19	NWC	ASH	2	T	NWC
2019-02	2/28/2019	.	A	.	19	NWC	ASH	2	T	NWC
2019-02	2/28/2019	.	A	.	19	NWC	PRI	50	T	NWC
;


proc sql;
create table temp as
select *,count(*) as c1 
from
(select *,n(id) as c from have group by id)
group by Start_Month ,Start_Date,id,	Class,College,	Location,	Location_Number,	Session,	College_Name,c
order by id;
quit;


data want;
set temp;
by id;
if c=c1 and cmiss(end_date,end_reason)=2 then delete;
drop c c1;
run;

EDITED to correct cmiss(end_date,end_reason)=2

d0816 · Posted 04-23-2019 02:46 PM

@novinosrin Thank you this solution using Proc SQL. The code is working for the sample dataset I provided.

However, If there were only one row for an ID with missing End_date and End_Reason (meaning that the "Class" is ongoing), those got deleted too. I did not have a case like this in the sample dataset and I was not clear on this. But I do not want to delete IDs with only one row.For eg.

Start_Month	Start_Date	End_Date	Class	End_Reason	ID	College	Location	Location_Number	Session	College_Name
2018-11	2018-11-05	.	A	SC	20	U	M	40	TM	U
2019-02	2019-02-07	.	A	.	21	U	M	40	TM	U

For IDs with multiple rows, looking at all the rows for the ID, I want to delete those rows with missing End_date and End_Reason but all other data values are the same with one of the rows. If its only one row for an ID whether End_date and End_Reason are missing or not missing, I want to keep it.

Sorry, I was not clear on this early on.

How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Re: How to delete row(s) for and ID if conditions are met?

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away