BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
d0816
Quartz | Level 8

Hi,

I have following sample dataset, where I am trying to look up all the rows for an ID and delete the row or rows if End Date and End Reason are missing for an ID but if all the other variables have the same data values considering all the rows for an ID.

 

For e.g. There are two rows for an ID 1, I would like to delete the second row.

There are three rows for ID 19, all three rows are a keep.

 

Would appreciate any suggestions to solve this using Proc SQL and/or data steps.

 

Thank you.

 

Start
  Month
Start DateEnd DateClassEnd ReasonIDCollegeLocationLocation NumberSessionCollege Name
2015-032015-03-042015-05-04ASC1UM40TMU
2015-032015-03-04 A 1UM40TMU
2016-072016-07-282016-09-21ASC2UM40TMU
2016-092016-09-222016-09-30ASC2UM40TMU
2017-062017-06-182017-07-31ASC3EMM40TMEM
2017-082017-08-142017-09-15AUNSC/INT3EMM40TMEM
2016-102016-10-062016-11-30AUNSC/INT4EMM40TMEM
2017-032017-03-272017-06-30AUNSC/INT4EM-MRM40MREM-MR
2018-012018-01-292018-03-29ASC5UM40TMU
2018-032018-03-302018-04-10ASC5UM40TMU
2017-042017-04-062017-12-26ASC6NWCASH02TNWC
2017-122017-12-282018-02-26ASC6NWCASH02TNWC
2014-092014-09-212014-09-21AFTP-NGC7UM40TMU
2014-102014-10-242014-10-30AUNSC/INT7UM40TMU
2015-062015-06-152015-08-13ASC8UM40TMU
2015-082015-08-142015-08-30ASC8UM40TMU
2016-102016-10-312016-10-31ASC9UM40TMU
2016-112016-11-012016-12-31ASC9UM40TMU
2016-042016-04-112016-06-30AUNSC/INT10EMM40TMEM
2016-042016-04-11 A 10EMM40TMEM
2014-082014-08-112014-09-30AUNSC/INT11UM40TMU
2014-112014-11-132015-07-07ASC11UM40TMU
2015-102015-10-192015-10-30AUNSC/INT12UM40TMU
2015-102015-10-19 A 12UM40TMU
2015-062015-06-302015-08-28ASC13UM40TMU
2015-082015-08-292015-08-30ASC13UM40TMU
2018-032018-03-212019-02-27ARWC14NWCTAY60TNWC
2019-022019-02-282019-02-28AFTP-NGC14NWCTAY60TNWC
2016-052016-05-172016-05-30AUNSC/INT15UM40TMU
2016-052016-05-17 A 15UM40TMU
2015-062015-06-192015-06-19ASC16UM40TMU
2015-062015-06-202015-08-30ASC16UM40TMU
2018-072018-07-302018-09-27ASC17UM40TMU
2018-092018-09-282018-10-30ASC17UM40TMU
2016-012016-01-122016-07-11ASC18UM40TMU
2016-012016-01-12 A 18UM40TMU
2019-012019-01-082019-02-27ARWC19NWCASH02TNWC
2019-022019-02-28 A 19NWCASH02TNWC
2019-022019-02-28 A 19NWCPRI50TNWC
1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Good catch.  Let's modify slightly:

 

proc sort data=have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
run;

data want;
   set have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
   if last.college_name and first.college_name then return;
   if last.college_name=0 and (end_date=. and end_reason=' ') then delete;
run;

View solution in original post

18 REPLIES 18
Astounding
PROC Star

First sort your data if it is not already in sorted order:

 

proc sort data=have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
run;

Then remove the records you don't want:

data want;
   set have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
   if last.college_name=0 and (end_date=. or end_reason=' ') then delete;
run;
d0816
Quartz | Level 8

HI, Thanks for your reply.

I tried the code you suggested but this is not working the way I want. This keeps one record per ID but I want to delete records only if value for two variables are missing AND all the other remaining variables have same data values for the same ID with multiple rows.

 

novinosrin
Tourmaline | Level 20

Hi @d0816   Would it be possible to post your sample as datastep or replace the blanks with , so that respondents can copy paste, test plz?

d0816
Quartz | Level 8

@novinosrin 

I replaced the blanks with a . , Please see below.

 

Start_Month

Start_DateEnd_DateClassEnd_ReasonIDCollegeLocationLocation_NumberSessionCollege_Name
2015-033/4/20155/4/2015ASC1UM40TMU
2015-033/4/2015.A.1UM40TMU
2016-077/28/20169/21/2016ASC2UM40TMU
2016-099/22/20169/30/2016ASC2UM40TMU
2017-066/18/20177/31/2017ASC3EMM40TMEM
2017-088/14/20179/15/2017AUNSC/INT3EMM40TMEM
2016-1010/6/201611/30/2016AUNSC/INT4EMM40TMEM
2017-033/27/20176/30/2017AUNSC/INT4EM-MRM40MREM-MR
2018-011/29/20183/29/2018ASC5UM40TMU
2018-033/30/20184/10/2018ASC5UM40TMU
2017-044/6/201712/26/2017ASC6NWCASH2TNWC
2017-1212/28/20172/26/2018ASC6NWCASH2TNWC
2014-099/21/20149/21/2014AFTP-NGC7UM40TMU
2014-1010/24/201410/30/2014AUNSC/INT7UM40TMU
2015-066/15/20158/13/2015ASC8UM40TMU
2015-088/14/20158/30/2015ASC8UM40TMU
2016-1010/31/201610/31/2016ASC9UM40TMU
2016-1111/1/201612/31/2016ASC9UM40TMU
2016-044/11/20166/30/2016AUNSC/INT10EMM40TMEM
2016-044/11/2016.A.10EMM40TMEM
2014-088/11/20149/30/2014AUNSC/INT11UM40TMU
2014-1111/13/20147/7/2015ASC11UM40TMU
2015-1010/19/201510/30/2015AUNSC/INT12UM40TMU
2015-1010/19/2015.A.12UM40TMU
2015-066/30/20158/28/2015ASC13UM40TMU
2015-088/29/20158/30/2015ASC13UM40TMU
2018-033/21/20182/27/2019ARWC14NWCTAY60TNWC
2019-022/28/20192/28/2019AFTP-NGC14NWCTAY60TNWC
2016-055/17/20165/30/2016AUNSC/INT15UM40TMU
2016-055/17/2016.A.15UM40TMU
2015-066/19/20156/19/2015ASC16UM40TMU
2015-066/20/20158/30/2015ASC16UM40TMU
2018-077/30/20189/27/2018ASC17UM40TMU
2018-099/28/201810/30/2018ASC17UM40TMU
2016-011/12/20167/11/2016ASC18UM40TMU
2016-011/12/2016.A.18UM40TMU
2019-011/8/20192/27/2019ARWC19NWCASH2TNWC
2019-022/28/2019.A.19NWCASH2TNWC
2019-022/28/2019.A.19NWCPRI50TNWC
Astounding
PROC Star
OK, let's start by changing OR to AND:

(end_date=. and end_reason=' ')
d0816
Quartz | Level 8

I think this will still delete the rows I want to keep for an ID. For. eg. for ID 19, using this code will delete both 2nd and third row.

Astounding
PROC Star

Good catch.  Let's modify slightly:

 

proc sort data=have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
run;

data want;
   set have;
   by id start_month start_date class college location location_number session college_name end_date end_reason;
   if last.college_name and first.college_name then return;
   if last.college_name=0 and (end_date=. and end_reason=' ') then delete;
run;
d0816
Quartz | Level 8

@Astounding This worked to delete the duplicate rows that meet the conditions.

 

Can you please tell me why these two line of codes (in green) were required and what they are actually doing?

 

if last.college_name and first.college_name then return;  

if last.college_name=0 and (end_date=. and end_reason=' ') then delete;

 

Thanks so much for your help.

Astounding
PROC Star

The first line in green applies when there is only one observation for the group of variables ending with COLLEGE_NAME.  In that case, nothing should be deleted, so RETURN avoids the statement that follows.

 

The second line in green deletes observations that contain a missing END_DATE and END_REASON, but avoids deleting if it is the last observation for the group of variables.  Because of the sorted order, if the last observation has a missing value for END_DATE and END_REASON, all the observations must have a missing value for END_DATE and END_REASON.  So that line keeps one observation from the group when END_DATE and END_REASON are always missing.

d0816
Quartz | Level 8

@Astounding 

Thank you so much for the explanation. It really helped understand.

Tom
Super User Tom
Super User

How is what you want different than PROC SORT ?

proc sort data=have out=want nodupkey;
  by _all_;
run;

What about if you first used an UPDATE statement to carry forward values over missing END_DATE/END_REASON and then sorted to eliminate duplicate rows?

data fixed ;
  update have(keep=id end_date end_reason obs=0) have(keep=id end_date end_reason);
   by id;
   set have (drop=id end_date end_reason);
  output;
run;
proc sort data=fixed out=want nodupkey;
  by _all_;
run;
d0816
Quartz | Level 8

@Tom I tried the code with UPDATE and then SORT. I got the error message that By variables are not properly sorted on the dataset.

novinosrin
Tourmaline | Level 20



data have;
input (Start_Month Start_Date	End_Date	Class	End_Reason) ($)	ID	(College	Location	Location_Number	Session	College_Name) ($);
cards;
2015-03	3/4/2015	5/4/2015	A	SC	1	U	M	40	TM	U
2015-03	3/4/2015	.	A	.	1	U	M	40	TM	U
2016-07	7/28/2016	9/21/2016	A	SC	2	U	M	40	TM	U
2016-09	9/22/2016	9/30/2016	A	SC	2	U	M	40	TM	U
2017-06	6/18/2017	7/31/2017	A	SC	3	EM	M	40	TM	EM
2017-08	8/14/2017	9/15/2017	A	UNSC/INT	3	EM	M	40	TM	EM
2016-10	10/6/2016	11/30/2016	A	UNSC/INT	4	EM	M	40	TM	EM
2017-03	3/27/2017	6/30/2017	A	UNSC/INT	4	EM-MR	M	40	MR	EM-MR
2018-01	1/29/2018	3/29/2018	A	SC	5	U	M	40	TM	U
2018-03	3/30/2018	4/10/2018	A	SC	5	U	M	40	TM	U
2017-04	4/6/2017	12/26/2017	A	SC	6	NWC	ASH	2	T	NWC
2017-12	12/28/2017	2/26/2018	A	SC	6	NWC	ASH	2	T	NWC
2014-09	9/21/2014	9/21/2014	A	FTP-NGC	7	U	M	40	TM	U
2014-10	10/24/2014	10/30/2014	A	UNSC/INT	7	U	M	40	TM	U
2015-06	6/15/2015	8/13/2015	A	SC	8	U	M	40	TM	U
2015-08	8/14/2015	8/30/2015	A	SC	8	U	M	40	TM	U
2016-10	10/31/2016	10/31/2016	A	SC	9	U	M	40	TM	U
2016-11	11/1/2016	12/31/2016	A	SC	9	U	M	40	TM	U
2016-04	4/11/2016	6/30/2016	A	UNSC/INT	10	EM	M	40	TM	EM
2016-04	4/11/2016	.	A	.	10	EM	M	40	TM	EM
2014-08	8/11/2014	9/30/2014	A	UNSC/INT	11	U	M	40	TM	U
2014-11	11/13/2014	7/7/2015	A	SC	11	U	M	40	TM	U
2015-10	10/19/2015	10/30/2015	A	UNSC/INT	12	U	M	40	TM	U
2015-10	10/19/2015	.	A	.	12	U	M	40	TM	U
2015-06	6/30/2015	8/28/2015	A	SC	13	U	M	40	TM	U
2015-08	8/29/2015	8/30/2015	A	SC	13	U	M	40	TM	U
2018-03	3/21/2018	2/27/2019	A	RWC	14	NWC	TAY	60	T	NWC
2019-02	2/28/2019	2/28/2019	A	FTP-NGC	14	NWC	TAY	60	T	NWC
2016-05	5/17/2016	5/30/2016	A	UNSC/INT	15	U	M	40	TM	U
2016-05	5/17/2016	.	A	.	15	U	M	40	TM	U
2015-06	6/19/2015	6/19/2015	A	SC	16	U	M	40	TM	U
2015-06	6/20/2015	8/30/2015	A	SC	16	U	M	40	TM	U
2018-07	7/30/2018	9/27/2018	A	SC	17	U	M	40	TM	U
2018-09	9/28/2018	10/30/2018	A	SC	17	U	M	40	TM	U
2016-01	1/12/2016	7/11/2016	A	SC	18	U	M	40	TM	U
2016-01	1/12/2016	.	A	.	18	U	M	40	TM	U
2019-01	1/8/2019	2/27/2019	A	RWC	19	NWC	ASH	2	T	NWC
2019-02	2/28/2019	.	A	.	19	NWC	ASH	2	T	NWC
2019-02	2/28/2019	.	A	.	19	NWC	PRI	50	T	NWC
;


proc sql;
create table temp as
select *,count(*) as c1 
from
(select *,n(id) as c from have group by id)
group by Start_Month ,Start_Date,id,	Class,College,	Location,	Location_Number,	Session,	College_Name,c
order by id;
quit;


data want;
set temp;
by id;
if c=c1 and cmiss(end_date,end_reason)=2 then delete;
drop c c1;
run;

EDITED to correct cmiss(end_date,end_reason)=2

d0816
Quartz | Level 8

@novinosrin  Thank you this solution using Proc SQL. The code is working for the sample dataset I provided.

 

However, If there were only one row for an ID with missing End_date and End_Reason (meaning that the "Class" is ongoing), those got deleted too. I did not have a case like this in the sample dataset and I was not clear on this. But I do not want to delete IDs with only one row.For eg.  

Start_Month

Start_DateEnd_DateClassEnd_ReasonIDCollegeLocationLocation_NumberSessionCollege_Name
2018-112018-11-05.ASC20UM40TMU
2019-022019-02-07.A.21UM40TMU

 

For IDs with multiple rows, looking at all the rows for the ID, I want to delete those rows with missing End_date and End_Reason but  all other data values are the same with one of the rows. If its only one row for an ID whether End_date and End_Reason are missing or not missing, I want to keep it. 

 

Sorry, I was not clear on this early on.

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 18 replies
  • 12980 views
  • 0 likes
  • 4 in conversation