Hi,
I have following sample dataset, where I am trying to look up all the rows for an ID and delete the row or rows if End Date and End Reason are missing for an ID but if all the other variables have the same data values considering all the rows for an ID.
For e.g. There are two rows for an ID 1, I would like to delete the second row.
There are three rows for ID 19, all three rows are a keep.
Would appreciate any suggestions to solve this using Proc SQL and/or data steps.
Thank you.
Start Month | Start Date | End Date | Class | End Reason | ID | College | Location | Location Number | Session | College Name |
2015-03 | 2015-03-04 | 2015-05-04 | A | SC | 1 | U | M | 40 | TM | U |
2015-03 | 2015-03-04 | A | 1 | U | M | 40 | TM | U | ||
2016-07 | 2016-07-28 | 2016-09-21 | A | SC | 2 | U | M | 40 | TM | U |
2016-09 | 2016-09-22 | 2016-09-30 | A | SC | 2 | U | M | 40 | TM | U |
2017-06 | 2017-06-18 | 2017-07-31 | A | SC | 3 | EM | M | 40 | TM | EM |
2017-08 | 2017-08-14 | 2017-09-15 | A | UNSC/INT | 3 | EM | M | 40 | TM | EM |
2016-10 | 2016-10-06 | 2016-11-30 | A | UNSC/INT | 4 | EM | M | 40 | TM | EM |
2017-03 | 2017-03-27 | 2017-06-30 | A | UNSC/INT | 4 | EM-MR | M | 40 | MR | EM-MR |
2018-01 | 2018-01-29 | 2018-03-29 | A | SC | 5 | U | M | 40 | TM | U |
2018-03 | 2018-03-30 | 2018-04-10 | A | SC | 5 | U | M | 40 | TM | U |
2017-04 | 2017-04-06 | 2017-12-26 | A | SC | 6 | NWC | ASH | 02 | T | NWC |
2017-12 | 2017-12-28 | 2018-02-26 | A | SC | 6 | NWC | ASH | 02 | T | NWC |
2014-09 | 2014-09-21 | 2014-09-21 | A | FTP-NGC | 7 | U | M | 40 | TM | U |
2014-10 | 2014-10-24 | 2014-10-30 | A | UNSC/INT | 7 | U | M | 40 | TM | U |
2015-06 | 2015-06-15 | 2015-08-13 | A | SC | 8 | U | M | 40 | TM | U |
2015-08 | 2015-08-14 | 2015-08-30 | A | SC | 8 | U | M | 40 | TM | U |
2016-10 | 2016-10-31 | 2016-10-31 | A | SC | 9 | U | M | 40 | TM | U |
2016-11 | 2016-11-01 | 2016-12-31 | A | SC | 9 | U | M | 40 | TM | U |
2016-04 | 2016-04-11 | 2016-06-30 | A | UNSC/INT | 10 | EM | M | 40 | TM | EM |
2016-04 | 2016-04-11 | A | 10 | EM | M | 40 | TM | EM | ||
2014-08 | 2014-08-11 | 2014-09-30 | A | UNSC/INT | 11 | U | M | 40 | TM | U |
2014-11 | 2014-11-13 | 2015-07-07 | A | SC | 11 | U | M | 40 | TM | U |
2015-10 | 2015-10-19 | 2015-10-30 | A | UNSC/INT | 12 | U | M | 40 | TM | U |
2015-10 | 2015-10-19 | A | 12 | U | M | 40 | TM | U | ||
2015-06 | 2015-06-30 | 2015-08-28 | A | SC | 13 | U | M | 40 | TM | U |
2015-08 | 2015-08-29 | 2015-08-30 | A | SC | 13 | U | M | 40 | TM | U |
2018-03 | 2018-03-21 | 2019-02-27 | A | RWC | 14 | NWC | TAY | 60 | T | NWC |
2019-02 | 2019-02-28 | 2019-02-28 | A | FTP-NGC | 14 | NWC | TAY | 60 | T | NWC |
2016-05 | 2016-05-17 | 2016-05-30 | A | UNSC/INT | 15 | U | M | 40 | TM | U |
2016-05 | 2016-05-17 | A | 15 | U | M | 40 | TM | U | ||
2015-06 | 2015-06-19 | 2015-06-19 | A | SC | 16 | U | M | 40 | TM | U |
2015-06 | 2015-06-20 | 2015-08-30 | A | SC | 16 | U | M | 40 | TM | U |
2018-07 | 2018-07-30 | 2018-09-27 | A | SC | 17 | U | M | 40 | TM | U |
2018-09 | 2018-09-28 | 2018-10-30 | A | SC | 17 | U | M | 40 | TM | U |
2016-01 | 2016-01-12 | 2016-07-11 | A | SC | 18 | U | M | 40 | TM | U |
2016-01 | 2016-01-12 | A | 18 | U | M | 40 | TM | U | ||
2019-01 | 2019-01-08 | 2019-02-27 | A | RWC | 19 | NWC | ASH | 02 | T | NWC |
2019-02 | 2019-02-28 | A | 19 | NWC | ASH | 02 | T | NWC | ||
2019-02 | 2019-02-28 | A | 19 | NWC | PRI | 50 | T | NWC |
Good catch. Let's modify slightly:
proc sort data=have;
by id start_month start_date class college location location_number session college_name end_date end_reason;
run;
data want;
set have;
by id start_month start_date class college location location_number session college_name end_date end_reason;
if last.college_name and first.college_name then return;
if last.college_name=0 and (end_date=. and end_reason=' ') then delete;
run;
First sort your data if it is not already in sorted order:
proc sort data=have;
by id start_month start_date class college location location_number session college_name end_date end_reason;
run;
Then remove the records you don't want:
data want;
set have;
by id start_month start_date class college location location_number session college_name end_date end_reason;
if last.college_name=0 and (end_date=. or end_reason=' ') then delete;
run;
HI, Thanks for your reply.
I tried the code you suggested but this is not working the way I want. This keeps one record per ID but I want to delete records only if value for two variables are missing AND all the other remaining variables have same data values for the same ID with multiple rows.
Hi @d0816 Would it be possible to post your sample as datastep or replace the blanks with . , so that respondents can copy paste, test plz?
I replaced the blanks with a . , Please see below.
Start_Month | Start_Date | End_Date | Class | End_Reason | ID | College | Location | Location_Number | Session | College_Name |
2015-03 | 3/4/2015 | 5/4/2015 | A | SC | 1 | U | M | 40 | TM | U |
2015-03 | 3/4/2015 | . | A | . | 1 | U | M | 40 | TM | U |
2016-07 | 7/28/2016 | 9/21/2016 | A | SC | 2 | U | M | 40 | TM | U |
2016-09 | 9/22/2016 | 9/30/2016 | A | SC | 2 | U | M | 40 | TM | U |
2017-06 | 6/18/2017 | 7/31/2017 | A | SC | 3 | EM | M | 40 | TM | EM |
2017-08 | 8/14/2017 | 9/15/2017 | A | UNSC/INT | 3 | EM | M | 40 | TM | EM |
2016-10 | 10/6/2016 | 11/30/2016 | A | UNSC/INT | 4 | EM | M | 40 | TM | EM |
2017-03 | 3/27/2017 | 6/30/2017 | A | UNSC/INT | 4 | EM-MR | M | 40 | MR | EM-MR |
2018-01 | 1/29/2018 | 3/29/2018 | A | SC | 5 | U | M | 40 | TM | U |
2018-03 | 3/30/2018 | 4/10/2018 | A | SC | 5 | U | M | 40 | TM | U |
2017-04 | 4/6/2017 | 12/26/2017 | A | SC | 6 | NWC | ASH | 2 | T | NWC |
2017-12 | 12/28/2017 | 2/26/2018 | A | SC | 6 | NWC | ASH | 2 | T | NWC |
2014-09 | 9/21/2014 | 9/21/2014 | A | FTP-NGC | 7 | U | M | 40 | TM | U |
2014-10 | 10/24/2014 | 10/30/2014 | A | UNSC/INT | 7 | U | M | 40 | TM | U |
2015-06 | 6/15/2015 | 8/13/2015 | A | SC | 8 | U | M | 40 | TM | U |
2015-08 | 8/14/2015 | 8/30/2015 | A | SC | 8 | U | M | 40 | TM | U |
2016-10 | 10/31/2016 | 10/31/2016 | A | SC | 9 | U | M | 40 | TM | U |
2016-11 | 11/1/2016 | 12/31/2016 | A | SC | 9 | U | M | 40 | TM | U |
2016-04 | 4/11/2016 | 6/30/2016 | A | UNSC/INT | 10 | EM | M | 40 | TM | EM |
2016-04 | 4/11/2016 | . | A | . | 10 | EM | M | 40 | TM | EM |
2014-08 | 8/11/2014 | 9/30/2014 | A | UNSC/INT | 11 | U | M | 40 | TM | U |
2014-11 | 11/13/2014 | 7/7/2015 | A | SC | 11 | U | M | 40 | TM | U |
2015-10 | 10/19/2015 | 10/30/2015 | A | UNSC/INT | 12 | U | M | 40 | TM | U |
2015-10 | 10/19/2015 | . | A | . | 12 | U | M | 40 | TM | U |
2015-06 | 6/30/2015 | 8/28/2015 | A | SC | 13 | U | M | 40 | TM | U |
2015-08 | 8/29/2015 | 8/30/2015 | A | SC | 13 | U | M | 40 | TM | U |
2018-03 | 3/21/2018 | 2/27/2019 | A | RWC | 14 | NWC | TAY | 60 | T | NWC |
2019-02 | 2/28/2019 | 2/28/2019 | A | FTP-NGC | 14 | NWC | TAY | 60 | T | NWC |
2016-05 | 5/17/2016 | 5/30/2016 | A | UNSC/INT | 15 | U | M | 40 | TM | U |
2016-05 | 5/17/2016 | . | A | . | 15 | U | M | 40 | TM | U |
2015-06 | 6/19/2015 | 6/19/2015 | A | SC | 16 | U | M | 40 | TM | U |
2015-06 | 6/20/2015 | 8/30/2015 | A | SC | 16 | U | M | 40 | TM | U |
2018-07 | 7/30/2018 | 9/27/2018 | A | SC | 17 | U | M | 40 | TM | U |
2018-09 | 9/28/2018 | 10/30/2018 | A | SC | 17 | U | M | 40 | TM | U |
2016-01 | 1/12/2016 | 7/11/2016 | A | SC | 18 | U | M | 40 | TM | U |
2016-01 | 1/12/2016 | . | A | . | 18 | U | M | 40 | TM | U |
2019-01 | 1/8/2019 | 2/27/2019 | A | RWC | 19 | NWC | ASH | 2 | T | NWC |
2019-02 | 2/28/2019 | . | A | . | 19 | NWC | ASH | 2 | T | NWC |
2019-02 | 2/28/2019 | . | A | . | 19 | NWC | PRI | 50 | T | NWC |
I think this will still delete the rows I want to keep for an ID. For. eg. for ID 19, using this code will delete both 2nd and third row.
Good catch. Let's modify slightly:
proc sort data=have;
by id start_month start_date class college location location_number session college_name end_date end_reason;
run;
data want;
set have;
by id start_month start_date class college location location_number session college_name end_date end_reason;
if last.college_name and first.college_name then return;
if last.college_name=0 and (end_date=. and end_reason=' ') then delete;
run;
@Astounding This worked to delete the duplicate rows that meet the conditions.
Can you please tell me why these two line of codes (in green) were required and what they are actually doing?
if last.college_name and first.college_name then return;
if last.college_name=0 and (end_date=. and end_reason=' ') then delete;
Thanks so much for your help.
The first line in green applies when there is only one observation for the group of variables ending with COLLEGE_NAME. In that case, nothing should be deleted, so RETURN avoids the statement that follows.
The second line in green deletes observations that contain a missing END_DATE and END_REASON, but avoids deleting if it is the last observation for the group of variables. Because of the sorted order, if the last observation has a missing value for END_DATE and END_REASON, all the observations must have a missing value for END_DATE and END_REASON. So that line keeps one observation from the group when END_DATE and END_REASON are always missing.
Thank you so much for the explanation. It really helped understand.
How is what you want different than PROC SORT ?
proc sort data=have out=want nodupkey;
by _all_;
run;
What about if you first used an UPDATE statement to carry forward values over missing END_DATE/END_REASON and then sorted to eliminate duplicate rows?
data fixed ;
update have(keep=id end_date end_reason obs=0) have(keep=id end_date end_reason);
by id;
set have (drop=id end_date end_reason);
output;
run;
proc sort data=fixed out=want nodupkey;
by _all_;
run;
@Tom I tried the code with UPDATE and then SORT. I got the error message that By variables are not properly sorted on the dataset.
data have;
input (Start_Month Start_Date End_Date Class End_Reason) ($) ID (College Location Location_Number Session College_Name) ($);
cards;
2015-03 3/4/2015 5/4/2015 A SC 1 U M 40 TM U
2015-03 3/4/2015 . A . 1 U M 40 TM U
2016-07 7/28/2016 9/21/2016 A SC 2 U M 40 TM U
2016-09 9/22/2016 9/30/2016 A SC 2 U M 40 TM U
2017-06 6/18/2017 7/31/2017 A SC 3 EM M 40 TM EM
2017-08 8/14/2017 9/15/2017 A UNSC/INT 3 EM M 40 TM EM
2016-10 10/6/2016 11/30/2016 A UNSC/INT 4 EM M 40 TM EM
2017-03 3/27/2017 6/30/2017 A UNSC/INT 4 EM-MR M 40 MR EM-MR
2018-01 1/29/2018 3/29/2018 A SC 5 U M 40 TM U
2018-03 3/30/2018 4/10/2018 A SC 5 U M 40 TM U
2017-04 4/6/2017 12/26/2017 A SC 6 NWC ASH 2 T NWC
2017-12 12/28/2017 2/26/2018 A SC 6 NWC ASH 2 T NWC
2014-09 9/21/2014 9/21/2014 A FTP-NGC 7 U M 40 TM U
2014-10 10/24/2014 10/30/2014 A UNSC/INT 7 U M 40 TM U
2015-06 6/15/2015 8/13/2015 A SC 8 U M 40 TM U
2015-08 8/14/2015 8/30/2015 A SC 8 U M 40 TM U
2016-10 10/31/2016 10/31/2016 A SC 9 U M 40 TM U
2016-11 11/1/2016 12/31/2016 A SC 9 U M 40 TM U
2016-04 4/11/2016 6/30/2016 A UNSC/INT 10 EM M 40 TM EM
2016-04 4/11/2016 . A . 10 EM M 40 TM EM
2014-08 8/11/2014 9/30/2014 A UNSC/INT 11 U M 40 TM U
2014-11 11/13/2014 7/7/2015 A SC 11 U M 40 TM U
2015-10 10/19/2015 10/30/2015 A UNSC/INT 12 U M 40 TM U
2015-10 10/19/2015 . A . 12 U M 40 TM U
2015-06 6/30/2015 8/28/2015 A SC 13 U M 40 TM U
2015-08 8/29/2015 8/30/2015 A SC 13 U M 40 TM U
2018-03 3/21/2018 2/27/2019 A RWC 14 NWC TAY 60 T NWC
2019-02 2/28/2019 2/28/2019 A FTP-NGC 14 NWC TAY 60 T NWC
2016-05 5/17/2016 5/30/2016 A UNSC/INT 15 U M 40 TM U
2016-05 5/17/2016 . A . 15 U M 40 TM U
2015-06 6/19/2015 6/19/2015 A SC 16 U M 40 TM U
2015-06 6/20/2015 8/30/2015 A SC 16 U M 40 TM U
2018-07 7/30/2018 9/27/2018 A SC 17 U M 40 TM U
2018-09 9/28/2018 10/30/2018 A SC 17 U M 40 TM U
2016-01 1/12/2016 7/11/2016 A SC 18 U M 40 TM U
2016-01 1/12/2016 . A . 18 U M 40 TM U
2019-01 1/8/2019 2/27/2019 A RWC 19 NWC ASH 2 T NWC
2019-02 2/28/2019 . A . 19 NWC ASH 2 T NWC
2019-02 2/28/2019 . A . 19 NWC PRI 50 T NWC
;
proc sql;
create table temp as
select *,count(*) as c1
from
(select *,n(id) as c from have group by id)
group by Start_Month ,Start_Date,id, Class,College, Location, Location_Number, Session, College_Name,c
order by id;
quit;
data want;
set temp;
by id;
if c=c1 and cmiss(end_date,end_reason)=2 then delete;
drop c c1;
run;
EDITED to correct cmiss(end_date,end_reason)=2
@novinosrin Thank you this solution using Proc SQL. The code is working for the sample dataset I provided.
However, If there were only one row for an ID with missing End_date and End_Reason (meaning that the "Class" is ongoing), those got deleted too. I did not have a case like this in the sample dataset and I was not clear on this. But I do not want to delete IDs with only one row.For eg.
Start_Month | Start_Date | End_Date | Class | End_Reason | ID | College | Location | Location_Number | Session | College_Name |
2018-11 | 2018-11-05 | . | A | SC | 20 | U | M | 40 | TM | U |
2019-02 | 2019-02-07 | . | A | . | 21 | U | M | 40 | TM | U |
For IDs with multiple rows, looking at all the rows for the ID, I want to delete those rows with missing End_date and End_Reason but all other data values are the same with one of the rows. If its only one row for an ID whether End_date and End_Reason are missing or not missing, I want to keep it.
Sorry, I was not clear on this early on.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.