I have a dataset that I want to de-duplicate based on an ID variable and a date. Most IDs have >1 row, but not all rows are as complete. My problem is the phone-number column. During the dedup I want to keep the rows where the phone number is present, but I know some will be missing a phone number, so if that's the case I still want to keep that unique record.
Is there a way to easily do this? I have tried using proc sort with nodupkey, but when I just use ID and date as my by variables, it typically sorts in a way that leaves only the empty phone number records. If I leave phone in the by statement, it keeps too much.
data have; input ID date :mmddyy10. phone ; format date mmddyy10.; cards; 1 11/01/2020 1112223333 1 11/01/2020 00 2 11/02/2020 2223334444 2 11/02/2020 00 3 11/02/2020 00 4 11/01/2020 00 4 11/01/2020 3334445555 5 11/03/2020 4445556666 5 11/03/2020 5556667777 4 11/03/2020 00 4 11/03/2020 3334445555 ; run; data want; input ID date :mmddyy10. phone ; format date mmddyy10.; cards; 1 11/01/2020 1112223333 2 11/02/2020 2223334444 3 11/02/2020 00 4 11/01/2020 3334445555 5 11/03/2020 4445556666 4 11/03/2020 3334445555 ; run; proc sort data=have nodupkey; by id date; run;
This create a new phone number variable to be the "last" occurring non-0 value.
proc sort data=have; by id date; run; /* next step assumes sorted by id date*/ data want; set have; by id date; retain finalphone; if first.id then call missing(finalphone); if phone ne 0 then finalphone=phone; if last.id; run;
Since I seldom do arithmetic with phone numbers I would typically say to have non-valid phone numbers as character missing, i.e. blank and test differently than the "phone ne 0" used above.
Note that this does the de-duplication after a simple sort. So see the basic results matches your need.
You need two variables, so you could at the last do something like:
if last.id then do:
phone=finalphone;
output;
end;
and drop the finalphone variable from the output set.
Hi @csanfor2 Your question is a classic example that presents a neat use-case for NOTSORTED and DOW loop utilization as a combo. Thank you for the question-
data have;
input ID date :mmddyy10. phone :$10.;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
1 11/01/2020 00
2 11/02/2020 2223334444
2 11/02/2020 00
3 11/02/2020 00
4 11/01/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
5 11/03/2020 5556667777
4 11/03/2020 00
4 11/03/2020 3334445555
;
run;
data want;
do until(last.id);
set have;
by id notsorted;
if _n_ and phone not in (' ','00') then do;
_n_=0;
output;
end;
end;
if _n_ then output;
run;
proc print noobs;run;
ID | date | phone |
---|---|---|
1 | 11/01/2020 | 1112223333 |
2 | 11/02/2020 | 2223334444 |
3 | 11/02/2020 | 00 |
4 | 11/01/2020 | 3334445555 |
5 | 11/03/2020 | 4445556666 |
4 | 11/03/2020 | 3334445555 |
This is one of those times I probably would not go to the DOW loop. My less-busy suggestion is:
data have;
input ID date :mmddyy10. phone :$10.;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
1 11/01/2020 00
2 11/02/2020 2223334444
2 11/02/2020 00
3 11/02/2020 00
4 11/01/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
5 11/03/2020 5556667777
4 11/03/2020 00
4 11/03/2020 3334445555
run;
data want;
set have;
by id notsorted;
if first.id then ptr=0;
if phone^='00' and ptr=0 then ptr=_n_;
if last.id;
if ptr ^=0 then set have point=ptr;
run;
The set with a POINT= option works to retrieve the first record with a qualifying phone number for each id group. If no qualifying phone number is found then the last record for the id group is output. This produces what the OP requested, but if an ID only has multiple '00' phones, we have no guidance on which record to keep (i.e. which date).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.