Re: How do I sort during de-duplication without using that as a criter...

csanfor2 · Posted 11-10-2020 03:46 PM

I have a dataset that I want to de-duplicate based on an ID variable and a date. Most IDs have >1 row, but not all rows are as complete. My problem is the phone-number column. During the dedup I want to keep the rows where the phone number is present, but I know some will be missing a phone number, so if that's the case I still want to keep that unique record.

Is there a way to easily do this? I have tried using proc sort with nodupkey, but when I just use ID and date as my by variables, it typically sorts in a way that leaves only the empty phone number records. If I leave phone in the by statement, it keeps too much.

data have;
input ID date :mmddyy10. phone ;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
1 11/01/2020 00
2 11/02/2020 2223334444
2 11/02/2020 00
3 11/02/2020 00
4 11/01/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
5 11/03/2020 5556667777
4 11/03/2020 00
4 11/03/2020 3334445555
;
run;


data want;
input ID date :mmddyy10. phone ;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
2 11/02/2020 2223334444
3 11/02/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
4 11/03/2020 3334445555
;
run;

proc sort data=have nodupkey;
	by id date;
run;

ballardw · Posted 11-10-2020 04:21 PM

This create a new phone number variable to be the "last" occurring non-0 value.

proc sort data=have;
   by id date;
run;
/* next step assumes sorted by id date*/

data want;
   set have;
   by id date;
   retain finalphone;
   if first.id then call missing(finalphone);
   if phone ne 0 then finalphone=phone;
   if last.id;
run;

Since I seldom do arithmetic with phone numbers I would typically say to have non-valid phone numbers as character missing, i.e. blank and test differently than the "phone ne 0" used above.

Note that this does the de-duplication after a simple sort. So see the basic results matches your need.

You need two variables, so you could at the last do something like:

if last.id then do:

phone=finalphone;

output;

end;

and drop the finalphone variable from the output set.

novinosrin · Posted 11-10-2020 04:34 PM

Hi @csanfor2 Your question is a classic example that presents a neat use-case for NOTSORTED and DOW loop utilization as a combo. Thank you for the question-



data have;
input ID date :mmddyy10. phone :$10.;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
1 11/01/2020 00
2 11/02/2020 2223334444
2 11/02/2020 00
3 11/02/2020 00
4 11/01/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
5 11/03/2020 5556667777
4 11/03/2020 00
4 11/03/2020 3334445555
;
run;

data want;
 do until(last.id);
  set have;
  by id notsorted;
  if _n_ and phone not in (' ','00') then do;
   _n_=0;
   output;
  end;
 end;
 if _n_ then output;
run;

proc print noobs;run;

ID	date	phone
1	11/01/2020	1112223333
2	11/02/2020	2223334444
3	11/02/2020	00
4	11/01/2020	3334445555
5	11/03/2020	4445556666
4	11/03/2020	3334445555

mkeintz · Posted 11-10-2020 10:37 PM

This is one of those times I probably would not go to the DOW loop. My less-busy suggestion is:

data have;
input ID date :mmddyy10. phone :$10.;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
1 11/01/2020 00
2 11/02/2020 2223334444
2 11/02/2020 00
3 11/02/2020 00
4 11/01/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
5 11/03/2020 5556667777
4 11/03/2020 00
4 11/03/2020 3334445555
run;

data want;
  set have;
  by id notsorted;
  if first.id then ptr=0;
  if phone^='00' and ptr=0 then ptr=_n_;
  if last.id;
  if ptr ^=0 then set have point=ptr;
run;

The set with a POINT= option works to retrieve the first record with a qualifying phone number for each id group. If no qualifying phone number is found then the last record for the id group is output. This produces what the OP requested, but if an ID only has multiple '00' phones, we have no guidance on which record to keep (i.e. which date).

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How do I sort during de-duplication without using that as a criteria for deduplicating?

Re: How do I sort during de-duplication without using that as a criteria for deduplicating?

Re: How do I sort during de-duplication without using that as a criteria for deduplicating?

Re: How do I sort during de-duplication without using that as a criteria for deduplicating?

How do I sort during de-duplication without using that as a criteria for deduplicating?

Re: How do I sort during de-duplication without using that as a criteria for deduplicating?

Re: How do I sort during de-duplication without using that as a criteria for deduplicating?

Re: How do I sort during de-duplication without using that as a criteria for deduplicating?

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away