BookmarkSubscribeRSS Feed
csanfor2
Fluorite | Level 6

I have a dataset that I want to de-duplicate based on an ID variable and a date. Most IDs have >1 row, but not all rows are as complete. My problem is the phone-number column. During the dedup I want to keep the rows where the phone number is present, but I know some will be missing a phone number, so if that's the case I still want to keep that unique record.

 

Is there a way to easily do this? I have tried using proc sort with nodupkey, but when I just use ID and date as my by variables, it typically sorts in a way that leaves only the empty phone number records. If I leave phone in the by statement, it keeps too much.

 

data have;
input ID date :mmddyy10. phone ;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
1 11/01/2020 00
2 11/02/2020 2223334444
2 11/02/2020 00
3 11/02/2020 00
4 11/01/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
5 11/03/2020 5556667777
4 11/03/2020 00
4 11/03/2020 3334445555
;
run;


data want;
input ID date :mmddyy10. phone ;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
2 11/02/2020 2223334444
3 11/02/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
4 11/03/2020 3334445555
;
run;

proc sort data=have nodupkey;
	by id date;
run;
3 REPLIES 3
ballardw
Super User

This create a new phone number variable to be the "last" occurring non-0 value.

proc sort data=have;
   by id date;
run;
/* next step assumes sorted by id date*/

data want;
   set have;
   by id date;
   retain finalphone;
   if first.id then call missing(finalphone);
   if phone ne 0 then finalphone=phone;
   if last.id;
run;

Since I seldom do arithmetic with phone numbers I would typically say to have non-valid phone numbers as character missing, i.e. blank and test differently than the "phone ne 0" used above.

Note that this does the de-duplication after a simple sort. So see the basic results matches your need.

You need two variables, so you could at the last do something like:

 

if last.id then do:

   phone=finalphone;

   output;

end;

 

and drop the finalphone variable from the output set.

novinosrin
Tourmaline | Level 20

Hi @csanfor2  Your question is a classic example that presents a neat use-case for NOTSORTED and DOW loop utilization as a combo. Thank you for the question-



data have;
input ID date :mmddyy10. phone :$10.;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
1 11/01/2020 00
2 11/02/2020 2223334444
2 11/02/2020 00
3 11/02/2020 00
4 11/01/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
5 11/03/2020 5556667777
4 11/03/2020 00
4 11/03/2020 3334445555
;
run;

data want;
 do until(last.id);
  set have;
  by id notsorted;
  if _n_ and phone not in (' ','00') then do;
   _n_=0;
   output;
  end;
 end;
 if _n_ then output;
run;

proc print noobs;run;
ID date phone
1 11/01/2020 1112223333
2 11/02/2020 2223334444
3 11/02/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
4 11/03/2020 3334445555
mkeintz
PROC Star

This is one of those times I probably would not go to the DOW loop.  My less-busy suggestion is:

 

data have;
input ID date :mmddyy10. phone :$10.;
format date mmddyy10.;
cards;
1 11/01/2020 1112223333
1 11/01/2020 00
2 11/02/2020 2223334444
2 11/02/2020 00
3 11/02/2020 00
4 11/01/2020 00
4 11/01/2020 3334445555
5 11/03/2020 4445556666
5 11/03/2020 5556667777
4 11/03/2020 00
4 11/03/2020 3334445555
run;

data want;
  set have;
  by id notsorted;
  if first.id then ptr=0;
  if phone^='00' and ptr=0 then ptr=_n_;
  if last.id;
  if ptr ^=0 then set have point=ptr;
run;

 

The set with a POINT= option works to retrieve the first record with a qualifying phone number for each id group.  If no qualifying phone number is found then the last record for the id group is output.  This produces what the OP requested, but if an ID only has multiple '00' phones, we have no guidance on which record to keep (i.e. which date).  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 350 views
  • 0 likes
  • 4 in conversation