BookmarkSubscribeRSS Feed
umar_milanzi
Calcite | Level 5

hi All,

 

i need help. I have a dataset, sample below:

 

CUSTOMER_NAME           MOBILE_NUMBER

John                                      555-777

Sam                                       556-865

Joe                                        555-000

Sam+Joe                                556-865

Tim                                         555-777

 

want to output customers with same mobile number but different names. in the data above we can see that Sam and Sam+Joe have the same number, this means Sam is Joe's partner so they can have the same number. but John and Tim are not connected and have the same number, those are the guys i want to output

 

thanks in advance

5 REPLIES 5
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Post test data in the form of a datastep.

 

Maybe something like (and can't test as not test data):

proc sql;
  create table WANT as
  select  distinct
          MOBILE_NUMBER,
          CUSTOMER_NAME
  from    HAVE
  group by MOBILE_NUMBER,
           CUSTOMER_NAME
  having count(*) gt 1;
quit;
  
umar_milanzi
Calcite | Level 5


data have;
input customer_name $ Mobile_Number ;

datalines;
john 555777
sam 556865
joe 555888
sam+joe 556865
tim 555777
; run;

Ksharp
Super User

"but John and Tim are not connected and have the same number, those are the guys i want to output"

You mean you do not want to output ?

Astounding
PROC Star

Names are often spelled differently, when entered more than once.  As a result, I suggest you go about this as a two-step problem.  First, find all the differences:

 

proc sql;

create table differences as

select a.*, b.customer_name as matching_name

from have a, have b

where a.mobile_number = b.mobile_number

and a.customer_name ne b.customer_name;

quit;

 

data differences2;

set differences;

if length(customer_name) le length(matching_name) then do;

   dummy = customer_name;

   customer_name = matching_name;

   matching_name = dummy;

end;

drop dummy;

run;

 

In DIFFERENCES2, CUSTOMER_NAME and MATCHING_NAME will be different, but the DATA step guarantees that CUSTOMER_NAME will be the variation that contains more characters.

 

Then you can begin to automate the process of removing matches that should be removed.  Due to variations in spelling, you may need more than one pass through the data.  For example, does "Sam" match "Samuel"?  Should those be output?  Here is one pass through to eliminate the duplicates:

 

data want;

set differences2;

if index(customer_name, strip(matching_name)) = 0;

run;

 

Instead of the INDEX function, you may want to use FINDW.  It really depends on the nature of the data, as to which function gives you better results.

umar_milanzi
Calcite | Level 5

Thanks @Astounding this looks like it is doing what i want. 

 

Thanks everyone for the inputs. 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 2399 views
  • 1 like
  • 4 in conversation