Hi Team,
Greetings for the day!
Please help me to write a sql to get the related party details.
For example suppose i have opened my account with mobile and email.
My wife has used my mobile number while opening her account and i have used my email id for my daughter's account.
In this case my wife's account is related with my mobile and daughter's account is related with my email id and scenario could be daughter has used wife's email.
So if i am searching for my account, i should get wife's and daughter's account as well.
I have tried to prepare the sample data with result output:
Sample :
ID | Phone | |
5534385355 | 385534616655 | ABC@GMAIL.COM |
5548558464 | 3588333353 | ABC12.com |
5564556334 | 6366636666 | ABC2@gmail.com |
554814331 | 6001066401 | ABC10.com |
5564556331 | 6366636666 | ABC3@gmail.com |
5548558456 | 68136551346 | ABC7.com |
5538083556 | 6066688561 | ABC4D.COM |
554816644 | 68141055686 | ABC9.com |
5563065535 | 6355555563486 | ABC5@yahoo.com |
55155355386 | 6460535366 | ABC11.com |
553438530 | 385534616655 | ABC1@GMAIL.COM |
554884666 | 6354641861 | ABC13.com |
5534385355 | 385534616655 | ABC@GMAIL.COM |
555336340 | 6836833860 | ABC6.com |
554363456 | 6066688561 | ABC4D.COM |
55464315555 | 6863083133 | ABC8.com |
556343006 | 6355555563486 | ABC5@yahoo.com |
Output :
ID | Phone | |
5534385355 | 385534616655 | ABC@GMAIL.COM |
553438530 | 385534616655 | ABC1@GMAIL.COM |
5564556334 | 6366636666 | ABC2@gmail.com |
5564556331 | 6366636666 | ABC3@gmail.com |
5538083556 | 6066688561 | ABC4D.COM |
554363456 | 6066688561 | ABC4D.COM |
556343006 | 6355555563486 | ABC5@yahoo.com |
5563065535 | 6355555563486 | ABC5@yahoo.com |
Please help.
Regards,
Uma Shanker Saini
Is there a reason that it must be SQL and can't be a data step?
Is a single query just getting a single account as input and you then want all the related accounts OR is this actually about creating networks for all of your data?
How many rows do you have in your actual data?
I believe what you really will have to do is something along the line of How to find all connected components in a graph
@Patrick , I was thinking the same. A hashobject would be easier in my opinion. Otherwise with SQL I am thinking of joining the table with itself
@Mazi If I understand the problem then account A and C could have no common attributes but are still linked via account B that shares attributes with both A and C ...which a single SQL self-join couldn't detect.
@Patrick, Agreed. I am also not entirely sure I understand the requirement/purpose here. My thinking was to have several self joins and finally transpose the columns to have them as required.
Does this work?
proc sql;
create table want as
(select*, count(distinct id) as gr1
from have
group by phone
having gr1 gt 1)
union corr
(select*, count(distinct id) as gr2
from have
group by email
having gr2 gt 1)
order by email, phone, id;
quit;
I suspect the problem with an SQL solution is that you have to recursively find links until no more can be found. I.e. with every newly linked ID, you have to review the remaining obs for a even newer link by phone or email. Result: you would need to have a criterion within your sql code to determine that there are no more to be added. I.e. you have to know when to stop joins.
Do you have SAS/OR? If so, then PROC OPTNET can solve the problem of recursion. It will find all the possible links and "connected components" in the entire data set, including the components connected to your desired id (say "5534385355" in your case).
To do so, you have to find all links via phone number, and all links via email, as in :
data have;
input ID Phone Email :$30.;
datalines;
5534385355 385534616655 ABC@GMAIL.COM
5548558464 3588333353 ABC12.com
5564556334 6366636666 ABC2@gmail.com
554814331 6001066401 ABC10.com
5564556331 6366636666 ABC3@gmail.com
5548558456 68136551346 ABC7.com
5538083556 6066688561 ABC4D.COM
554816644 68141055686 ABC9.com
5563065535 6355555563486 ABC5@yahoo.com
55155355386 6460535366 ABC11.com
553438530 385534616655 ABC1@GMAIL.COM
554884666 6354641861 ABC13.com
5534385355 385534616655 ABC@GMAIL.COM
555336340 6836833860 ABC6.com
554363456 6066688561 ABC4D.COM
55464315555 6863083133 ABC8.com
556343006 6355555563486 ABC5@yahoo.com
run;
proc freq data=have noprint;
tables phone*id / out=phid (keep=phone id);
tables email*id / out=emid (keep=email id);
run;
The datasets produced by proc freq will effectively list, in order, all the ID's for each phone (in PHID), and all the ID's for each email (EMID). From that you can create all discovered pairs:
data pairs (keep=id1 id2 weight);
set phid emid;
retain weight 1;
id1=lag(id);
id2=id;
if phone=lag(phone) and email=lag(email);
run;
Now in the above, you might have ID=1001 connected to ID=1002, and another obs with ID=1002 connected to ID=1003. PROC OPTNET below will know that 1003 is also connected (for your purposes) to 1001. even though there is no explicit obs stating that linkage:
proc optnet data_links=pairs out_nodes=id_nodes (rename=(node=id));
data_link_vars from=id1 to=id2;
concomp;
run;
The "concomp" statement tells optnet to create a connection id (concomp) that has the same value for every ID in a connected group. Take a look at the dataset OUT_NODES it creates. So take dataset out_nodes and get original data for the concomp that contains ID 5534385355:
data want;
if 0 then set have;
if _n_=1 then do;
declare hash h ();
h.definekey('id');
h.definedata('id');
h.definedone();
do until (end_of_wantlist);
merge id_nodes (where=(id=5534385355) in=wanted)
id_nodes end=end_of_wantlist;
by concomp;
if wanted then h.add();
end;
end;
set have;
if h.check()=0;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.