I suspect the problem with an SQL solution is that you have to recursively find links until no more can be found. I.e. with every newly linked ID, you have to review the remaining obs for a even newer link by phone or email. Result: you would need to have a criterion within your sql code to determine that there are no more to be added. I.e. you have to know when to stop joins.
Do you have SAS/OR? If so, then PROC OPTNET can solve the problem of recursion. It will find all the possible links and "connected components" in the entire data set, including the components connected to your desired id (say "5534385355" in your case).
To do so, you have to find all links via phone number, and all links via email, as in :
data have;
input ID Phone Email :$30.;
datalines;
5534385355 385534616655 ABC@GMAIL.COM
5548558464 3588333353 ABC12.com
5564556334 6366636666 ABC2@gmail.com
554814331 6001066401 ABC10.com
5564556331 6366636666 ABC3@gmail.com
5548558456 68136551346 ABC7.com
5538083556 6066688561 ABC4D.COM
554816644 68141055686 ABC9.com
5563065535 6355555563486 ABC5@yahoo.com
55155355386 6460535366 ABC11.com
553438530 385534616655 ABC1@GMAIL.COM
554884666 6354641861 ABC13.com
5534385355 385534616655 ABC@GMAIL.COM
555336340 6836833860 ABC6.com
554363456 6066688561 ABC4D.COM
55464315555 6863083133 ABC8.com
556343006 6355555563486 ABC5@yahoo.com
run;
proc freq data=have noprint;
tables phone*id / out=phid (keep=phone id);
tables email*id / out=emid (keep=email id);
run;
The datasets produced by proc freq will effectively list, in order, all the ID's for each phone (in PHID), and all the ID's for each email (EMID). From that you can create all discovered pairs:
data pairs (keep=id1 id2 weight);
set phid emid;
retain weight 1;
id1=lag(id);
id2=id;
if phone=lag(phone) and email=lag(email);
run;
Now in the above, you might have ID=1001 connected to ID=1002, and another obs with ID=1002 connected to ID=1003. PROC OPTNET below will know that 1003 is also connected (for your purposes) to 1001. even though there is no explicit obs stating that linkage:
proc optnet data_links=pairs out_nodes=id_nodes (rename=(node=id));
data_link_vars from=id1 to=id2;
concomp;
run;
The "concomp" statement tells optnet to create a connection id (concomp) that has the same value for every ID in a connected group. Take a look at the dataset OUT_NODES it creates. So take dataset out_nodes and get original data for the concomp that contains ID 5534385355:
data want;
if 0 then set have;
if _n_=1 then do;
declare hash h ();
h.definekey('id');
h.definedata('id');
h.definedone();
do until (end_of_wantlist);
merge id_nodes (where=(id=5534385355) in=wanted)
id_nodes end=end_of_wantlist;
by concomp;
if wanted then h.add();
end;
end;
set have;
if h.check()=0;
run;
... View more