Hi Experts,
I want to find all the links that are connected to the address_ID I specified directly or indirectly (or any other nodeID, like email_ID, IP_ID etc.) and return only application_ID. for example:
Application_ID Email_ID IP_ID Address_ID phone_ID
1 email1 ip1 address1 phone1
2 email2 ip2 address1 phone2
3 email3 ip2 address2 phone5
4 email5 ip1 address3 phone13
5 email1 ip13 address13 phone13
if I specify parameter address_id=address1
then the program should able to return all 5 applications because
address1 is connected to ip1 and ip2 from application_id 1 and application_id 2 so both application should be returned in the final output
address1 is connected to ip1 which also connected to application_id 4 so appliation_id 4 should be returned in the final output
address1 is connected to ip2 which is connected to application_id 3 so applicaiton_id 3 should be returned in the final output
address1 is connected to email1 which is conneted to application_id 5 so application_id 5 should be returned in the final output
I have used pro sql join to handle this problem but it seems to be inefficient as I need to specify number of time the iteration should run.
I want to achieve the outcome using hash table if possible!
I hope I made it clear.
Thanks
Build the set of all connections, then use the subgraphs macro described here. It is a hash solution to this problem.
Thanks for your reply. but your program is finding the link between node which is not the interests of mine.
I am seeking for the application_id that linked to address_id (I specified) either directly or indirectly.
Since those four variables are independent respectively, it is easy for SQL. data have; input Application_ID (Email_ID IP_ID Address_ID phone_ID) ( :$20.); cards; 1 email1 ip1 address1 phone1 2 email2 ip2 address1 phone2 3 email3 ip2 address2 phone5 4 email5 ip1 address3 phone13 5 email1 ip13 address13 phone13 ; run; %let add=address1; proc sql; select application_id from have where email_id in (select email_id from have where address_id="&add") union select application_id from have where ip_id in (select ip_id from have where address_id="&add") union select application_id from have where phone_id in (select phone_id from have where address_id="&add") ; quit;
Hi Ksharp,
Thanks for your reply but I think your code only return part of the node that application linked to, it can go further than that.
like the following example:
Application_ID Email_ID IP_ID Address_ID phone_ID
1 email1 ip1 address1 phone1
2 email2 ip1 address2 phone2
3 email2 ip2 address3 phone3
you code only return application_id 1 and application_id 2
but if you look at the dataset address1 from application_id 1 is linked to application_id 2 via ip1 and application_id 2 is linked to application_id 3 via email2 so application_id 1 is linked to application_id 3 indirectly
Replace <= by < if you don't want clusters with a single ID
data have;
length Application_ID 8 Email_ID IP_ID Address_ID phone_ID $12;
input Application_ID Email_ID IP_ID Address_ID phone_ID;
datalines;
1 email1 ip1 address1 phone1
2 email2 ip2 address1 phone2
3 email3 ip2 address2 phone5
4 email5 ip1 address3 phone13
5 email1 ip13 address13 phone13
6 email9 ip19 address19 phone19
;
proc sql;
create table arcs as
select
a.application_ID as from,
b.application_ID as to
from
have as a inner join
have as b
on
a.Email_ID = b.Email_ID or
a.IP_ID = b.IP_ID or
a.Address_ID = b.Address_ID or
a.phone_ID = b.phone_ID
where a.application_ID <= b.application_ID;
quit;
%include "&sasforum\subgraphsmacro.sas";
%subgraphs(arcs);
proc print data=clusters; run;
Hi PG,
I think your program only find direct relationship between two applications for a specified address_id. but it failed to find the indirect relationship like the following data.
1 email1 ip1 address1 phone1
2 email2 ip1 address2 phone2
3 email2 ip2 address3 phone3
1 is having relationship with 3 if address1
another question is in the subgraph macro, it has a macro parameter&colon and it is not defined anywhere.
how do I find this value.
thanks
The procedure I gave you does find the correct answer. Unfortunately the forum parser garbled my macro code. Please try again with the attached version.
OK. I see what is different now. Try this one. data x; input Application_ID (Email_ID IP_ID Address_ID phone_ID) ( :$20.); cards; 1 email1 ip1 address1 phone1 2 email2 ip2 address1 phone2 3 email3 ip2 address2 phone5 4 email5 ip1 address3 phone13 5 email1 ip13 address13 phone13 11 email21 ip21 address21 phone21 12 email22 ip21 address22 phone22 13 email22 ip22 address23 phone23 ; run; data have; set x; array x{*} $ Email_ID IP_ID Address_ID phone_ID; length from to $ 100; do i=1 to dim(x)-1; from=x{i}; do j=i+1 to dim(x); to=x{j};output; end; end; keep from to; run; data full; set have end=last; if _n_ eq 1 then do; declare hash h(); h.definekey('node'); h.definedata('node'); h.definedone(); end; output; node=from; h.replace(); from=to; to=node; output; node=from; h.replace(); if last then h.output(dataset:'node'); drop node; run; data want(keep=node household); declare hash ha(ordered:'a'); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('last'); ha.definedone(); declare hash _ha(hashexp: 20); _ha.definekey('key'); _ha.definedone(); if 0 then set full; declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y'); from_to.definekey('from'); from_to.definedata('to'); from_to.definedone(); if 0 then set node; declare hash no(dataset:'node'); declare hiter hi_no('no'); no.definekey('node'); no.definedata('node'); no.definedone(); do while(hi_no.next()=0); household+1; output; count=1; key=node;_ha.add(); last=node;ha.add(); rc=hi.first(); do while(rc=0); from=last;rx=from_to.find(); do while(rx=0); key=to;ry=_ha.check(); if ry ne 0 then do; node=to;output;rr=no.remove(key:node); key=to;_ha.add(); count+1; last=to;ha.add(); end; rx=from_to.find_next(); end; rc=hi.next(); end; ha.clear();_ha.clear(); end; stop; run; data final_want; if _n_=1 then do; if 0 then set want; declare hash h(dataset:'want'); h.definekey('node'); h.definedata('household'); h.definedone(); end; set x; array x{*} $ Email_ID IP_ID Address_ID phone_ID; do i=1 to dim(x); node=x{i};call missing(household); if h.find()=0 then leave; end; drop i node; run; proc print noobs;run;
Thanks for your reply!
your code is very nice! but to further extend a bit, if the data contain missing values like one or all the columns are blank for some applicaitons, how do you handle this problem?
I modified your code a little like below:
data have;
set n;
array x{*} $ Customer_Contact_Number name_dob_cluster1 driver_cluster1 address_75_cluster1;
if driver_cluster1='1' then
driver_cluster1='';
length from to $ 100;
do i=1 to dim(x)-1;
from=x{i};
do j=i+1 to dim(x);
to=x{j};
if from^='' or to^='' then
output;
end;
end;
keep from to;
run;
data have1;
set have;
if from='' then
do;
from1+1;
from=cats('a',from1);
end;
if to='' then
do;
to1+1;
to=cats('b',to1);
end;
run;
so if all of the columns are blank then it should be removed. if one or some of the columns are blank then the blank columns will be assigned a ID like a1, a2 a3 (all the blank column must have a distinct id)
what do you think?
You are on the right way. data have; set x; array x{*} $ Email_ID IP_ID Address_ID phone_ID; length from to $ 100; do i=1 to dim(x)-1; from=x{i}; do j=i+1 to dim(x); to=x{j}; if not missing(x{i}) and not missing(x{j}) then output; end; end; keep from to; run; "if one or some of the columns are blank then the blank columns will be assigned a ID like a1, a2 a3 " I don't understand what you mean. Post an example to describe your question.
I realised using the following statment won't be able to find the correct result because It will ignore both from and to value if one of them is missing.
for example:
from to
1 missing
then that record will be ignored. However it should be included as from=1 might connect to some applications
if not missing(x{i}) and not missing(x{j}) then output;
Also I think something need to be done at X table by giving a distinct number to all the missing value (so in other words, treat the missing value as distinct ID) so as to get the application number at the last step.
anything wrong please let me know thank!
OK. I know what you mean now.
The best way is filling these missing value as unique value just as you said before.
The following code could give you what you want.
CODE NOT TESTED.
data have;
set have;
array x{*} $ _character_;
do i=1 to dim(x);
if missing(x{i}) then do; n+1; x{i}=cats('a',n);end;
end;
run;
And make sure your character variable's length is big enough to hold 'a9999999' .
cool, I have done things similar but takes ages to run for like 1.3 miliions applications. Maybe I need to find a way to make it more efficent processing. thanks a lot!!!!!
Actually you do not have to transform all the missing into axxxxx .
The only problem exist in this scenario is there is only one non-missing value in obs.like:
id address phone email ..........
1 xx .......rest are all missing .
So you can reduce the number of node like:
data have; set x; array x{*} $ Email_ID IP_ID Address_ID phone_ID; length from to $ 100;
if cmiss(of x{*})=dim(x) then delete;
else if cmiss(of x{*})=1 then do;
from=coalescec(of x{*});to=uuidgen(0);output; end;
end;
else do;
do i=1 to dim(x)-1; from=x{i}; do j=i+1 to dim(x); to=x{j}; if not missing(from) and not missing(to) then output; end; end;
end;
keep from to; run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.