How to find link between nodes

gyambqt · Posted 03-16-2017 08:06 PM

Hi Experts,

I want to find all the links that are connected to the address_ID I specified directly or indirectly (or any other nodeID, like email_ID, IP_ID etc.) and return only application_ID. for example:

Application_ID Email_ID IP_ID Address_ID phone_ID

1 email1 ip1 address1 phone1

2 email2 ip2 address1 phone2

3 email3 ip2 address2 phone5

4 email5 ip1 address3 phone13

5 email1 ip13 address13 phone13

if I specify parameter address_id=address1

then the program should able to return all 5 applications because

address1 is connected to ip1 and ip2 from application_id 1 and application_id 2 so both application should be returned in the final output

address1 is connected to ip1 which also connected to application_id 4 so appliation_id 4 should be returned in the final output

address1 is connected to ip2 which is connected to application_id 3 so applicaiton_id 3 should be returned in the final output

address1 is connected to email1 which is conneted to application_id 5 so application_id 5 should be returned in the final output

I have used pro sql join to handle this problem but it seems to be inefficient as I need to specify number of time the iteration should run.

I want to achieve the outcome using hash table if possible!

I hope I made it clear.

Thanks

PGStats · Posted 03-16-2017 11:01 PM

Build the set of all connections, then use the subgraphs macro described here. It is a hash solution to this problem.

PG

gyambqt · Posted 03-17-2017 12:35 AM

Thanks for your reply. but your program is finding the link between node which is not the interests of mine.
I am seeking for the application_id that linked to address_id (I specified) either directly or indirectly.

Ksharp · Posted 03-16-2017 11:31 PM

Since those four variables are independent respectively,
it is easy for SQL.



data have;
input Application_ID   (Email_ID  IP_ID Address_ID phone_ID) ( :$20.);
cards;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
;
run;

%let add=address1;

proc sql;


select application_id
 from have
  where email_id in
(select email_id from have where address_id="&add")
  
union

select application_id
 from have
  where ip_id in
(select ip_id from have where address_id="&add")

union

select application_id
 from have
  where phone_id in
(select phone_id from have where address_id="&add")

 ;
 quit;

gyambqt · Posted 03-17-2017 12:14 AM

Hi Ksharp,

Thanks for your reply but I think your code only return part of the node that application linked to, it can go further than that.

like the following example:

Application_ID Email_ID IP_ID Address_ID phone_ID

1 email1 ip1 address1 phone1

2 email2 ip1 address2 phone2

3 email2 ip2 address3 phone3

you code only return application_id 1 and application_id 2

but if you look at the dataset address1 from application_id 1 is linked to application_id 2 via ip1 and application_id 2 is linked to application_id 3 via email2 so application_id 1 is linked to application_id 3 indirectly

PGStats · Posted 03-17-2017 12:40 AM

Replace <= by < if you don't want clusters with a single ID

data have;
length Application_ID 8 Email_ID IP_ID Address_ID phone_ID $12;
input Application_ID   Email_ID  IP_ID Address_ID phone_ID;
datalines;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
6                            email9  ip19       address19  phone19
;

proc sql;
create table arcs as
select 
    a.application_ID as from,
    b.application_ID as to
from 
    have as a inner join
    have as b
    on 
        a.Email_ID = b.Email_ID or
        a.IP_ID = b.IP_ID or
        a.Address_ID = b.Address_ID or 
        a.phone_ID = b.phone_ID
where a.application_ID <= b.application_ID;
quit;

%include "&sasforum\subgraphsmacro.sas";

%subgraphs(arcs);

proc print data=clusters; run;

PG

gyambqt · Posted 03-17-2017 12:58 AM

Hi PG,

I think your program only find direct relationship between two applications for a specified address_id. but it failed to find the indirect relationship like the following data.

1 email1 ip1 address1 phone1

2 email2 ip1 address2 phone2

3 email2 ip2 address3 phone3

1 is having relationship with 3 if address1

another question is in the subgraph macro, it has a macro parameter&colon and it is not defined anywhere.

how do I find this value.

thanks

PGStats · Posted 03-17-2017 01:21 PM

The procedure I gave you does find the correct answer. Unfortunately the forum parser garbled my macro code. Please try again with the attached version.

PG

Ksharp · Posted 03-17-2017 06:31 AM

OK. I see what is different now.
Try this one.





data x;
input Application_ID   (Email_ID  IP_ID Address_ID phone_ID) ( :$20.);
cards;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
11 email21 ip21 address21 phone21
12 email22 ip21 address22 phone22
13 email22 ip22 address23 phone23
;
run;

data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;
 do i=1 to dim(x)-1;
  from=x{i};
  do j=i+1 to dim(x);
   to=x{j};output;
  end;
 end;
 keep from to;
run;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data final_want;
if _n_=1 then do;
 if 0 then set want;
 declare hash h(dataset:'want');
 h.definekey('node');
 h.definedata('household');
 h.definedone();
end;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 do i=1 to dim(x);
  node=x{i};call missing(household);
  if h.find()=0 then leave;
 end;
 drop i node;
run;
proc print noobs;run;

gyambqt · Posted 03-19-2017 11:40 PM

Thanks for your reply!

your code is very nice! but to further extend a bit, if the data contain missing values like one or all the columns are blank for some applicaitons, how do you handle this problem?

I modified your code a little like below:

data have;

set n;

array x{*} $ Customer_Contact_Number name_dob_cluster1 driver_cluster1 address_75_cluster1;

if driver_cluster1='1' then

driver_cluster1='';

length from to $ 100;

do i=1 to dim(x)-1;

from=x{i};

do j=i+1 to dim(x);

to=x{j};

if from^='' or to^='' then

output;

end;

keep from to;

run;

data have1;

set have;

if from='' then

do;

from1+1;

from=cats('a',from1);

end;

if to='' then

do;

to1+1;

to=cats('b',to1);

end;

run;

so if all of the columns are blank then it should be removed. if one or some of the columns are blank then the blank columns will be assigned a ID like a1, a2 a3 (all the blank column must have a distinct id)

what do you think?

Ksharp · Posted 03-20-2017 01:38 AM

You are on the right way.


data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;
 do i=1 to dim(x)-1;
  from=x{i};
  do j=i+1 to dim(x);
   to=x{j};

   if not missing(x{i}) and not missing(x{j}) then output;
  
  end;
 end;
 keep from to;
run;



"if one or some of the columns are blank then the blank columns will be assigned a ID like a1, a2 a3 "
I don't understand what you mean. Post an example to describe your question.

gyambqt · Posted 03-20-2017 01:59 AM

I realised using the following statment won't be able to find the correct result because It will ignore both from and to value if one of them is missing.

for example:

from to

1 missing

then that record will be ignored. However it should be included as from=1 might connect to some applications

 if not missing(x{i}) and not missing(x{j}) then output;

Also I think something need to be done at X table by giving a distinct number to all the missing value (so in other words, treat the missing value as distinct ID) so as to get the application number at the last step.

anything wrong please let me know thank!

Ksharp · Posted 03-20-2017 02:15 AM

OK. I know what you mean now.

The best way is filling these missing value as unique value just as you said before.

The following code could give you what you want.

CODE NOT TESTED.

data have;

set have;

array x{*} $ _character_;

do i=1 to dim(x);

if missing(x{i}) then do; n+1; x{i}=cats('a',n);end;

end;

run;

And make sure your character variable's length is big enough to hold 'a9999999' .

gyambqt · Posted 03-20-2017 02:37 AM

cool, I have done things similar but takes ages to run for like 1.3 miliions applications. Maybe I need to find a way to make it more efficent processing. thanks a lot!!!!!

Ksharp · Posted 03-20-2017 06:07 AM

Actually you do not have to transform all the missing into axxxxx .

The only problem exist in this scenario is there is only one non-missing value in obs.like:

id address phone email ..........

1 xx .......rest are all missing .

So you can reduce the number of node like:

data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;

if cmiss(of x{*})=dim(x) then delete;
 else if cmiss(of x{*})=1 then do;
     from=coalescec(of x{*});to=uuidgen(0);output; end;
 end;
  else do;

          do i=1 to dim(x)-1;
           from=x{i};
            do j=i+1 to dim(x);
             to=x{j};
             if not missing(from) and not missing(to) then output;
             end;
          end;

       end;


 keep from to;
run;

How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Re: How to find link between nodes

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away