BookmarkSubscribeRSS Feed
gyambqt
Obsidian | Level 7

Hi Experts,

I want to find all the links that are connected to the address_ID I specified directly or indirectly (or any other nodeID, like email_ID, IP_ID etc.)  and return only application_ID. for example:

 

Application_ID   Email_ID  IP_ID Address_ID phone_ID

1                            email1    ip1       address1     phone1    

2                            email2    ip2        address1    phone2

3                            email3   ip2         address2    phone5

4                            email5    ip1         address3     phone13

5                            email1  ip13       address13  phone13

 

if I specify parameter address_id=address1

then the program should able to return all 5 applications because

 

address1 is connected to ip1 and ip2 from application_id 1 and application_id 2 so both application should be returned in the final output

 

 address1 is connected to ip1 which also  connected to application_id 4 so appliation_id 4 should be returned in the final output

 

address1 is connected to ip2 which is connected to application_id 3 so applicaiton_id 3 should be returned in the final output

 

address1 is connected to email1 which is conneted to application_id 5 so application_id 5 should be returned in the final output

 

I have used pro sql join to handle this problem but it seems to be inefficient as I need to specify number of time the iteration should run.

 

I want to achieve the outcome using hash table if possible!

 

I hope I made it clear.

 

Thanks

18 REPLIES 18
PGStats
Opal | Level 21

Build the set of all connections, then use the subgraphs macro described here. It is a hash solution to this problem.

 

PG
gyambqt
Obsidian | Level 7

Thanks for your reply. but your program is finding the link between node which is not the interests of mine.
I am seeking for the application_id that linked to address_id (I specified) either directly or indirectly.

Ksharp
Super User
Since those four variables are independent respectively,
it is easy for SQL.



data have;
input Application_ID   (Email_ID  IP_ID Address_ID phone_ID) ( :$20.);
cards;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
;
run;

%let add=address1;

proc sql;


select application_id
 from have
  where email_id in
(select email_id from have where address_id="&add")
  
union

select application_id
 from have
  where ip_id in
(select ip_id from have where address_id="&add")

union

select application_id
 from have
  where phone_id in
(select phone_id from have where address_id="&add")

 ;
 quit;
  
  

gyambqt
Obsidian | Level 7

Hi Ksharp,

Thanks for your reply but I think your code only return part of the node that application linked to, it can go further than that.

 

 

like the following example:

Application_ID Email_ID IP_ID Address_ID phone_ID

1 email1 ip1 address1 phone1

2 email2 ip1 address2 phone2

3 email2 ip2 address3 phone3

you code only return application_id 1 and application_id 2

 

but if you look at the dataset address1 from application_id 1 is linked to application_id 2 via ip1 and application_id 2 is linked to application_id 3 via email2 so application_id 1 is linked to application_id 3 indirectly

 

 

 

 

PGStats
Opal | Level 21

Replace <= by < if you don't want clusters with a single ID

 

data have;
length Application_ID 8 Email_ID IP_ID Address_ID phone_ID $12;
input Application_ID   Email_ID  IP_ID Address_ID phone_ID;
datalines;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
6                            email9  ip19       address19  phone19
;

proc sql;
create table arcs as
select 
    a.application_ID as from,
    b.application_ID as to
from 
    have as a inner join
    have as b
    on 
        a.Email_ID = b.Email_ID or
        a.IP_ID = b.IP_ID or
        a.Address_ID = b.Address_ID or 
        a.phone_ID = b.phone_ID
where a.application_ID <= b.application_ID;
quit;

%include "&sasforum\subgraphsmacro.sas";

%subgraphs(arcs);

proc print data=clusters; run;

PG
gyambqt
Obsidian | Level 7

Hi PG,

I think your program only find direct relationship between two applications for a specified address_id. but it failed to find the indirect relationship like the following data.

 

1 email1 ip1 address1 phone1

2 email2 ip1 address2 phone2

3 email2 ip2 address3 phone3

 

1 is having relationship with 3 if address1

 

 

another question is in the subgraph macro, it has a macro parameter&colon and it is not defined anywhere.

how do I find this value.

 

thanks

 

 

PGStats
Opal | Level 21

The procedure I gave you does find the correct answer. Unfortunately the forum parser garbled my macro code. Please try again with the attached version.

PG
Ksharp
Super User
OK. I see what is different now.
Try this one.





data x;
input Application_ID   (Email_ID  IP_ID Address_ID phone_ID) ( :$20.);
cards;
1                            email1    ip1       address1     phone1    
2                            email2    ip2        address1    phone2
3                            email3   ip2         address2    phone5
4                            email5    ip1         address3     phone13
5                            email1  ip13       address13  phone13
11 email21 ip21 address21 phone21
12 email22 ip21 address22 phone22
13 email22 ip22 address23 phone23
;
run;

data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;
 do i=1 to dim(x)-1;
  from=x{i};
  do j=i+1 to dim(x);
   to=x{j};output;
  end;
 end;
 keep from to;
run;



data full;
  set have end=last;
  if _n_ eq 1 then do;
   declare hash h();
    h.definekey('node');
     h.definedata('node');
     h.definedone();
  end;
  output;
  node=from; h.replace();
  from=to; to=node;
  output;
  node=from; h.replace();
  if last then h.output(dataset:'node');
  drop node;
run;


data want(keep=node household);
declare hash ha(ordered:'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('last');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedone();

if 0 then set full;
declare hash from_to(dataset:'full(where=(from is not missing and to is not missing))',hashexp:20,multidata:'y');
 from_to.definekey('from');
 from_to.definedata('to');
 from_to.definedone();

if 0 then set node;
declare hash no(dataset:'node');
declare hiter hi_no('no');
 no.definekey('node');
 no.definedata('node');
 no.definedone();
 

do while(hi_no.next()=0);
 household+1; output;
 count=1;
 key=node;_ha.add();
 last=node;ha.add();
 rc=hi.first();
 do while(rc=0);
   from=last;rx=from_to.find();
   do while(rx=0);
     key=to;ry=_ha.check();
      if ry ne 0 then do;
       node=to;output;rr=no.remove(key:node);
       key=to;_ha.add();
       count+1;
       last=to;ha.add();
      end;
      rx=from_to.find_next();
   end;
   rc=hi.next();
end;
ha.clear();_ha.clear();
end;
stop;
run;

data final_want;
if _n_=1 then do;
 if 0 then set want;
 declare hash h(dataset:'want');
 h.definekey('node');
 h.definedata('household');
 h.definedone();
end;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 do i=1 to dim(x);
  node=x{i};call missing(household);
  if h.find()=0 then leave;
 end;
 drop i node;
run;
proc print noobs;run;

gyambqt
Obsidian | Level 7

Thanks for your reply!

your code is very nice! but to further extend a bit, if the data contain missing values like one or all the columns are blank for some applicaitons, how do you handle this problem?

I modified your code a little like below:

 

data have;

set n;

array x{*} $ Customer_Contact_Number name_dob_cluster1 driver_cluster1 address_75_cluster1;

if driver_cluster1='1' then

driver_cluster1='';

length from to $ 100;

do i=1 to dim(x)-1;

from=x{i};

do j=i+1 to dim(x);

to=x{j};

 

if from^='' or to^='' then

output;

end;

end;

keep from to;

run;

data have1;

set have;

if from='' then

do;

from1+1;

from=cats('a',from1);

end;

if to='' then

do;

to1+1;

to=cats('b',to1);

end;

run;

 

so if all of the columns are blank then it should be removed. if one or some of the columns are blank then the blank columns will be assigned a ID like a1, a2 a3 (all the blank column must have a distinct id)

 

what do you think?

Ksharp
Super User
You are on the right way.


data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;
 do i=1 to dim(x)-1;
  from=x{i};
  do j=i+1 to dim(x);
   to=x{j};

   if not missing(x{i}) and not missing(x{j}) then output;
  
  end;
 end;
 keep from to;
run;



"if one or some of the columns are blank then the blank columns will be assigned a ID like a1, a2 a3 "
I don't understand what you mean. Post an example to describe your question.


gyambqt
Obsidian | Level 7

I realised using the following statment won't be able to find the correct result because It will ignore both from and to value if one of them is missing.

for example:

from to

1      missing

 

then that record will be ignored. However it should be included as from=1 might connect to some applications

 if not missing(x{i}) and not missing(x{j}) then output;

 Also I think something need to be done at X table by giving a distinct number to all the missing value (so in other words, treat the missing value as distinct ID) so as to get the application number at the last step.

 

anything wrong please let me know thank!

Ksharp
Super User

OK. I know what you mean now.

The best way is filling these missing value as unique value just as you said before.

 

The following code could give you what you want.

CODE NOT TESTED.

 

data have;

 set have;

 array x{*} $ _character_;

 do i=1 to dim(x);

   if missing(x{i}) then do; n+1; x{i}=cats('a',n);end;

 end;

run;

 

And make sure your character variable's length is big enough to hold 'a9999999' .

gyambqt
Obsidian | Level 7

cool, I have done things similar but takes ages to run for like 1.3 miliions applications. Maybe I need to find a way to make it more efficent processing. thanks a lot!!!!!

Ksharp
Super User

Actually you do not have to transform all the missing into axxxxx .

The only problem exist in this scenario is there is only one non-missing value in obs.like:

 

id  address phone email ..........

1    xx         .......rest are all missing .

 

So you can reduce the number of node like:

 

 

data have;
 set x;
 array x{*} $ Email_ID  IP_ID Address_ID phone_ID;
 length from to $ 100;

if cmiss(of x{*})=dim(x) then delete;
else if cmiss(of x{*})=1 then do;
from=coalescec(of x{*});to=uuidgen(0);output; end;
end;
else do;
do i=1 to dim(x)-1; from=x{i}; do j=i+1 to dim(x); to=x{j}; if not missing(from) and not missing(to) then output; end; end;

end;

keep from to; run;

  

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 18 replies
  • 1072 views
  • 0 likes
  • 4 in conversation