Solved: Re: How to count for links within the data

Shayan2012 · Posted 06-23-2013 12:21 AM

Hi All,

I've been struggling to write a neat code for some task but gets very cumbersome and complex each time.

I have a data set like below.

NAME	Company	INDUSTRY
John	A	Finance
John	B	finance
John	C	information
John	D	Insurance
Mark	…	…
Mark	…	…

Whenever a person is present in two companies, there is a link between these two companies. What I am trying to compute is the number of links between a company with a company from same or another industry. For example, in the above, as long as John is present in Company A and B of finance, C of information and D of Insurance , I want to count 2 outside links and 1 inside link for A. So, my result should be something like this:

Company	Links outside industry	Links inside industry
A	2	1
B	2	1
C	1	0
D	1	0

I feel like I am missing some procs that might get my work to be done easier. Would some one please help me with the basic idea and Procs? Thanks a lot!

Tom · Posted 06-23-2013 11:16 AM

You need to be clearer on how to count. Try this. It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match).

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

proc sql noprint ;

create table xx as

select a.name

, a.industry=b.industry as inside

, a.industry^=b.industry as outside

, a.company as companya

, b.company as companyb

, a.industry as industrya

, b.industry as industryb

from have a

, have b

where a.name = b.name

and not (a.industry = b.industry and a.company=b.company)

;

quit;

proc print;

run;

proc sql noprint ;

create table want as

select companya , sum(inside) as num_inside , sum(outside) as num_outside

from xx

group by 1

order by 1

;

quit;

proc print;

run;

num_ num_

Obs companya inside outside

1 A 1 2

2 B 1 2

3 C 0 3

4 D 0 3

View solution in original post

Tom · Posted 06-23-2013 11:16 AM

You need to be clearer on how to count. Try this. It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match).

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

proc sql noprint ;

create table xx as

select a.name

, a.industry=b.industry as inside

, a.industry^=b.industry as outside

, a.company as companya

, b.company as companyb

, a.industry as industrya

, b.industry as industryb

from have a

, have b

where a.name = b.name

and not (a.industry = b.industry and a.company=b.company)

;

quit;

proc print;

run;

proc sql noprint ;

create table want as

select companya , sum(inside) as num_inside , sum(outside) as num_outside

from xx

group by 1

order by 1

;

quit;

proc print;

run;

num_ num_

Obs companya inside outside

1 A 1 2

2 B 1 2

3 C 0 3

4 D 0 3

PGStats · Posted 06-23-2013 03:43 PM

I agree with Tom, SQL is your friend here, and it prefers clean data. I asumed that links provided by more than one name count only once and that companies associated with more than one industry count only once :

data have;
length Name Company Industry $24;
informat Name Company Industry $upcase.;
input Name Company Industry;
datalines;
John A Finance
John B finance
John C information
John D Insurance
Henry   A   FINANCE
Henry   B   finance
Henry   C   Information
Henry   D   Banking
;

proc sql;
create table links as
select
     Company,
     sum(sameIndustry) as insideLinks,
     sum(not sameIndustry) as outsideLinks
from (
     select distinct
          a.Company as Company,
          b.Company as otherCompany,
          a.Industry=b.Industry as sameIndustry
     from
          have as a left join
          have as b on a.Name=b.Name and a.Company ne b.Company
     )
group by Company;

select * from links;

quit;

                Company                   insideLinks outsideLinks
                ___________________________________________________
                A                                   1             2
                B                                   1             2
                C                                   0             3
                D                                   0             3

PG

Haikuo · Posted 06-23-2013 07:10 PM

I concur with Tom and PG. You need to clarify your rules further to rationalize your output. Tom and PG’s interpretation is clear and consistent, therefore programmatic doable. I always believe that for any one of Proc SQL approaches, you can find its counterpart in data step Hash(), and this one seems no exception:

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

data want;

if _n_=1 then do;

if 0 then set have(rename=(company=_com industry=_ind));

declare hash h(dataset:'have(rename=(company=_com industry=_ind))', multidata:'y');

h.definekey('name');

h.definedata(all:'y');

h.definedone();

end;

set have;

inside=0;outside=0;

do rc=h.find() by 0 while (rc=0);

if industry=_ind and company ne _com then inside+1;

if industry ne _ind and company ne _com then outside+1;

rc=h.find_next();

end;

drop rc _:;

run;

Haikuo

Ksharp · Posted 06-23-2013 09:22 PM

Your post is not clear , why A=2 while B=0 not 2 too ?

  data have;
length Name Company Industry $24;
informat Name Company Industry $upcase.;
input Name Company Industry;
datalines;
John A finance
John B finance
John C information
John D Insurance
Henry   A   finance
Henry   B   finance
Henry   C   Information
Henry   D   Banking
;

 

proc sql;
create table links as
select *,(select count(*) from have where have.name=a.name and have.industry ne a.industry ) as outside,
         (select count(*)-1 from have where have.name=a.name and have.industry eq a.industry ) as inside
 from have as a;
quit;

Ksharp

Message was edited by: xia keshan

Shayan2012 · Posted 06-24-2013 10:52 PM

You are absolutely correct. ksharp! . I corrected the table for future references.

Shayan2012 · Posted 06-24-2013 10:51 PM

Wow. Thanks a lot guys. Your codes were super helpful to me. Actually, I was writing the code in matlab with lots of for loops and it was horrible. Guess need to learn a lot of sql!

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away