BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Shayan2012
Quartz | Level 8

Hi All,

I've been struggling to write a neat code for some task but gets very cumbersome and complex each time.

I have a data set like below.

NAMECompanyINDUSTRY
JohnAFinance
JohnBfinance
JohnCinformation
JohnDInsurance
Mark
Mark

Whenever a person is present in two companies, there is a link between these two companies. What I am trying to compute is the number of links between a company with a company from same or another industry. For example, in the above, as long as John is present in Company A and B of finance, C of information and D of Insurance , I want to count 2 outside links and 1 inside link for A. So, my result should be something like this:

CompanyLinks outside industryLinks inside industry
A21
B21
C10
D10

I feel like I am missing some procs that might get my work to be done easier. Would some one please help me with the basic idea and Procs? Thanks a lot!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You need to be clearer on how to count.  Try this.  It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match). 

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

proc sql noprint ;

  create table xx as

  select a.name

       , a.industry=b.industry as inside

       , a.industry^=b.industry as outside

         , a.company as companya

         , b.company as companyb

         , a.industry as industrya

         , b.industry as industryb

  from have a

     , have b

  where a.name = b.name

    and not (a.industry = b.industry and a.company=b.company)

  ;

quit;

proc print;

run;

proc sql noprint ;

  create table want as

  select companya , sum(inside) as num_inside , sum(outside) as num_outside

  from xx

  group by 1

  order by 1

  ;

quit;

proc print;

run;

                    num_       num_

Obs    companya    inside    outside

1        A           1         2

2        B           1         2

3        C           0         3

4        D           0         3

View solution in original post

6 REPLIES 6
Tom
Super User Tom
Super User

You need to be clearer on how to count.  Try this.  It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match). 

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

proc sql noprint ;

  create table xx as

  select a.name

       , a.industry=b.industry as inside

       , a.industry^=b.industry as outside

         , a.company as companya

         , b.company as companyb

         , a.industry as industrya

         , b.industry as industryb

  from have a

     , have b

  where a.name = b.name

    and not (a.industry = b.industry and a.company=b.company)

  ;

quit;

proc print;

run;

proc sql noprint ;

  create table want as

  select companya , sum(inside) as num_inside , sum(outside) as num_outside

  from xx

  group by 1

  order by 1

  ;

quit;

proc print;

run;

                    num_       num_

Obs    companya    inside    outside

1        A           1         2

2        B           1         2

3        C           0         3

4        D           0         3

PGStats
Opal | Level 21

I agree with Tom, SQL is your friend here, and it prefers clean data. I asumed that links provided by more than one name count only once and that companies associated with more than one industry count only once :

data have;
length Name Company Industry $24;
informat Name Company Industry $upcase.;
input Name Company Industry;
datalines;
John A Finance
John B finance
John C information
John D Insurance
Henry   A   FINANCE
Henry   B   finance
Henry   C   Information
Henry   D   Banking
;

proc sql;
create table links as
select
     Company,
     sum(sameIndustry) as insideLinks,
     sum(not sameIndustry) as outsideLinks
from (
     select distinct
          a.Company as Company,
          b.Company as otherCompany,
          a.Industry=b.Industry as sameIndustry
     from
          have as a left join
          have as b on a.Name=b.Name and a.Company ne b.Company
     )
group by Company;


select * from links;


quit;

                Company                   insideLinks  outsideLinks
                ___________________________________________________
                A                                   1             2
                B                                   1             2
                C                                   0             3
                D                                   0             3

PG

PG
Haikuo
Onyx | Level 15

I concur with Tom and PG. You need to clarify your rules further to rationalize your output. Tom and PG’s interpretation is clear and consistent, therefore programmatic doable. I always believe that for any one of Proc SQL approaches, you can find its counterpart in data step Hash(), and this one seems no exception:

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

data want;

  if _n_=1 then do;

    if 0 then set have(rename=(company=_com industry=_ind));

       declare hash h(dataset:'have(rename=(company=_com industry=_ind))', multidata:'y');

       h.definekey('name');

       h.definedata(all:'y');

       h.definedone();

  end;

  set have;

  inside=0;outside=0;

    do rc=h.find() by 0 while (rc=0);

        if industry=_ind and company ne _com then inside+1;

        if industry ne _ind and company ne _com then outside+1;

rc=h.find_next();

       end;

drop rc _:;

run;

Haikuo

Ksharp
Super User

Your post is not clear , why A=2 while B=0 not 2 too ?

  data have;
length Name Company Industry $24;
informat Name Company Industry $upcase.;
input Name Company Industry;
datalines;
John A finance
John B finance
John C information
John D Insurance
Henry   A   finance
Henry   B   finance
Henry   C   Information
Henry   D   Banking
;

 

proc sql;
create table links as
select *,(select count(*) from have where have.name=a.name and have.industry ne a.industry ) as outside,
         (select count(*)-1 from have where have.name=a.name and have.industry eq a.industry ) as inside
 from have as a;
quit;
              

Ksharp

Message was edited by: xia keshan

Shayan2012
Quartz | Level 8

You are absolutely correct. ksharp! . I corrected the table for future references.

Shayan2012
Quartz | Level 8

Wow. Thanks a lot guys. Your codes were super helpful to me. Actually, I was writing the code in matlab with lots of for loops and it was horrible. Guess need to learn a lot of sql!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1061 views
  • 6 likes
  • 5 in conversation