BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Shayan2012
Quartz | Level 8

Hi All,

I've been struggling to write a neat code for some task but gets very cumbersome and complex each time.

I have a data set like below.

NAMECompanyINDUSTRY
JohnAFinance
JohnBfinance
JohnCinformation
JohnDInsurance
Mark
Mark

Whenever a person is present in two companies, there is a link between these two companies. What I am trying to compute is the number of links between a company with a company from same or another industry. For example, in the above, as long as John is present in Company A and B of finance, C of information and D of Insurance , I want to count 2 outside links and 1 inside link for A. So, my result should be something like this:

CompanyLinks outside industryLinks inside industry
A21
B21
C10
D10

I feel like I am missing some procs that might get my work to be done easier. Would some one please help me with the basic idea and Procs? Thanks a lot!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You need to be clearer on how to count.  Try this.  It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match). 

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

proc sql noprint ;

  create table xx as

  select a.name

       , a.industry=b.industry as inside

       , a.industry^=b.industry as outside

         , a.company as companya

         , b.company as companyb

         , a.industry as industrya

         , b.industry as industryb

  from have a

     , have b

  where a.name = b.name

    and not (a.industry = b.industry and a.company=b.company)

  ;

quit;

proc print;

run;

proc sql noprint ;

  create table want as

  select companya , sum(inside) as num_inside , sum(outside) as num_outside

  from xx

  group by 1

  order by 1

  ;

quit;

proc print;

run;

                    num_       num_

Obs    companya    inside    outside

1        A           1         2

2        B           1         2

3        C           0         3

4        D           0         3

View solution in original post

6 REPLIES 6
Tom
Super User Tom
Super User

You need to be clearer on how to count.  Try this.  It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match). 

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

proc sql noprint ;

  create table xx as

  select a.name

       , a.industry=b.industry as inside

       , a.industry^=b.industry as outside

         , a.company as companya

         , b.company as companyb

         , a.industry as industrya

         , b.industry as industryb

  from have a

     , have b

  where a.name = b.name

    and not (a.industry = b.industry and a.company=b.company)

  ;

quit;

proc print;

run;

proc sql noprint ;

  create table want as

  select companya , sum(inside) as num_inside , sum(outside) as num_outside

  from xx

  group by 1

  order by 1

  ;

quit;

proc print;

run;

                    num_       num_

Obs    companya    inside    outside

1        A           1         2

2        B           1         2

3        C           0         3

4        D           0         3

PGStats
Opal | Level 21

I agree with Tom, SQL is your friend here, and it prefers clean data. I asumed that links provided by more than one name count only once and that companies associated with more than one industry count only once :

data have;
length Name Company Industry $24;
informat Name Company Industry $upcase.;
input Name Company Industry;
datalines;
John A Finance
John B finance
John C information
John D Insurance
Henry   A   FINANCE
Henry   B   finance
Henry   C   Information
Henry   D   Banking
;

proc sql;
create table links as
select
     Company,
     sum(sameIndustry) as insideLinks,
     sum(not sameIndustry) as outsideLinks
from (
     select distinct
          a.Company as Company,
          b.Company as otherCompany,
          a.Industry=b.Industry as sameIndustry
     from
          have as a left join
          have as b on a.Name=b.Name and a.Company ne b.Company
     )
group by Company;


select * from links;


quit;

                Company                   insideLinks  outsideLinks
                ___________________________________________________
                A                                   1             2
                B                                   1             2
                C                                   0             3
                D                                   0             3

PG

PG
Haikuo
Onyx | Level 15

I concur with Tom and PG. You need to clarify your rules further to rationalize your output. Tom and PG’s interpretation is clear and consistent, therefore programmatic doable. I always believe that for any one of Proc SQL approaches, you can find its counterpart in data step Hash(), and this one seems no exception:

data have;

input NAME $ Company $ INDUSTRY $20.;

cards;

John A Finance

John B Finance

John C Information

John D Insurance

run;

data want;

  if _n_=1 then do;

    if 0 then set have(rename=(company=_com industry=_ind));

       declare hash h(dataset:'have(rename=(company=_com industry=_ind))', multidata:'y');

       h.definekey('name');

       h.definedata(all:'y');

       h.definedone();

  end;

  set have;

  inside=0;outside=0;

    do rc=h.find() by 0 while (rc=0);

        if industry=_ind and company ne _com then inside+1;

        if industry ne _ind and company ne _com then outside+1;

rc=h.find_next();

       end;

drop rc _:;

run;

Haikuo

Ksharp
Super User

Your post is not clear , why A=2 while B=0 not 2 too ?

  data have;
length Name Company Industry $24;
informat Name Company Industry $upcase.;
input Name Company Industry;
datalines;
John A finance
John B finance
John C information
John D Insurance
Henry   A   finance
Henry   B   finance
Henry   C   Information
Henry   D   Banking
;

 

proc sql;
create table links as
select *,(select count(*) from have where have.name=a.name and have.industry ne a.industry ) as outside,
         (select count(*)-1 from have where have.name=a.name and have.industry eq a.industry ) as inside
 from have as a;
quit;
              

Ksharp

Message was edited by: xia keshan

Shayan2012
Quartz | Level 8

You are absolutely correct. ksharp! . I corrected the table for future references.

Shayan2012
Quartz | Level 8

Wow. Thanks a lot guys. Your codes were super helpful to me. Actually, I was writing the code in matlab with lots of for loops and it was horrible. Guess need to learn a lot of sql!

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 827 views
  • 6 likes
  • 5 in conversation