Hi All,
I've been struggling to write a neat code for some task but gets very cumbersome and complex each time.
I have a data set like below.
|
Whenever a person is present in two companies, there is a link between these two companies. What I am trying to compute is the number of links between a company with a company from same or another industry. For example, in the above, as long as John is present in Company A and B of finance, C of information and D of Insurance , I want to count 2 outside links and 1 inside link for A. So, my result should be something like this:
Company | Links outside industry | Links inside industry |
A | 2 | 1 |
B | 2 | 1 |
C | 1 | 0 |
D | 1 | 0 |
I feel like I am missing some procs that might get my work to be done easier. Would some one please help me with the basic idea and Procs? Thanks a lot!
You need to be clearer on how to count. Try this. It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match).
data have;
input NAME $ Company $ INDUSTRY $20.;
cards;
John A Finance
John B Finance
John C Information
John D Insurance
run;
proc sql noprint ;
create table xx as
select a.name
, a.industry=b.industry as inside
, a.industry^=b.industry as outside
, a.company as companya
, b.company as companyb
, a.industry as industrya
, b.industry as industryb
from have a
, have b
where a.name = b.name
and not (a.industry = b.industry and a.company=b.company)
;
quit;
proc print;
run;
proc sql noprint ;
create table want as
select companya , sum(inside) as num_inside , sum(outside) as num_outside
from xx
group by 1
order by 1
;
quit;
proc print;
run;
num_ num_
Obs companya inside outside
1 A 1 2
2 B 1 2
3 C 0 3
4 D 0 3
You need to be clearer on how to count. Try this. It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match).
data have;
input NAME $ Company $ INDUSTRY $20.;
cards;
John A Finance
John B Finance
John C Information
John D Insurance
run;
proc sql noprint ;
create table xx as
select a.name
, a.industry=b.industry as inside
, a.industry^=b.industry as outside
, a.company as companya
, b.company as companyb
, a.industry as industrya
, b.industry as industryb
from have a
, have b
where a.name = b.name
and not (a.industry = b.industry and a.company=b.company)
;
quit;
proc print;
run;
proc sql noprint ;
create table want as
select companya , sum(inside) as num_inside , sum(outside) as num_outside
from xx
group by 1
order by 1
;
quit;
proc print;
run;
num_ num_
Obs companya inside outside
1 A 1 2
2 B 1 2
3 C 0 3
4 D 0 3
I agree with Tom, SQL is your friend here, and it prefers clean data. I asumed that links provided by more than one name count only once and that companies associated with more than one industry count only once :
data have;
length Name Company Industry $24;
informat Name Company Industry $upcase.;
input Name Company Industry;
datalines;
John A Finance
John B finance
John C information
John D Insurance
Henry A FINANCE
Henry B finance
Henry C Information
Henry D Banking
;
proc sql;
create table links as
select
Company,
sum(sameIndustry) as insideLinks,
sum(not sameIndustry) as outsideLinks
from (
select distinct
a.Company as Company,
b.Company as otherCompany,
a.Industry=b.Industry as sameIndustry
from
have as a left join
have as b on a.Name=b.Name and a.Company ne b.Company
)
group by Company;
select * from links;
quit;
Company insideLinks outsideLinks
___________________________________________________
A 1 2
B 1 2
C 0 3
D 0 3
PG
I concur with Tom and PG. You need to clarify your rules further to rationalize your output. Tom and PG’s interpretation is clear and consistent, therefore programmatic doable. I always believe that for any one of Proc SQL approaches, you can find its counterpart in data step Hash(), and this one seems no exception:
data have;
input NAME $ Company $ INDUSTRY $20.;
cards;
John A Finance
John B Finance
John C Information
John D Insurance
run;
data want;
if _n_=1 then do;
if 0 then set have(rename=(company=_com industry=_ind));
declare hash h(dataset:'have(rename=(company=_com industry=_ind))', multidata:'y');
h.definekey('name');
h.definedata(all:'y');
h.definedone();
end;
set have;
inside=0;outside=0;
do rc=h.find() by 0 while (rc=0);
if industry=_ind and company ne _com then inside+1;
if industry ne _ind and company ne _com then outside+1;
rc=h.find_next();
end;
drop rc _:;
run;
Haikuo
Your post is not clear , why A=2 while B=0 not 2 too ?
data have; length Name Company Industry $24; informat Name Company Industry $upcase.; input Name Company Industry; datalines; John A finance John B finance John C information John D Insurance Henry A finance Henry B finance Henry C Information Henry D Banking ; proc sql; create table links as select *,(select count(*) from have where have.name=a.name and have.industry ne a.industry ) as outside, (select count(*)-1 from have where have.name=a.name and have.industry eq a.industry ) as inside from have as a; quit;
Ksharp
Message was edited by: xia keshan
You are absolutely correct. ksharp! . I corrected the table for future references.
Wow. Thanks a lot guys. Your codes were super helpful to me. Actually, I was writing the code in matlab with lots of for loops and it was horrible. Guess need to learn a lot of sql!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.