Hi All,
I've been struggling to write a neat code for some task but gets very cumbersome and complex each time.
I have a data set like below.
|
Whenever a person is present in two companies, there is a link between these two companies. What I am trying to compute is the number of links between a company with a company from same or another industry. For example, in the above, as long as John is present in Company A and B of finance, C of information and D of Insurance , I want to count 2 outside links and 1 inside link for A. So, my result should be something like this:
Company | Links outside industry | Links inside industry |
A | 2 | 1 |
B | 2 | 1 |
C | 1 | 0 |
D | 1 | 0 |
I feel like I am missing some procs that might get my work to be done easier. Would some one please help me with the basic idea and Procs? Thanks a lot!
You need to be clearer on how to count. Try this. It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match).
data have;
input NAME $ Company $ INDUSTRY $20.;
cards;
John A Finance
John B Finance
John C Information
John D Insurance
run;
proc sql noprint ;
create table xx as
select a.name
, a.industry=b.industry as inside
, a.industry^=b.industry as outside
, a.company as companya
, b.company as companyb
, a.industry as industrya
, b.industry as industryb
from have a
, have b
where a.name = b.name
and not (a.industry = b.industry and a.company=b.company)
;
quit;
proc print;
run;
proc sql noprint ;
create table want as
select companya , sum(inside) as num_inside , sum(outside) as num_outside
from xx
group by 1
order by 1
;
quit;
proc print;
run;
num_ num_
Obs companya inside outside
1 A 1 2
2 B 1 2
3 C 0 3
4 D 0 3
You need to be clearer on how to count. Try this. It will combine the data with itself and generate 0/1 flags based on whether the industry names match. (note you might need to clean your data, I made the records for Finance match).
data have;
input NAME $ Company $ INDUSTRY $20.;
cards;
John A Finance
John B Finance
John C Information
John D Insurance
run;
proc sql noprint ;
create table xx as
select a.name
, a.industry=b.industry as inside
, a.industry^=b.industry as outside
, a.company as companya
, b.company as companyb
, a.industry as industrya
, b.industry as industryb
from have a
, have b
where a.name = b.name
and not (a.industry = b.industry and a.company=b.company)
;
quit;
proc print;
run;
proc sql noprint ;
create table want as
select companya , sum(inside) as num_inside , sum(outside) as num_outside
from xx
group by 1
order by 1
;
quit;
proc print;
run;
num_ num_
Obs companya inside outside
1 A 1 2
2 B 1 2
3 C 0 3
4 D 0 3
I agree with Tom, SQL is your friend here, and it prefers clean data. I asumed that links provided by more than one name count only once and that companies associated with more than one industry count only once :
data have;
length Name Company Industry $24;
informat Name Company Industry $upcase.;
input Name Company Industry;
datalines;
John A Finance
John B finance
John C information
John D Insurance
Henry A FINANCE
Henry B finance
Henry C Information
Henry D Banking
;
proc sql;
create table links as
select
Company,
sum(sameIndustry) as insideLinks,
sum(not sameIndustry) as outsideLinks
from (
select distinct
a.Company as Company,
b.Company as otherCompany,
a.Industry=b.Industry as sameIndustry
from
have as a left join
have as b on a.Name=b.Name and a.Company ne b.Company
)
group by Company;
select * from links;
quit;
Company insideLinks outsideLinks
___________________________________________________
A 1 2
B 1 2
C 0 3
D 0 3
PG
I concur with Tom and PG. You need to clarify your rules further to rationalize your output. Tom and PG’s interpretation is clear and consistent, therefore programmatic doable. I always believe that for any one of Proc SQL approaches, you can find its counterpart in data step Hash(), and this one seems no exception:
data have;
input NAME $ Company $ INDUSTRY $20.;
cards;
John A Finance
John B Finance
John C Information
John D Insurance
run;
data want;
if _n_=1 then do;
if 0 then set have(rename=(company=_com industry=_ind));
declare hash h(dataset:'have(rename=(company=_com industry=_ind))', multidata:'y');
h.definekey('name');
h.definedata(all:'y');
h.definedone();
end;
set have;
inside=0;outside=0;
do rc=h.find() by 0 while (rc=0);
if industry=_ind and company ne _com then inside+1;
if industry ne _ind and company ne _com then outside+1;
rc=h.find_next();
end;
drop rc _:;
run;
Haikuo
Your post is not clear , why A=2 while B=0 not 2 too ?
data have; length Name Company Industry $24; informat Name Company Industry $upcase.; input Name Company Industry; datalines; John A finance John B finance John C information John D Insurance Henry A finance Henry B finance Henry C Information Henry D Banking ; proc sql; create table links as select *,(select count(*) from have where have.name=a.name and have.industry ne a.industry ) as outside, (select count(*)-1 from have where have.name=a.name and have.industry eq a.industry ) as inside from have as a; quit;
Ksharp
Message was edited by: xia keshan
You are absolutely correct. ksharp! . I corrected the table for future references.
Wow. Thanks a lot guys. Your codes were super helpful to me. Actually, I was writing the code in matlab with lots of for loops and it was horrible. Guess need to learn a lot of sql!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.