I am trying to define a treatment group based on the following relationship.
What's going on in the diagram?
What am I trying to do?
What did I do in the code?
Below is what I have done so far as well as my sample. This is an extension to a previous problem that @Shmuel and @gamotte have provided extensive help with. I am posting this as a new question since defining the control and treatment group is often used by researchers as well as practitioners and alternative codes may help others as well. I was able to write a code that gets the result, but I would like to know whether there are alternative and more efficient way to get the result.
I'm not going to check your code. I know you can do it yourself.
Just for study of methods, though my results don't fit your WANT results, I attach here an
absolutely different approach. It is more similar to the last table I posted:
Category | Group | Industry | FirmID | Count_Mgrs | Count 2+ | Control | Industry e.g. | |
A | any | any | any | 1 | 0 | 2124 | ||
B | any | any | any | 2+ | see below | 1842 | ||
C | any | any | max=1 | 1 | 1233 | |||
D | any | any | max=2+ | 1 | 0 | 2832 | ||
E | any | any | max=2+ | 2+ | 1 | 1842 |
Pay attention to: my notes, especially the red emphasized:
DATA have;
INPUT Mgrno Group ID Industry Source $; /* line updated */
drop source; /* no need as it is equivalent to Group-Mgrno */
DATALINES;
9999 1 00462610 1842 A
9999 1 01737210 1842 A
9999 1 12690310 2832 A
8888 1 12690310 2832 T
8888 1 12690311 2832 T
8888 1 37158620 2124 T
9999 1 23298321 4238 A
9999 1 53228424 4238 A
8888 1 53228424 4238 T
8888 1 23298321 4238 T
9999 1 12290311 1233 A
8888 1 01737211 1233 T
8888 1 99998881 1233 T
7777 2 12690310 2832 A
3333 2 32690311 2832 T
7777 2 37158620 2124 A
7777 2 38138222 3321 A
3333 2 44134422 3321 T
7777 2 53228424 4238 A
3333 2 77233421 4238 T
3333 2 12290311 1233 T
3333 2 93234320 1333 T
;
RUN;
proc sort data=have;
by Group Industry ID MgrnO;
quit;
data keys (keep=Group Industry ID MgrnO)
temp (drop=c_mgr_ind)
c_per_id (drop=mgrno c_mgr_ind control)
c_per_ind(drop=mgrno ID c_mgr_ID ID control)
;
format Group Industry ID MgrnO; /* force order of variables */
set have;
by Group Industry ID;
retain c_mgr_ID c_mgr_ind;
output keys;
if first.Industry then do;
c_mgr_ID=0; /* count managers per ID */
c_mgr_ind=0; /* Count managers per Industry */
end;
if first.ID and last.ID
then do;
c_mgr_ID=1;
c_mgr_ind +1;
control=0;
output temp;
end;
else do;
c_mgr_ID +1;
if last.ID then do;
output c_per_id;
c_mgr_ind + c_mgr_ID;
c_mgr_ID =0;
end;
end;
if last.industry then output c_per_ind;
run;
data temp1;
merge keys temp(in=intmp);
by Group Industry ID Mgrno;
run;
data temp2;
merge temp1(in=intmp) c_per_id;
by Group Industry ID;
if not intmp then control=.;
run;
data temp3;
merge temp2
c_per_ind;
by Group Industry;
run;
data want_new;
set temp3;
if control=. then do;
if c_mgr_ind = 1 then control=1; else
if int(c_mgr_ind/2) le 1 then control=0; else /* zero or one pair only */
if c_mgr_ind ge 4 then control=1; else control=0;
end;
run;
/*========= checking/comparing to wanted results =======================*/
DATA want;
INPUT Mgrno Group ID Industry Control;
DATALINES;
9999 1 00462610 1842 0
9999 1 01737210 1842 0
9999 1 12690310 2832 0
8888 1 12690310 2832 0
8888 1 12690311 2832 0
8888 1 37158620 2124 0
9999 1 23298321 4238 1
9999 1 53228424 4238 1
8888 1 53228424 4238 1
8888 1 23298321 4238 1
9999 1 12290311 1233 1
8888 1 01737211 1233 1
8888 1 99998881 1233 1
7777 2 12690310 2832 1
3333 2 32690311 2832 1
7777 2 37158620 2124 0
7777 2 38138222 3321 1
3333 2 44134422 3321 1
7777 2 53228424 4238 1
3333 2 77233421 4238 1
3333 2 12290311 1233 0
3333 2 93234320 1333 0
;
RUN;
proc sort data=want;
by Group Industry ID MgrnO;
quit;
proc sort data=want_new;
by Group Industry ID MgrnO;
quit;
proc compare data=want compare=want_new criteria=1.0;
ID Group Industry ID MgrnO;
format ID 9.;
run;
data diff;
merge want (in=in1)
want_new(in=in2 rename=(control=c_control));
by Group Industry ID MgrnO;
if in1 and not in2 then stat='IN1'; else
if in2 and not in1 then stat='IN2';
if not (in1 and in2);
run;
Here is what I have:
DATA have;
INPUT Mgrno Group ID Industry Source $;
DATALINES;
9999 1 00462610 1842 A
9999 1 01737210 1842 A
9999 1 12690310 2832 A
8888 1 12690310 2832 T
8888 1 12690311 2832 T
8888 1 37158620 2124 T
9999 1 23298321 4238 A
9999 1 53228424 4238 A
8888 1 53228424 4238 T
8888 1 23298321 4238 T
9999 1 12290311 1233 A
8888 1 01737211 1233 T
8888 1 99998881 1233 T
7777 2 12690310 2832 A
3333 2 32690311 2832 T
7777 2 37158620 2124 A
7777 2 38138222 3321 A
3333 2 44134422 3321 T
7777 2 53228424 4238 A
3333 2 77233421 4238 T
3333 2 12290311 1233 T
3333 2 93234320 1333 T
;
RUN;
Here is the want dataset:
DATA want;
INPUT Mgrno Group ID Industry Control;
DATALINES;
9999 1 00462610 1842 0
9999 1 01737210 1842 0
9999 1 12690310 2832 0
8888 1 12690310 2832 0
8888 1 12690311 2832 0
8888 1 37158620 2124 0
9999 1 23298321 4238 1
9999 1 53228424 4238 1
8888 1 53228424 4238 1
8888 1 23298321 4238 1
9999 1 12290311 1233 1
8888 1 01737211 1233 1
8888 1 99998881 1233 1
7777 2 12690310 2832 1
3333 2 32690311 2832 1
7777 2 37158620 2124 0
7777 2 38138222 3321 1
3333 2 44134422 3321 1
7777 2 53228424 4238 1
3333 2 77233421 4238 1
3333 2 12290311 1233 0
3333 2 93234320 1333 0
;
RUN;
Here is my code:
proc sort data=work.have nouniquekeys uniqueout=singles out=dup;
by group id;
quit;
proc sort data=work.have out=list_noduplicates nodupkey;
by group id;
quit;
data both;
set dup;
cont=1;
run;
/*The firms controlled by both managers are control*/
proc sql;
create table distinct_both as
select distinct group, industry, 1 as control
from both;
quit;
/*Here I am selecting the Managers with a Source=A since I will compare the
firms this Manager controls with the firms the other Manager controls. T
stands for Target and A for Acquirer.*/
data all_acq;
set work.list_noduplicates;
if Source='A';
run;
/*Here I am selecting the Managers with a Source=T since I will compare the
firms this Manager controls with the firms the other Manager controls*/
data all_tar;
set work.list_noduplicates;
if Source='T';
run;
/*Distinct cases start here, first compare distinct industries of target to
all industries of the acquirer*/
data distinct_tar;
set singles;
if source='T';
distinct_tar=1;
industry_target=industry;
cusip_tar=ncusip;
run;
proc sql;
create table treated_tar_V1 as
select *
from all_acq as a
left join distinct_tar as b
on a.industry=b.industry_target and a.group=b.group;
quit;
/*These are some of the treated firms*/
data treated_tar_V2(keep=group industry);
set treated_tar_V1;
if industry_target^=.;
run;
proc sort data=treated_tar_V2 out=distinct_target nodupkey;
by group industry;
run;
/*Second compare distinct firms controlled only by acquirer to all firms controlled by target*/
data distinct_acq;
set singles;
if source='A';
distinct_acq=1;
industry_acq=industry;
cusip_acq=ncusip;
run;
proc sql;
create table treated_acq_V1 as
select *
from all_tar as a
left join distinct_acq as b
on a.industry=b.industry_acq and a.group=b.group;
quit;
/*These are the treated firms*/
data treated_acq_V2(keep=group industry);
set treated_acq_V1;
if industry_acq^=.;
run;
proc sort data=treated_acq_V2 nodupkey;
by group industry;
run;
data all_treated(keep=group industry treated);
set distinct_both treated_tar_V2 treated_acq_V2;
treated=1;
run;
proc sql;
create table all_treated_distinct as
select distinct group, industry, 1 as treated
from all_treated;
quit;
You can combine the next two steps:
/*Here I am selecting the Managers with a Source=A since I will compare the
firms this Manager controls with the firms the other Manager controls. T
stands for Target and A for Acquirer.*/
data all_acq;
set work.list_noduplicates;
if Source='A';
run;
/*Here I am selecting the Managers with a Source=T since I will compare the
firms this Manager controls with the firms the other Manager controls*/
data all_tar;
set work.list_noduplicates;
if Source='T';
run;
into one step:
data all_acq all_tar;
set work.list_noduplicates;
if Source='A' then output all_acq; else
if Source='T' then output all_tar; else put '**Error: ' source=;
run;
I shall try later to simplify the code, as much as I succeed.
Your replies are always very helpful, @Shmuel. Thank you. I think the same result may be obtained by looking at a given group and selecting the industries where there are only two managers. It actually is much simpler this way than comparing the industries of the two managers within a given group. I will try to write a SQL code to try it out.
@Shmuel, I just found an alternative way that mainly relies on proc summary and proc sql. This code is much easier to follow. First, I identify the industries where there is only one manager, and then identify the industries where there is only one firm. It's actually pretty simple this way although I went backwards by identifying the treated=0 (or control=0) group. The other codes identifies the treated=1 group and both ways produce the same result, which is great! Made me quite happy. 🙂
Update: Apparently, I should have not been so quick with being happy about the code. It turns out that when I use the code below in the actual data, I get very different results. The logic of the original code is exactly as the one in the diagram. In the code below, I tried a different way, but it turns out that this did not work well. I am still not sure where the difference is though.
data have;
set have;
run;
/*Counting the number of firms within a given group-industry pair*/
proc summary nway;
class group industry;
var id;
output out=count_firms(drop=_:)
n=count_firms;
run;
/*Adding it back to the original sample*/
proc sql;
create table work.have1 as
select *
from work.have as a
left join work.count_firms as b
on a.group=b.group & a.industry=b.industry;
quit;
proc sort data=have1 out=dup nodupkey;
by group industry mgrno;
quit;
/*Counting the number of distinct managers within a given group-industry pair*/
proc summary nway;
class group industry;
var MGRNO;
output out=count_managers(drop=_:)
n=count_managers;
run;
/*Adding it back to the original sample*/
proc sql;
create table work.output_V2 as
select *
from work.have1 as a
left join work.count_managers as b
on a.group=b.group & a.industry=b.industry;
quit;
/*Whenever there is a single manager or whenever there is only one firm in a given group-industry , treated=0*/
data work.output_V3;
set work.output_V2;
if count_firms=1 or count_managers=1 then treated=0; else treated=1;
run;
proc sql;
create table all_treated_distinct1 as
select distinct group, industry, treated
from output_V3
where treated=1;
quit;
proc sql;
create table want as
select *
from have as a
left join all_treated_distinct1 as b
on (a.group=b.group) & (a.industry=b.industry);
quit;
proc stdize data = work.want out=work.want reponly missing=0;
var treated;
run;
I'm not going to check your code. I know you can do it yourself.
Just for study of methods, though my results don't fit your WANT results, I attach here an
absolutely different approach. It is more similar to the last table I posted:
Category | Group | Industry | FirmID | Count_Mgrs | Count 2+ | Control | Industry e.g. | |
A | any | any | any | 1 | 0 | 2124 | ||
B | any | any | any | 2+ | see below | 1842 | ||
C | any | any | max=1 | 1 | 1233 | |||
D | any | any | max=2+ | 1 | 0 | 2832 | ||
E | any | any | max=2+ | 2+ | 1 | 1842 |
Pay attention to: my notes, especially the red emphasized:
DATA have;
INPUT Mgrno Group ID Industry Source $; /* line updated */
drop source; /* no need as it is equivalent to Group-Mgrno */
DATALINES;
9999 1 00462610 1842 A
9999 1 01737210 1842 A
9999 1 12690310 2832 A
8888 1 12690310 2832 T
8888 1 12690311 2832 T
8888 1 37158620 2124 T
9999 1 23298321 4238 A
9999 1 53228424 4238 A
8888 1 53228424 4238 T
8888 1 23298321 4238 T
9999 1 12290311 1233 A
8888 1 01737211 1233 T
8888 1 99998881 1233 T
7777 2 12690310 2832 A
3333 2 32690311 2832 T
7777 2 37158620 2124 A
7777 2 38138222 3321 A
3333 2 44134422 3321 T
7777 2 53228424 4238 A
3333 2 77233421 4238 T
3333 2 12290311 1233 T
3333 2 93234320 1333 T
;
RUN;
proc sort data=have;
by Group Industry ID MgrnO;
quit;
data keys (keep=Group Industry ID MgrnO)
temp (drop=c_mgr_ind)
c_per_id (drop=mgrno c_mgr_ind control)
c_per_ind(drop=mgrno ID c_mgr_ID ID control)
;
format Group Industry ID MgrnO; /* force order of variables */
set have;
by Group Industry ID;
retain c_mgr_ID c_mgr_ind;
output keys;
if first.Industry then do;
c_mgr_ID=0; /* count managers per ID */
c_mgr_ind=0; /* Count managers per Industry */
end;
if first.ID and last.ID
then do;
c_mgr_ID=1;
c_mgr_ind +1;
control=0;
output temp;
end;
else do;
c_mgr_ID +1;
if last.ID then do;
output c_per_id;
c_mgr_ind + c_mgr_ID;
c_mgr_ID =0;
end;
end;
if last.industry then output c_per_ind;
run;
data temp1;
merge keys temp(in=intmp);
by Group Industry ID Mgrno;
run;
data temp2;
merge temp1(in=intmp) c_per_id;
by Group Industry ID;
if not intmp then control=.;
run;
data temp3;
merge temp2
c_per_ind;
by Group Industry;
run;
data want_new;
set temp3;
if control=. then do;
if c_mgr_ind = 1 then control=1; else
if int(c_mgr_ind/2) le 1 then control=0; else /* zero or one pair only */
if c_mgr_ind ge 4 then control=1; else control=0;
end;
run;
/*========= checking/comparing to wanted results =======================*/
DATA want;
INPUT Mgrno Group ID Industry Control;
DATALINES;
9999 1 00462610 1842 0
9999 1 01737210 1842 0
9999 1 12690310 2832 0
8888 1 12690310 2832 0
8888 1 12690311 2832 0
8888 1 37158620 2124 0
9999 1 23298321 4238 1
9999 1 53228424 4238 1
8888 1 53228424 4238 1
8888 1 23298321 4238 1
9999 1 12290311 1233 1
8888 1 01737211 1233 1
8888 1 99998881 1233 1
7777 2 12690310 2832 1
3333 2 32690311 2832 1
7777 2 37158620 2124 0
7777 2 38138222 3321 1
3333 2 44134422 3321 1
7777 2 53228424 4238 1
3333 2 77233421 4238 1
3333 2 12290311 1233 0
3333 2 93234320 1333 0
;
RUN;
proc sort data=want;
by Group Industry ID MgrnO;
quit;
proc sort data=want_new;
by Group Industry ID MgrnO;
quit;
proc compare data=want compare=want_new criteria=1.0;
ID Group Industry ID MgrnO;
format ID 9.;
run;
data diff;
merge want (in=in1)
want_new(in=in2 rename=(control=c_control));
by Group Industry ID MgrnO;
if in1 and not in2 then stat='IN1'; else
if in2 and not in1 then stat='IN2';
if not (in1 and in2);
run;
When we use different code we often get different results. It is not easy to be a programmer and not easy to understand why. You need mathematitian logic to understand why. If results are equal then either are both OK or both with same logic error.
You are certainly right, @Shmuel. I agree with you that having the correct logic is key. I think the first code should be correct since in the other one I may have not deleted some firms that were supposed to be deleted. So the logic in the second one may be incorrect.
That is beyond SAS and I believe you are able to struggle the logic better than me.
You are right that the logic factor is beyong SAS, but I would argue that your better at logic than me. 🙂 Plus, your help with finding alternative ways to solve the issues has been very helpful. You are honestly a great SAS mentor and teacher! I benefit a lot from your comments.
@Shmuel, I think the earlier code works correctly so we can ignore the second (simplified) code since it produces incorrect results. I think I am failing to exclude cases that should not be in the final sample. It seems like identifying directly the treatment group is a safer way compared to deleting the control group to identify the treatment (indirect way).
If you prefer and have not much time, I can accept your modification (i.e., simplification) of my code as the answer or just simply wait until you may have something. A colleague of mine wrote another code that produces almost the same outcome as the first code (only off by one observation).
To be frank, I have to state that without your comments I would have not been able to go to the correct direction. So your earlier replies have really helped out to come up with these codes! Thank you!!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.