Solved: Re: Identifying treatment groups

Yegen · Posted 06-18-2017 03:18 AM

I am trying to define a treatment group based on the following relationship.

What's going on in the diagram?

This diagram illustrates the relationship between two Managers. The analysis is at the Group level and there are only two managers within one group.
The big black colored cirlce contains all the firms controlled by that specific manager.
The smaller circles in the big circle are divided based on their industry codes.
Within these small circles, all the firms that belong to that industry are included.
The relationships illustrate whether the firms should be compared or not.

What am I trying to do?

I am comparing the small circles of one Manager to the same circles that the other Manager may or may not control.
Whenever there are comparable industries, then start go to Step 3. If not, then ignore that industry and assign treated=0 for all rows with that particular industry number and group number (all analysis is at the group level).
Check whether the two managers are controlling the same firm, if yes then assign treated =1.
If all firms are different (distinct) in that specific industry (i.e., no firm is controlled by both managers), then still assign treated = 1.
So for Step 3 and 4, it is basically, whenever both managers control firms in the same industry, then assign 1 as treated for all.

What did I do in the code?

Partition all firms in both A and T (I define A as one manager and T as the other, where A means acquiring manager and T means Targeted manager), in A only, and in B only. For those in both A and T, treated=1 for all firms in this subgroup.
For firms in A ONLY, merge its industry with ALL industries that T covers. If merged, treated=1.
Symmetrically, for firms in T ONLY, merge its industry with ALL industries that A covers, if merged, treated=1.
All other firms get treated=0.

Below is what I have done so far as well as my sample. This is an extension to a previous problem that @Shmuel and @gamotte have provided extensive help with. I am posting this as a new question since defining the control and treatment group is often used by researchers as well as practitioners and alternative codes may help others as well. I was able to write a code that gets the result, but I would like to know whether there are alternative and more efficient way to get the result.

Shmuel · Posted 06-19-2017 10:09 AM

I'm not going to check your code. I know you can do it yourself.

Just for study of methods, though my results don't fit your WANT results, I attach here an

absolutely different approach. It is more similar to the last table I posted:

Category	Group	Industry	FirmID	Count_Mgrs	Count 2+	Control	Industry e.g.
A	any	any	any	1		0	2124
B	any	any	any	2+		see below	1842
C	any	any		max=1		1	1233
D	any	any		max=2+	1	0	2832
E	any	any		max=2+	2+	1	1842

Pay attention to: my notes, especially the red emphasized:

DATA have;
    INPUT Mgrno Group ID Industry Source $;  /* line updated */
    drop source;  /* no need as it is equivalent to Group-Mgrno */
    DATALINES; 
    9999 1 00462610 1842 A
    9999 1 01737210 1842 A
    9999 1 12690310 2832 A
    8888 1 12690310 2832 T
    8888 1 12690311 2832 T
    8888 1 37158620 2124 T
    9999 1 23298321 4238 A
    9999 1 53228424 4238 A
    8888 1 53228424 4238 T
    8888 1 23298321 4238 T
    9999 1 12290311 1233 A
    8888 1 01737211 1233 T
    8888 1 99998881 1233 T
    7777 2 12690310 2832 A
    3333 2 32690311 2832 T
    7777 2 37158620 2124 A
    7777 2 38138222 3321 A
    3333 2 44134422 3321 T
    7777 2 53228424 4238 A
    3333 2 77233421 4238 T
    3333 2 12290311 1233 T
    3333 2 93234320 1333 T
;
RUN;

proc sort data=have;
    by Group Industry ID MgrnO;
quit;

data keys (keep=Group Industry ID MgrnO)
     temp (drop=c_mgr_ind)
     c_per_id (drop=mgrno c_mgr_ind control)
     c_per_ind(drop=mgrno ID c_mgr_ID ID control)
   ;
  format Group Industry ID MgrnO; /* force order of variables */
  set have;
    by Group Industry ID;
       retain c_mgr_ID c_mgr_ind;
       output keys;
       if first.Industry then do; 
          c_mgr_ID=0;    /* count managers per ID */
          c_mgr_ind=0;   /* Count managers per Industry */
       end;
       if first.ID and last.ID 
       then do;
          c_mgr_ID=1; 
          c_mgr_ind +1;
          control=0; 
          output temp; 
       end; 
       else do;
          c_mgr_ID +1;
          if last.ID then do;
             output c_per_id; 
             c_mgr_ind + c_mgr_ID;
             c_mgr_ID =0;
          end;
       end;
       if last.industry then output c_per_ind;
run;


data temp1;
 merge keys temp(in=intmp);
 by Group Industry ID Mgrno;
run;

data temp2;
 merge temp1(in=intmp) c_per_id;
 by Group Industry ID;
    if not intmp then control=.;
run;
 
data temp3;
 merge temp2 
       c_per_ind;
 by Group Industry;
run;

data want_new;
 set temp3;
    if control=. then do;
       if c_mgr_ind = 1 then control=1; else
       if int(c_mgr_ind/2) le 1 then control=0; else  /* zero or one pair only */
       if c_mgr_ind ge 4 then control=1; else control=0;
    end;
run; 

/*========= checking/comparing to wanted results =======================*/
DATA want;
 INPUT Mgrno Group ID Industry Control;
 DATALINES; 
 9999 1 00462610 1842 0
 9999 1 01737210 1842 0
 9999 1 12690310 2832 0
 8888 1 12690310 2832 0 
 8888 1 12690311 2832 0 
 8888 1 37158620 2124 0 
 9999 1 23298321 4238 1 
 9999 1 53228424 4238 1 
 8888 1 53228424 4238 1
 8888 1 23298321 4238 1 
 9999 1 12290311 1233 1 
 8888 1 01737211 1233 1 
 8888 1 99998881 1233 1
 7777 2 12690310 2832 1 
 3333 2 32690311 2832 1 
 7777 2 37158620 2124 0 
 7777 2 38138222 3321 1
 3333 2 44134422 3321 1 
 7777 2 53228424 4238 1 
 3333 2 77233421 4238 1 
 3333 2 12290311 1233 0 
 3333 2 93234320 1333 0 
;
RUN;   
       
proc sort data=want;
    by Group Industry ID MgrnO;
quit;

proc sort data=want_new;
    by Group Industry ID MgrnO;
quit;

proc compare data=want compare=want_new criteria=1.0;
 ID Group Industry ID MgrnO;
 format ID 9.;
run;

data diff;
 merge want (in=in1) 
       want_new(in=in2 rename=(control=c_control));
 by Group Industry ID MgrnO;
   if in1 and not in2 then stat='IN1'; else
   if in2 and not in1 then stat='IN2'; 
   if not (in1 and in2);
run;

View solution in original post

Yegen · Posted 06-18-2017 03:20 AM

Here is what I have:

DATA have;
	INPUT Mgrno Group ID Industry Source $;
	DATALINES; 
	9999 1 00462610 1842 A
	9999 1 01737210 1842 A
	9999 1 12690310 2832 A
	8888 1 12690310 2832 T
	8888 1 12690311 2832 T
	8888 1 37158620 2124 T
	9999 1 23298321 4238 A
	9999 1 53228424 4238 A
	8888 1 53228424 4238 T
	8888 1 23298321 4238 T
	9999 1 12290311 1233 A
	8888 1 01737211 1233 T
	8888 1 99998881 1233 T
	7777 2 12690310 2832 A
	3333 2 32690311 2832 T
	7777 2 37158620 2124 A
	7777 2 38138222 3321 A
	3333 2 44134422 3321 T
	7777 2 53228424 4238 A
	3333 2 77233421 4238 T
	3333 2 12290311 1233 T
	3333 2 93234320 1333 T
;
RUN;

Here is the want dataset:

DATA want;
 INPUT Mgrno Group ID Industry Control;
 DATALINES; 
 9999 1 00462610 1842 0
 9999 1 01737210 1842 0
 9999 1 12690310 2832 0
 8888 1 12690310 2832 0 
 8888 1 12690311 2832 0 
 8888 1 37158620 2124 0 
 9999 1 23298321 4238 1 
 9999 1 53228424 4238 1 
 8888 1 53228424 4238 1
 8888 1 23298321 4238 1 
 9999 1 12290311 1233 1 
 8888 1 01737211 1233 1 
 8888 1 99998881 1233 1
 7777 2 12690310 2832 1 
 3333 2 32690311 2832 1 
 7777 2 37158620 2124 0 
 7777 2 38138222 3321 1
 3333 2 44134422 3321 1 
 7777 2 53228424 4238 1 
 3333 2 77233421 4238 1 
 3333 2 12290311 1233 0 
 3333 2 93234320 1333 0 
;
RUN;

Here is my code:

proc sort data=work.have nouniquekeys uniqueout=singles out=dup;
	by group id;
quit;


proc sort data=work.have out=list_noduplicates nodupkey;
	by group id;
quit;



data both;
	set dup;
	cont=1;
run;

/*The firms controlled by both managers are control*/
proc sql;
	create table distinct_both as
	select distinct group, industry, 1 as control
	from both;
quit;


/*Here I am selecting the Managers with a Source=A since I will compare the 
firms this Manager controls with the firms the other Manager controls. T
stands for Target and A for Acquirer.*/
data all_acq;
	set work.list_noduplicates;
	if Source='A';
run;

/*Here I am selecting the Managers with a Source=T since I will compare the 
firms this Manager controls with the firms the other Manager controls*/
data all_tar;
	set work.list_noduplicates;
	if Source='T';
run;



/*Distinct cases start here, first compare distinct industries of target to 
all industries of the acquirer*/

data distinct_tar;
	set singles;
	if source='T';
	distinct_tar=1;
	industry_target=industry;
	cusip_tar=ncusip;
run;

proc sql;
	create table treated_tar_V1 as 
	select * 
	from all_acq as a 
	left join distinct_tar as b 
	on a.industry=b.industry_target and a.group=b.group;
quit;

/*These are some of the treated firms*/
data treated_tar_V2(keep=group industry);
	set treated_tar_V1;
	if industry_target^=.;
run;

proc sort data=treated_tar_V2 out=distinct_target nodupkey;
	by group industry;
run;

/*Second compare distinct firms controlled only by acquirer to all firms controlled by target*/

data distinct_acq;
	set singles;
	if source='A';
	distinct_acq=1;
	industry_acq=industry;
	cusip_acq=ncusip;
run;

proc sql;
	create table treated_acq_V1 as 
	select * 
	from all_tar as a 
	left join distinct_acq as b 
	on a.industry=b.industry_acq and a.group=b.group;
quit;

/*These are the treated firms*/
data treated_acq_V2(keep=group industry);
	set treated_acq_V1;
	if industry_acq^=.;
run;

proc sort data=treated_acq_V2 nodupkey;
	by group industry;
run;


data all_treated(keep=group industry treated);
	set distinct_both treated_tar_V2 treated_acq_V2;
	treated=1;
run;


proc sql;
	create table all_treated_distinct as 
	select distinct group, industry, 1 as treated
	from all_treated;
quit;

Shmuel · Posted 06-18-2017 06:07 AM

You can combine the next two steps:

/*Here I am selecting the Managers with a Source=A since I will compare the 
firms this Manager controls with the firms the other Manager controls. T
stands for Target and A for Acquirer.*/
data all_acq;
	set work.list_noduplicates;
	if Source='A';
run;

/*Here I am selecting the Managers with a Source=T since I will compare the 
firms this Manager controls with the firms the other Manager controls*/
data all_tar;
	set work.list_noduplicates;
	if Source='T';
run;

into one step:


data all_acq   all_tar;
	set work.list_noduplicates;
	if Source='A' then output all_acq; else
        if Source='T' then output all_tar;   else put '**Error: ' source=;
run;

I shall try later to simplify the code, as much as I succeed.

Yegen · Posted 06-18-2017 07:29 PM

Your replies are always very helpful, @Shmuel. Thank you. I think the same result may be obtained by looking at a given group and selecting the industries where there are only two managers. It actually is much simpler this way than comparing the industries of the two managers within a given group. I will try to write a SQL code to try it out.

Yegen · Posted 06-19-2017 12:56 AM

@Shmuel, I just found an alternative way that mainly relies on proc summary and proc sql. This code is much easier to follow. First, I identify the industries where there is only one manager, and then identify the industries where there is only one firm. It's actually pretty simple this way although I went backwards by identifying the treated=0 (or control=0) group. The other codes identifies the treated=1 group and both ways produce the same result, which is great! Made me quite happy. 🙂

Update: Apparently, I should have not been so quick with being happy about the code. It turns out that when I use the code below in the actual data, I get very different results. The logic of the original code is exactly as the one in the diagram. In the code below, I tried a different way, but it turns out that this did not work well. I am still not sure where the difference is though.

data have;
	set have;
run;

/*Counting the number of firms within a given group-industry pair*/
proc summary nway;
	class group industry;
	var id;
	output out=count_firms(drop=_:) 
	n=count_firms;
run;

/*Adding it back to the original sample*/
proc sql;
	create table work.have1 as 
	select * 
	from work.have as a 
	left join work.count_firms as b 
	on a.group=b.group & a.industry=b.industry;
quit;

proc sort data=have1 out=dup nodupkey;
	by group industry mgrno;
quit;

/*Counting the number of distinct managers within a given group-industry pair*/

proc summary nway;
	class group industry;
	var MGRNO;
	output out=count_managers(drop=_:) 
	n=count_managers;
run;

/*Adding it back to the original sample*/
proc sql;
	create table work.output_V2 as 
	select * 
	from work.have1 as a 
	left join work.count_managers as b 
	on a.group=b.group & a.industry=b.industry;
quit;

/*Whenever there is a single manager or whenever there is only one firm in a given group-industry , treated=0*/
data work.output_V3;
	set work.output_V2;
	if count_firms=1 or count_managers=1 then treated=0; else treated=1;
run;

proc sql;
	create table all_treated_distinct1 as 
	select distinct group, industry, treated
	from output_V3
	where treated=1;
quit;

proc sql;
	create table want as 
	select *
	from have as a 
	left join all_treated_distinct1 as b
	on (a.group=b.group) & (a.industry=b.industry);
quit;

proc stdize data = work.want out=work.want reponly missing=0;
	var treated;
run;

Shmuel · Posted 06-19-2017 10:09 AM

I'm not going to check your code. I know you can do it yourself.

Just for study of methods, though my results don't fit your WANT results, I attach here an

absolutely different approach. It is more similar to the last table I posted:

Category	Group	Industry	FirmID	Count_Mgrs	Count 2+	Control	Industry e.g.
A	any	any	any	1		0	2124
B	any	any	any	2+		see below	1842
C	any	any		max=1		1	1233
D	any	any		max=2+	1	0	2832
E	any	any		max=2+	2+	1	1842

Pay attention to: my notes, especially the red emphasized:

DATA have;
    INPUT Mgrno Group ID Industry Source $;  /* line updated */
    drop source;  /* no need as it is equivalent to Group-Mgrno */
    DATALINES; 
    9999 1 00462610 1842 A
    9999 1 01737210 1842 A
    9999 1 12690310 2832 A
    8888 1 12690310 2832 T
    8888 1 12690311 2832 T
    8888 1 37158620 2124 T
    9999 1 23298321 4238 A
    9999 1 53228424 4238 A
    8888 1 53228424 4238 T
    8888 1 23298321 4238 T
    9999 1 12290311 1233 A
    8888 1 01737211 1233 T
    8888 1 99998881 1233 T
    7777 2 12690310 2832 A
    3333 2 32690311 2832 T
    7777 2 37158620 2124 A
    7777 2 38138222 3321 A
    3333 2 44134422 3321 T
    7777 2 53228424 4238 A
    3333 2 77233421 4238 T
    3333 2 12290311 1233 T
    3333 2 93234320 1333 T
;
RUN;

proc sort data=have;
    by Group Industry ID MgrnO;
quit;

data keys (keep=Group Industry ID MgrnO)
     temp (drop=c_mgr_ind)
     c_per_id (drop=mgrno c_mgr_ind control)
     c_per_ind(drop=mgrno ID c_mgr_ID ID control)
   ;
  format Group Industry ID MgrnO; /* force order of variables */
  set have;
    by Group Industry ID;
       retain c_mgr_ID c_mgr_ind;
       output keys;
       if first.Industry then do; 
          c_mgr_ID=0;    /* count managers per ID */
          c_mgr_ind=0;   /* Count managers per Industry */
       end;
       if first.ID and last.ID 
       then do;
          c_mgr_ID=1; 
          c_mgr_ind +1;
          control=0; 
          output temp; 
       end; 
       else do;
          c_mgr_ID +1;
          if last.ID then do;
             output c_per_id; 
             c_mgr_ind + c_mgr_ID;
             c_mgr_ID =0;
          end;
       end;
       if last.industry then output c_per_ind;
run;


data temp1;
 merge keys temp(in=intmp);
 by Group Industry ID Mgrno;
run;

data temp2;
 merge temp1(in=intmp) c_per_id;
 by Group Industry ID;
    if not intmp then control=.;
run;
 
data temp3;
 merge temp2 
       c_per_ind;
 by Group Industry;
run;

data want_new;
 set temp3;
    if control=. then do;
       if c_mgr_ind = 1 then control=1; else
       if int(c_mgr_ind/2) le 1 then control=0; else  /* zero or one pair only */
       if c_mgr_ind ge 4 then control=1; else control=0;
    end;
run; 

/*========= checking/comparing to wanted results =======================*/
DATA want;
 INPUT Mgrno Group ID Industry Control;
 DATALINES; 
 9999 1 00462610 1842 0
 9999 1 01737210 1842 0
 9999 1 12690310 2832 0
 8888 1 12690310 2832 0 
 8888 1 12690311 2832 0 
 8888 1 37158620 2124 0 
 9999 1 23298321 4238 1 
 9999 1 53228424 4238 1 
 8888 1 53228424 4238 1
 8888 1 23298321 4238 1 
 9999 1 12290311 1233 1 
 8888 1 01737211 1233 1 
 8888 1 99998881 1233 1
 7777 2 12690310 2832 1 
 3333 2 32690311 2832 1 
 7777 2 37158620 2124 0 
 7777 2 38138222 3321 1
 3333 2 44134422 3321 1 
 7777 2 53228424 4238 1 
 3333 2 77233421 4238 1 
 3333 2 12290311 1233 0 
 3333 2 93234320 1333 0 
;
RUN;   
       
proc sort data=want;
    by Group Industry ID MgrnO;
quit;

proc sort data=want_new;
    by Group Industry ID MgrnO;
quit;

proc compare data=want compare=want_new criteria=1.0;
 ID Group Industry ID MgrnO;
 format ID 9.;
run;

data diff;
 merge want (in=in1) 
       want_new(in=in2 rename=(control=c_control));
 by Group Industry ID MgrnO;
   if in1 and not in2 then stat='IN1'; else
   if in2 and not in1 then stat='IN2'; 
   if not (in1 and in2);
run;

Shmuel · Posted 06-19-2017 10:12 AM

When we use different code we often get different results. It is not easy to be a programmer and not easy to understand why. You need mathematitian logic to understand why. If results are equal then either are both OK or both with same logic error.

Yegen · Posted 06-21-2017 02:18 PM

You are certainly right, @Shmuel. I agree with you that having the correct logic is key. I think the first code should be correct since in the other one I may have not deleted some firms that were supposed to be deleted. So the logic in the second one may be incorrect.

Shmuel · Posted 06-21-2017 02:29 PM

That is beyond SAS and I believe you are able to struggle the logic better than me.

Yegen · Posted 06-21-2017 05:55 PM

You are right that the logic factor is beyong SAS, but I would argue that your better at logic than me. 🙂 Plus, your help with finding alternative ways to solve the issues has been very helpful. You are honestly a great SAS mentor and teacher! I benefit a lot from your comments.

Yegen · Posted 06-23-2017 07:46 AM

@Shmuel, I think the earlier code works correctly so we can ignore the second (simplified) code since it produces incorrect results. I think I am failing to exclude cases that should not be in the final sample. It seems like identifying directly the treatment group is a safer way compared to deleting the control group to identify the treatment (indirect way).

If you prefer and have not much time, I can accept your modification (i.e., simplification) of my code as the answer or just simply wait until you may have something. A colleague of mine wrote another code that produces almost the same outcome as the first code (only off by one observation).

To be frank, I have to state that without your comments I would have not been able to go to the correct direction. So your earlier replies have really helped out to come up with these codes! Thank you!!

Registration is open

SAS Training: Just a Click Away