BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Yegen
Pyrite | Level 9

I have two datasets. The first dataset (i.e., have1) is from time period T, whereas the second dataset (i.e., have2) is from time period T+1. Some firms from period T are also in T+1. I am trying to construct groups after a right join merger (where have1 is on the right side). All of the analysis is done at the Merger_ID - Firm_ID level and SICH is the industry classification.  

What am I trying to do?

  1. I want to identify the firms that are both in period T and T+1 (i.e., in both have1 and have2). <- Easy SQL code for this step.
  2. I want to construct six groups as follows: <- Here in the second step, I am stuck.

                  Group A: In a given SICH, all firms from period T are also in period T+1.

                  Group B: In a given SICH, only firms with Source=``A” from period T are also in period T+1.

                  Group C: In a given SICH, only firms with Source=``T” from period T are also in period T+1.

                  Group 😧 In a given SICH, at least one firm with Source=”A” and at least one firm with Source=``T’’ from period T are also in period T+1.

                  Group E: In a given SICH, at least one firm with Source=”A” and no firm with Source=``T’’ from period T are also in period T+1.

                  Group 😧 In a given SICH, at least one firm with Source=”T” and no firm with Source=``A’’ from period T are also in period T+1.

 

Following the suggestion of @Shmuel to construct easy to follow diagrams, I have added a diagram that captures the idea of what I am trying to do. I am stuck on how to write the proper code (especially for Step 2), but will post shortly the code I have written so far. Below are the datasets as well.

Diagram


 

DATA PERIOD;
    INPUT Merger_ID Firm_ID SICH Source $ Treated; 
    DATALINES; 
    1 12690310 2832 T 1
    1 12690311 2832 A 1
    1 01737211 1233 T 1
    1 99998881 1233 A 1
    1 77998882 1233 T 1
    1 37158620 2124 A 0 
	1 21233212 3344 T 0
	1 53432422 3344 A 0
	1 99883211 4433 A 1
 	1 72323123 4433 A 1
 	1 31231232 4433 T 1
    1 11198881 7733 T 1
	1 21198882 7733 T 1
	1 31198883 7733 T 1
	1 55598882 7733 A 1
    1 73242348 3333 T 1
	1 83242348 3333 T 1
	1 81321273 3333 A 1
	1 99321273 3333 A 1
    2 11117211 1233 A 1
    2 11118881 1233 T 1
    2 18818881 1233 T 1    
    2 00462610 1842 A 1
    2 01737210 1842 T 1
   	2 12690310 2832 A 0
    2 99928111 2122 T 0
    2 53228424 2834 A 0
    2 23298321 2839 T 0
;
RUN;



DATA PERIOD1;
    INPUT Merger_ID Firm_ID SICH;  
    DATALINES; 
    1 99998881 1233   
    1 12690310 2832
    1 12690311 2832
    1 13217778 2832 
    1 00462610 1842 
    1 01737210 1842
	1 21233212 3344
	1 43324342 3344
	1 53312342 3344
    1 11198881 7733
	1 21198882 7733
    1 73242348 3333 
	1 81321273 3333 
	1 88812312 3333
	1 99883211 4433
    2 11117211 1233
    2 11118881 1233
    2 18818881 1233  
    2 32132122 1233
    2 73128821 4282
;
RUN;


    
        
DATA want;
    INPUT Merger_ID Firm_ID SICH Treated Source $ Group $;  
    DATALINES; 
    1 12690310 2832 1 T A
    1 12690311 2832 1 A A
    1 99998881 1233 1 A B     
	1 21233212 3344 1 T C	
    1 73242348 3333 1 T D
	1 81321273 3333 1 A D		
	1 99883211 4433 1 A E
	1 11198881 7733 1 T F
	1 21198882 7733 1 T F
	2 11117211 1233 1 A A
    2 11118881 1233 1 T A
    2 18818881 1233 1 T A
;
RUN;    
1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

No time to validate all, but this should be close to what you want, and if not should require little tweaking.

 

[Edit: I tested and fixed the code. It works as I expect, but a couple of lines differ from what you want.
 I don't understand how the results you request are derived.

 Please explain why the few lines I don't match are wrong.]

 


proc sql;
  select a.MERGER_ID, a.FIRM_ID ,a.SICH ,a.SOURCE, b.FIRM_ID 
        ,count(a.FIRM_ID)                     as CNT_IN_P
        ,sum(b.FIRM_ID ne .)                  as CNT_IN_P1
        ,sum(a.SOURCE='A')                    as CNTA_IN_P
        ,sum(a.SOURCE='A' and b.FIRM_ID ne .) as CNTA_IN_P1
        ,sum(a.SOURCE='T')                    as CNTT_IN_P
        ,sum(a.SOURCE='T' and b.FIRM_ID ne .) as CNTT_IN_P1
        ,case when calculated CNT_IN_P  =calculated CNT_IN_P1      then 'A' 
              when calculated CNTA_IN_P =calculated CNTA_IN_P1 
               and calculated CNTT_IN_P=0                          then 'B'
              when calculated CNTT_IN_P =calculated CNTT_IN_P1 
               and calculated CNTA_IN_P=0                          then 'C'
              when calculated CNTA_IN_P1 and calculated CNTT_IN_P1 
                                                                   then 'D'
              when calculated CNTA_IN_P1                           then 'E'
              when calculated CNTT_IN_P1                           then 'F'
              else                                                      'X'
         end as GROUP
    from PERIOD       a
           left join 
         PERIOD1      b
           on  a.FIRM_ID  = b.FIRM_ID 
           and a.SICH     = b.SICH
           and a.MERGER_ID= b.MERGER_ID
   group by a.MERGER_ID, a.SICH
   having calculated GROUP ne 'X' 
      and b.FIRM_ID ne .
  order by a.MERGER_ID, a.SICH, a.FIRM_ID;
quit;

 

 

View solution in original post

9 REPLIES 9
Yegen
Pyrite | Level 9

Below I am just adding the firms from "have2" (i.e., period T+1 firms) to "have1" (i.e., period T firms). Before that, I add an identifier in "have2" so that I can track which firms actually survive in "have2". I think that will help to construct the groups. But here I am stuck since I am not sure how to compare within a given industry (i.e., SICH) whether there are some missing "survivors." I will try out a code and submit a follow-up reply.

data have2;
	set have2;
	survive=1;
run;

proc sql;
	create table step1 as 
	select *
	from have1 as a 
	left join have2 as b 
	on (a.merger_id=b.merger_id) & (a.firm_id=b.firm_id);
quit;


proc stdize data = work.step1 out=work.step1 reponly missing=0;
	var survive;
run; 

Possibly something as follows may work, but I am not sure how to write the conditions in the "when statement". Please excuse that I am out of the "case" statement condition (I know I should have an "end as" statement), but just for illustrative purposes see below:

 

proc sql;
	create table step2 as 
	select *, case	
		when ... Group='A'                
		when ... Group='B'                
		...                
		when ... Group='E'		
		group by merger_id, sich
	from step1;
quit;
art297
Opal | Level 21

Please post the code you ran to obtain the 14 records selected for your want file.

 

Art, CEO, AnalystFinder.com

 

Yegen
Pyrite | Level 9

Thanks for your reply, @art297. I have manually identified those rows. However, the code below is what I have tried. 

Update 1: Thank to you @art297, I have just noticed that 3 rows in the previous "want" dataset were incorrectly identified. I have updated the "want" dataset. Sorry for this "human error". 

Update 2: I think there is a major bug after I have made a minor change in the code. Before making a few minor changes, I was able to get very close except of missing two groups. 

 

 

 


proc sql;
	create table step1 as 
	select *
	from have1 as a 
	right join have2 as b 
	on (a.merger_id=b.merger_id) & (a.firm_id=b.firm_id);
quit;

data step1;
	set step1;
	survive=1;
run;


proc stdize data = work.step1 out=work.step1 reponly missing=0;
	var survive;
run; 


*Now I will count the firms for a given Merger_ID-SICH, including only A and only T ones;

proc summary nway;
	class merger_id SICH;
	var firm_id;
	output out=count_all n=count_all;
run;

data only_t;
	set step1;
	if source='T';
run;

proc summary nway;
	class merger_id SICH;
	var firm_id;
	output out=count_t n=count_t;
run;

data only_a;
	set step1;
	if source='A';
run;

proc summary nway;
	class merger_id SICH;
	var firm_id;
	output out=count_a n=count_a;
run;

proc sql;
	create table step2_count_V1 as 
	select a.*, b.count_all
	from step1 as a 
	left join count_all as b 
	on (a.merger_id=b.merger_id) & (a.sich=b.sich);
quit;

proc sql;
	create table step2_count_V2 as 
	select a.*, b.count_t
	from step2_count_V1 as a 
	left join count_t as b 
	on (a.merger_id=b.merger_id) & (a.sich=b.sich);
quit;

proc sql;
	create table step2_count_V3 as 
	select a.*, b.count_a
	from step2_count_V2 as a 
	left join count_a as b 
	on (a.merger_id=b.merger_id) & (a.sich=b.sich);
quit;

proc stdize data = work.step2_count_V3 out=work.step2_count_V3 reponly missing=0;
	var count_all count_t count_a;
run; 



*Now I will count the surviving firms for a given Merger_ID-SICH, including only A and only T ones;
proc sql;
	create table have2_source as 
	select a.*, b.Source
	from have2 as a 
	left join have1 as b 
	on (a.merger_id=b.merger_id) & (a.firm_id=b.firm_id);
run;

data source_notmissing;
	set have2_source;
	if source^=' ';
run;
	

proc summary nway;
	class merger_id SICH;
	var firm_id;
	output out=count_surv_all n=count_surv_all;
run;

data only_surv_t;
	set source_notmissing;
	if source='T';
run;

proc summary nway;
	class merger_id SICH;
	var firm_id;
	output out=count_surv_t n=count_surv_t;
run;

data only_surv_a;
	set source_notmissing;
	if source='A';
run;

proc summary nway;
	class merger_id SICH;
	var firm_id;
	output out=count_surv_a n=count_surv_a;
run;

proc sql;
	create table surv_count_V1 as 
	select a.*, b.count_surv_all
	from source_notmissing as a 
	left join count_surv_all as b 
	on (a.merger_id=b.merger_id) & (a.sich=b.sich);
quit;

proc sql;
	create table surv_count_V2 as 
	select a.*, b.count_surv_t
	from surv_count_V1 as a 
	left join count_surv_t as b 
	on (a.merger_id=b.merger_id) & (a.sich=b.sich);
quit;

proc sql;
	create table surv_count_V3 as 
	select a.*, b.count_surv_a
	from surv_count_V2 as a 
	left join count_surv_a as b 
	on (a.merger_id=b.merger_id) & (a.sich=b.sich);
quit;

proc stdize data = work.surv_count_V3 out=work.surv_count_V3 reponly missing=0;
	var count_surv_all count_surv_t count_surv_a;
run; 



proc sql;
	create table step2_with_counts as 
	select *
	from step2_count_V3 as a 
	left join surv_count_V3 as b
	on (a.merger_id=b.merger_id) & (a.firm_id=b.firm_id) & (a.source=b.source);
quit;

data step3;
	set step2_with_counts;
	if survive=1;
run;


proc sql;
	create table final as
	select *, case 
		when (count_all=count_surv_all) then 'A'
		when (count_all^=count_surv_all) & (count_a=count_surv_a) then 'B'
		when (count_all^=count_surv_all) & (count_t=count_surv_t) then 'C'
		when (count_all^=count_surv_all) & count_surv_t>0 & count_surv_a>0 then 'D'
		when (count_all^=count_surv_all) & count_surv_t=0 & count_surv_a>0 then 'E'
		when (count_all^=count_surv_all) & count_surv_t>0 & count_surv_a=0 then 'F'
		end as Group
	from step3 
	group by merger_id, SICH;
quit;
art297
Opal | Level 21

Why was the following record selected?:

1 99883211 4433 1 A E

 

Art, CEO, AnalystFinder.com

 

Yegen
Pyrite | Level 9

Thanks for this question, @art297. After deleting one observation from have2 (the other firm_id in the same industry) that I copied accidently to follow the diagram (I was off by two observations as you have pointed out, so a big thank you to you!), I think it would make sense now. I have updated the the two datasets above and verified all rows again. Now I think all looks good. Sorry for the confusion, but you are right that with the earlier dataset, firm_id = 99883211 should have not been assigned to "E". I wanted to cover all possible cases by creating a toy sample, but that was one of the bugs that I just fixed. Now, I think it would make sense with the above sample. 

ChrisNZ
Tourmaline | Level 20

No time to validate all, but this should be close to what you want, and if not should require little tweaking.

 

[Edit: I tested and fixed the code. It works as I expect, but a couple of lines differ from what you want.
 I don't understand how the results you request are derived.

 Please explain why the few lines I don't match are wrong.]

 


proc sql;
  select a.MERGER_ID, a.FIRM_ID ,a.SICH ,a.SOURCE, b.FIRM_ID 
        ,count(a.FIRM_ID)                     as CNT_IN_P
        ,sum(b.FIRM_ID ne .)                  as CNT_IN_P1
        ,sum(a.SOURCE='A')                    as CNTA_IN_P
        ,sum(a.SOURCE='A' and b.FIRM_ID ne .) as CNTA_IN_P1
        ,sum(a.SOURCE='T')                    as CNTT_IN_P
        ,sum(a.SOURCE='T' and b.FIRM_ID ne .) as CNTT_IN_P1
        ,case when calculated CNT_IN_P  =calculated CNT_IN_P1      then 'A' 
              when calculated CNTA_IN_P =calculated CNTA_IN_P1 
               and calculated CNTT_IN_P=0                          then 'B'
              when calculated CNTT_IN_P =calculated CNTT_IN_P1 
               and calculated CNTA_IN_P=0                          then 'C'
              when calculated CNTA_IN_P1 and calculated CNTT_IN_P1 
                                                                   then 'D'
              when calculated CNTA_IN_P1                           then 'E'
              when calculated CNTT_IN_P1                           then 'F'
              else                                                      'X'
         end as GROUP
    from PERIOD       a
           left join 
         PERIOD1      b
           on  a.FIRM_ID  = b.FIRM_ID 
           and a.SICH     = b.SICH
           and a.MERGER_ID= b.MERGER_ID
   group by a.MERGER_ID, a.SICH
   having calculated GROUP ne 'X' 
      and b.FIRM_ID ne .
  order by a.MERGER_ID, a.SICH, a.FIRM_ID;
quit;

 

 

Yegen
Pyrite | Level 9

@ChrisNZ, thank you!!! That was such a helpful reply, the code works very well after chaning just a few lines. Would you mind updating your code above so that I can accept it as the answer since with the modification your code identifies the correct groups.  Here is your code with a few (really) minor changes:

proc sql;
  create table outcome as
  select a.MERGER_ID, a.FIRM_ID ,a.SICH ,a.SOURCE, b.FIRM_ID 
        ,count(a.FIRM_ID)                     as CNT_IN_P
        ,sum(b.FIRM_ID ne .)                  as CNT_IN_P1
        ,sum(a.SOURCE='A')                    as CNTA_IN_P
        ,sum(a.SOURCE='A' and b.FIRM_ID ne .) as CNTA_IN_P1
        ,sum(a.SOURCE='T')                    as CNTT_IN_P
        ,sum(a.SOURCE='T' and b.FIRM_ID ne .) as CNTT_IN_P1
        ,case when calculated CNT_IN_P  =calculated CNT_IN_P1      then 'A' 
              
              when calculated CNTA_IN_P =calculated CNTA_IN_P1 
               and calculated CNTT_IN_P1=0                          then 'B'
              
              when calculated CNTT_IN_P =calculated CNTT_IN_P1 
               and calculated CNTA_IN_P1=0                          then 'C'
               
              when calculated CNT_IN_P > calculated CNT_IN_P1 
               and calculated CNTA_IN_P1 > 0 and calculated CNTT_IN_P1 > 0
                                                                   then 'D'
              when calculated CNT_IN_P > calculated CNT_IN_P1             
               and calculated CNTA_IN_P1 > 0 and calculated CNTT_IN_P1 = 0
             													   then 'E'
              when calculated CNT_IN_P > calculated CNT_IN_P1             
               and calculated CNTA_IN_P1 = 0 and calculated CNTT_IN_P1 > 0                           
              													   then 'F'
              else                                                      'X'
         end as GROUP
    from PERIOD       a
           left join 
         PERIOD1      b
           on  a.FIRM_ID  = b.FIRM_ID 
           and a.SICH     = b.SICH
           and a.MERGER_ID= b.MERGER_ID
   group by a.MERGER_ID, a.SICH
   having calculated GROUP ne 'X' 
      and b.FIRM_ID ne .
  order by a.MERGER_ID, a.SICH, a.FIRM_ID;
quit;


It works really well, and thank to you I have just learned how to use "calculated" in the case statement. I haven't used this handy statement before so it's great to have it in my "SQL toolkit". 

Edit: I have only made a few minor changes in the "case when" statement. Otherwise, it is exactly the same code as you have.  

ChrisNZ
Tourmaline | Level 20

Glad it helped. 🙂

I find that deriving intermediate values, and then using those to derive final flags makes it much easier to debug and arrive at the desired result.

A post doesn't have to be exactly right to be an answer. As long as it helps you and shows you how to arrive at the solution, that's fine.

Yegen
Pyrite | Level 9

This was super helpful. Quite frankly, I was amazed on how efficiently the code is written (my code was super long, but you summed it up in just one SQL statement). Smiley Happy Our SAS community is amazing and super helpful! I really appreciate all of your replies. This is such a wonderful websites and allows one to learn SAS much faster! I am loving it. 

Thanks again. 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2641 views
  • 6 likes
  • 3 in conversation