Solved: Re: Compute mean by Industry excluding ith firm

LanMin · Posted 11-02-2013 01:07 AM

Dear all,

this might be an easy question for those experienced programmers.

thanks in advance for your consideration!

a small sample (real data is huge in size) is given below. I want to create two variables: (1) industry_id : if firms share the same SIC (based on first three digits), it belongs to the same industry. in my example, we would have industry_id=1 for sic 3100, 3101 ; industry_id=2 for sic 3300, 3303 ; (2) compute industry mean return based on industry_id and year, but such mean must exclude ith firm,

for example, firm id 1001, 1002, 1003 belong to industry_id=1 , for year 1990, firmid 1001, mean_ret =(.7+.2)/2=.45,

for year 1991, firmid 1001, mean_ret =(-0.5+.4)/2=-.05

please help

firm_id	SIC	Year	Return
1001	3100	1990	0.01
1001	3100	1991	0.2
1001	3100	1992	0.3
1001	3100	1993	0.7
1001	3100	1994	-0.5
1001	3100	1995	0.2
1001	3100	1996	0.4
1002	3101	1990	0.7
1002	3101	1991	-0.5
1002	3101	1992	0.2
1002	3101	1993	0.4
1002	3101	1994	0.4
1002	3101	1995	0.2
1002	3101	1996	0.2
1003	3101	1990	0.2
1003	3101	1991	0.4
1003	3101	1992	0.7
1003	3101	1993	-0.5
1003	3101	1994	0.2
1003	3101	1995	0.2
1003	3101	1996	0.2
1006	3300	1990	-0.5
1006	3300	1991	0.9
1006	3300	1992	0.2
1006	3300	1993	0.09
1006	3300	1994	0.09
1006	3300	1995	0.22
1006	3300	1996	0.1
1004	3303	1990	0.4
1004	3303	1991	0.2
1004	3303	1992	0.2
1004	3303	1993	-0.5
1004	3303	1994	0.9
1004	3303	1995	0.22
1004	3303	1996	0.1

-Lan

Reeza · Posted 11-02-2013 04:24 PM

A modified self join should work.

data have;

input firm_id SIC Year Return;

id=substr(put(sic,4.),1,3);

cards;

1001 3100 1990 0.01

1001 3100 1991 0.2

1001 3100 1992 0.3

1001 3100 1993 0.7

1001 3100 1994 -0.5

1001 3100 1995 0.2

1001 3100 1996 0.4

1002 3101 1990 0.7

1002 3101 1991 -0.5

1002 3101 1992 0.2

1002 3101 1993 0.4

1002 3101 1994 0.4

1002 3101 1995 0.2

1002 3101 1996 0.2

1003 3101 1990 0.2

1003 3101 1991 0.4

1003 3101 1992 0.7

1003 3101 1993 -0.5

1003 3101 1994 0.2

1003 3101 1995 0.2

1003 3101 1996 0.2

1006 3300 1990 -0.5

1006 3300 1991 0.9

1006 3300 1992 0.2

1006 3300 1993 0.09

1006 3300 1994 0.09

1006 3300 1995 0.22

1006 3300 1996 0.1

1004 3303 1990 0.4

1004 3303 1991 0.2

1004 3303 1992 0.2

1004 3303 1993 -0.5

1004 3303 1994 0.9

1004 3303 1995 0.22

1004 3303 1996 0.1

;

run;

PROC SQL;

CREATE TABLE WANT AS

SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN

FROM HAVE AS A

LEFT JOIN HAVE AS B

ON A.ID=B.ID

AND A.YEAR=B.YEAR

AND A.FIRM_ID NE B.FIRM_ID

GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;

QUIT;

View solution in original post

Patrick · Posted 11-02-2013 03:47 AM

I haven't understood what you meant by "but such mean must exclude ith firm" and haven't seen something like that in your example - so this is not covered in below (untested) SQL.

For the industry_id: You say that this are the first 3 digits of the SIC. If so the I believe it's best to create an industry_id containing exactly that - the first 3 digits of SIC.

proc sql;

create table want as

select

firm_id,

SIC,

Year,

Return,

substrn(SIC,1,3) as industry_id,

avg(return) as mean_ret

from have

group by substrn(SIC,1,3), year

;

quit;

LanMin · Posted 11-02-2013 09:18 AM

thank you, Patrick!

"but such mean must exclude ith firm" what I am asking here is that when I compute mean return by industry for firm i, we will use on firm i's peers return divided by the industry size minus 1 (i.e. excluding i).

how do I exclude firm i ?

-Lan

Jagadishkatam · Posted 11-02-2013 09:20 AM

alternatively by datastep,

data have;

input firm_id SIC$ Year Return;

id=substr(sic,1,3);

cards;

1001 3100 1990 0.01

1001 3100 1991 0.2

1001 3100 1992 0.3

1001 3100 1993 0.7

1001 3100 1994 -0.5

1001 3100 1995 0.2

1001 3100 1996 0.4

1002 3101 1990 0.7

1002 3101 1991 -0.5

1002 3101 1992 0.2

1002 3101 1993 0.4

1002 3101 1994 0.4

1002 3101 1995 0.2

1002 3101 1996 0.2

1003 3101 1990 0.2

1003 3101 1991 0.4

1003 3101 1992 0.7

1003 3101 1993 -0.5

1003 3101 1994 0.2

1003 3101 1995 0.2

1003 3101 1996 0.2

1006 3300 1990 -0.5

1006 3300 1991 0.9

1006 3300 1992 0.2

1006 3300 1993 0.09

1006 3300 1994 0.09

1006 3300 1995 0.22

1006 3300 1996 0.1

1004 3303 1990 0.4

1004 3303 1991 0.2

1004 3303 1992 0.2

1004 3303 1993 -0.5

1004 3303 1994 0.9

1004 3303 1995 0.22

1004 3303 1996 0.1

;

run;

proc sort data = have;

by id year;

run;

proc means data=have mean nway;

class id year;

var return;

output out=means(drop=_type_ _freq_) mean= /autoname;

run;

data want;

merge means have;

by id year;

run;

proc sort data=want;

by firm_id SIC;

run;

Thanks,

Jagadish

Thanks,
Jag

Jagadishkatam · Posted 11-02-2013 09:30 AM

Hi Lan,

i saw your post to Patrick after i submitted the above code, So accordingly i modified the my code. Please try

data have;

input firm_id SIC Year Return;

id=substr(put(sic,4.),1,3);

cards;

1001 3100 1990 0.01

1001 3100 1991 0.2

1001 3100 1992 0.3

1001 3100 1993 0.7

1001 3100 1994 -0.5

1001 3100 1995 0.2

1001 3100 1996 0.4

1002 3101 1990 0.7

1002 3101 1991 -0.5

1002 3101 1992 0.2

1002 3101 1993 0.4

1002 3101 1994 0.4

1002 3101 1995 0.2

1002 3101 1996 0.2

1003 3101 1990 0.2

1003 3101 1991 0.4

1003 3101 1992 0.7

1003 3101 1993 -0.5

1003 3101 1994 0.2

1003 3101 1995 0.2

1003 3101 1996 0.2

1006 3300 1990 -0.5

1006 3300 1991 0.9

1006 3300 1992 0.2

1006 3300 1993 0.09

1006 3300 1994 0.09

1006 3300 1995 0.22

1006 3300 1996 0.1

1004 3303 1990 0.4

1004 3303 1991 0.2

1004 3303 1992 0.2

1004 3303 1993 -0.5

1004 3303 1994 0.9

1004 3303 1995 0.22

1004 3303 1996 0.1

;

run;

proc sort data = have;

by year firm_id ;

run;

data have_;

set have;

by year firm_id ;

if first.year then flag=1;

run;

proc means data=have_(where=(flag^=1)) mean nway;

class id year;

var return;

output out=means(drop=_type_ _freq_) mean= /autoname;

run;

proc sort data = have;

by id year ;

run;

data want;

merge means have;

by id year;

run;

proc sort data=want;

by firm_id SIC;

run;

Thanks,

Jagadish

Thanks,
Jag

LanMin · Posted 11-02-2013 10:03 AM

Jagadish,

thanks so much ! !

I should have provided an expected output, what I need for the mean return is going to be each firm-industry-year (see example below), it is my mistake not to make it clear-that is why your code produces mean return per industry-year

want: i computed 2 of them as illustration

the first one: 0.45 = [0.7 (from firm id 1002, year 1990)+0.2(from firm id 1003, year 1990) ] / 2=.45

firm_id	SIC	Year	Return	Industry_ID	Mean_return
1001	3100	1990	0.01	310	0.45
1001	3100	1991	0.2	310	-0.05
1001	3100	1992	0.3	310
1001	3100	1993	0.7	310
1001	3100	1994	-0.5	310
1001	3100	1995	0.2	310
1001	3100	1996	0.4	310
1002	3101	1990	0.7	310
1002	3101	1991	-0.5	310
1002	3101	1992	0.2	310
1002	3101	1993	0.4	310
1002	3101	1994	0.4	310
1002	3101	1995	0.2	310
1002	3101	1996	0.2	310
1003	3101	1990	0.2	310
1003	3101	1991	0.4	310
1003	3101	1992	0.7	310
1003	3101	1993	-0.5	310
1003	3101	1994	0.2	310
1003	3101	1995	0.2	310
1003	3101	1996	0.2	310
1006	3300	1990	-0.5	330
1006	3300	1991	0.9	330
1006	3300	1992	0.2	330
1006	3300	1993	0.09	330
1006	3300	1994	0.09	330
1006	3300	1995	0.22	330
1006	3300	1996	0.1	330
1004	3303	1990	0.4	330
1004	3303	1991	0.2	330
1004	3303	1992	0.2	330
1004	3303	1993	-0.5	330
1004	3303	1994	0.9	330
1004	3303	1995	0.22	330
1004	3303	1996	0.1	330

could you help modify your code?

-lan

Reeza · Posted 11-02-2013 04:24 PM

A modified self join should work.

data have;

input firm_id SIC Year Return;

id=substr(put(sic,4.),1,3);

cards;

1001 3100 1990 0.01

1001 3100 1991 0.2

1001 3100 1992 0.3

1001 3100 1993 0.7

1001 3100 1994 -0.5

1001 3100 1995 0.2

1001 3100 1996 0.4

1002 3101 1990 0.7

1002 3101 1991 -0.5

1002 3101 1992 0.2

1002 3101 1993 0.4

1002 3101 1994 0.4

1002 3101 1995 0.2

1002 3101 1996 0.2

1003 3101 1990 0.2

1003 3101 1991 0.4

1003 3101 1992 0.7

1003 3101 1993 -0.5

1003 3101 1994 0.2

1003 3101 1995 0.2

1003 3101 1996 0.2

1006 3300 1990 -0.5

1006 3300 1991 0.9

1006 3300 1992 0.2

1006 3300 1993 0.09

1006 3300 1994 0.09

1006 3300 1995 0.22

1006 3300 1996 0.1

1004 3303 1990 0.4

1004 3303 1991 0.2

1004 3303 1992 0.2

1004 3303 1993 -0.5

1004 3303 1994 0.9

1004 3303 1995 0.22

1004 3303 1996 0.1

;

run;

PROC SQL;

CREATE TABLE WANT AS

SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN

FROM HAVE AS A

LEFT JOIN HAVE AS B

ON A.ID=B.ID

AND A.YEAR=B.YEAR

AND A.FIRM_ID NE B.FIRM_ID

GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;

QUIT;

LanMin · Posted 11-02-2013 10:12 PM

Many thanks to Reeza, Jagadishkatam, and Patrick !!!

-lan

Jagadishkatam · Posted 11-03-2013 12:58 AM

Hi Lan,

I left my answer incomplete in my earlier post, please add the below code to my previous code and you will get the desired output.

proc sort data=want;

by year firm_id;

run;

data want_;

set want;

by year firm_id;

if first.year then Return_Mean=Return_Mean;

else Return_Mean=.;

run;

Thanks,

Jagadish

Thanks,
Jag

LanMin · Posted 11-03-2013 08:36 AM

Thank you, Jagadish. I will try it.

LanMin · Posted 11-03-2013 08:40 AM

Reeza,

I used your code for my real data which contains 2717655 observations. I got "ERROR SORT EXECUTION FAILURE" my guess is sas ran out of memory. my thought is to break the data into smaller sets by industry id, then run your code, after that, merge all the data back into a big dataset.

Do you have an efficient way of doing this?

this is the part of the code I got error - my guess is sas ran out of memory, it look over an hour for sas to run.

* note: id=substr(put(sic,4.),1,3);

PROC SQL;

CREATE TABLE OUT AS

SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN

FROM HAVE AS A

LEFT JOIN HAVE AS B

ON A.ID=B.ID

AND A.YEAR=B.YEAR

AND A.FIRM_ID NE B.FIRM_ID

GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;

QUIT;

thanks very much,

-Lan

data_null__ · Posted 11-03-2013 09:23 AM

Isyour input data is not sorted by or index on any of the key variables. What is the data type of SIC? Do you have a lot of extra variable in the input data that you did not mention?

Tal · Posted 11-03-2013 09:09 PM

hi guys,

i have a question on this problem.

So if i sort the dataset given above by indt_id year firm_id to get what i have shown here and run the following code i am not getting the right avg

Whats wrong with my code ?

data sorted;

set have;

by indt_id year firm_id;

count=0;

if not first.year then do;

count=count+1;

sum+return;end;

if last.year then do;

avg=sum/count;output; end;

proc print;run;

Obs firm_id SIC Year Return Indt_id

1 1001 3100 1990 0.01 310

2 1002 3101 1990 0.70 310

3 1003 3101 1990 0.20 310

4 1001 3100 1991 0.20 310

5 1002 3101 1991 -0.50 310

6 1003 3101 1991 0.40 310

7 1001 3100 1992 0.30 310

8 1002 3101 1992 0.20 310

9 1003 3101 1992 0.70 310

10 1001 3100 1993 0.70 310

11 1002 3101 1993 0.40 310

12 1003 3101 1993 -0.50 310

13 1001 3100 1994 -0.50 310

14 1002 3101 1994 0.40 310

15 1003 3101 1994 0.20 310

16 1001 3100 1995 0.20 310

17 1002 3101 1995 0.20 310

18 1003 3101 1995 0.20 310

19 1001 3100 1996 0.40 310

20 1002 3101 1996 0.20 310

21 1003 3101 1996 0.20 310

22 1004 3303 1990 0.40 330

23 1006 3300 1990 -0.50 330

24 1004 3303 1991 0.20 330

25 1006 3300 1991 0.90 330

26 1004 3303 1992 0.20 330

27 1006 3300 1992 0.20 330

28 1004 3303 1993 -0.50 330

29 1006 3300 1993 0.09 330

30 1004 3303 1994 0.90 330

31 1006 3300 1994 0.09 330

32 1004 3303 1995 0.22 330

33 1006 3300 1995 0.22 330

34 1004 3303 1996 0.10 330

35 1006 3300 1996 0.10 330

data_null__ · Posted 11-04-2013 07:12 AM

If you can create SIC that groups properly $3 (SIC2) and sort the data and make a reasonable guess at the maximum number of FIRMS per YEAR this data step should work with your actual data. I don't know about performance.

data firm;
   input firm:$4. SIC:$4. Year Return;
   length sic2 $3; 
   sic2 = sic;
   cards; 
1001    3100    1990    0.01
1001    3100    1991    0.2
1001    3100    1992    0.3
1001    3100    1993    0.7
1001    3100    1994    -0.5
1001    3100    1995    0.2
1001    3100    1996    0.4
1002    3101    1990    0.7
1002    3101    1991    -0.5
1002    3101    1992    0.2
1002    3101    1993    0.4
1002    3101    1994    0.4
1002    3101    1995    0.2
1002    3101    1996    0.2
1003    3101    1990    0.2
1003    3101    1991    0.4
1003    3101    1992    0.7
1003    3101    1993    -0.5
1003    3101    1994    0.2
1003    3101    1995    0.2
1003    3101    1996    0.2
1006    3300    1990    -0.5
1006    3300    1991    0.9
1006    3300    1992    0.2
1006    3300    1993    0.09
1006    3300    1994    0.09
1006    3300    1995    0.22
1006    3300    1996    0.1
1004    3303    1990    0.4
1004    3303    1991    0.2
1004    3303    1992    0.2
1004    3303    1993    -0.5
1004    3303    1994    0.9
1004    3303    1995    0.22
1004    3303    1996    0.1
; 
run; 
proc sort; 
   by sic2 year firm;
   run; 
proc print; 
   run; 

%let d=5; *max firms per year;
data mean;
   i = 0; 
   do until(last.year);
      set firm;
      by sic2 year;
      array s[&d] $4 _temporary_; 
      array f[&d] $4 _temporary_; 
      array r[&d] 8  _temporary_; 
      i + 1; 
      s = sic;
      f = firm;
      r = return;
      end; 
   do i = 1 to i;
      return = r;
      r   = .; 
      sic    = s;
      firm   = f;
      mean   = mean(of r);
      
output; 
      r   = return;
      end; 
   call missing(of s f
 r
);
   drop i;
   run; 
proc print; 
   run; 
      

Message was edited by: data _null_ I forgot to add CALL MISSING when I switched to temporary arrays.

Tal · Posted 11-05-2013 11:02 AM

Thank you sir.but if i want to use dows without arrays after sorting the set by sic2 year firm_id and running the dow code to get the mean for which by “ sic2 year firm_id” group excluding the 1^st record of each group i don’t get the expected results.

Count and sum don’t get to missing when a new “by sic2 year firm_id” group starts, why?

Registration is open

SAS Training: Just a Click Away