Dear all,
this might be an easy question for those experienced programmers.
thanks in advance for your consideration!
a small sample (real data is huge in size) is given below. I want to create two variables: (1) industry_id : if firms share the same SIC (based on first three digits), it belongs to the same industry. in my example, we would have industry_id=1 for sic 3100, 3101 ; industry_id=2 for sic 3300, 3303 ; (2) compute industry mean return based on industry_id and year, but such mean must exclude ith firm,
for example, firm id 1001, 1002, 1003 belong to industry_id=1 , for year 1990, firmid 1001, mean_ret =(.7+.2)/2=.45,
for year 1991, firmid 1001, mean_ret =(-0.5+.4)/2=-.05
please help
firm_id | SIC | Year | Return |
1001 | 3100 | 1990 | 0.01 |
1001 | 3100 | 1991 | 0.2 |
1001 | 3100 | 1992 | 0.3 |
1001 | 3100 | 1993 | 0.7 |
1001 | 3100 | 1994 | -0.5 |
1001 | 3100 | 1995 | 0.2 |
1001 | 3100 | 1996 | 0.4 |
1002 | 3101 | 1990 | 0.7 |
1002 | 3101 | 1991 | -0.5 |
1002 | 3101 | 1992 | 0.2 |
1002 | 3101 | 1993 | 0.4 |
1002 | 3101 | 1994 | 0.4 |
1002 | 3101 | 1995 | 0.2 |
1002 | 3101 | 1996 | 0.2 |
1003 | 3101 | 1990 | 0.2 |
1003 | 3101 | 1991 | 0.4 |
1003 | 3101 | 1992 | 0.7 |
1003 | 3101 | 1993 | -0.5 |
1003 | 3101 | 1994 | 0.2 |
1003 | 3101 | 1995 | 0.2 |
1003 | 3101 | 1996 | 0.2 |
1006 | 3300 | 1990 | -0.5 |
1006 | 3300 | 1991 | 0.9 |
1006 | 3300 | 1992 | 0.2 |
1006 | 3300 | 1993 | 0.09 |
1006 | 3300 | 1994 | 0.09 |
1006 | 3300 | 1995 | 0.22 |
1006 | 3300 | 1996 | 0.1 |
1004 | 3303 | 1990 | 0.4 |
1004 | 3303 | 1991 | 0.2 |
1004 | 3303 | 1992 | 0.2 |
1004 | 3303 | 1993 | -0.5 |
1004 | 3303 | 1994 | 0.9 |
1004 | 3303 | 1995 | 0.22 |
1004 | 3303 | 1996 | 0.1 |
-Lan
A modified self join should work.
data have;
input firm_id SIC Year Return;
id=substr(put(sic,4.),1,3);
cards;
1001 3100 1990 0.01
1001 3100 1991 0.2
1001 3100 1992 0.3
1001 3100 1993 0.7
1001 3100 1994 -0.5
1001 3100 1995 0.2
1001 3100 1996 0.4
1002 3101 1990 0.7
1002 3101 1991 -0.5
1002 3101 1992 0.2
1002 3101 1993 0.4
1002 3101 1994 0.4
1002 3101 1995 0.2
1002 3101 1996 0.2
1003 3101 1990 0.2
1003 3101 1991 0.4
1003 3101 1992 0.7
1003 3101 1993 -0.5
1003 3101 1994 0.2
1003 3101 1995 0.2
1003 3101 1996 0.2
1006 3300 1990 -0.5
1006 3300 1991 0.9
1006 3300 1992 0.2
1006 3300 1993 0.09
1006 3300 1994 0.09
1006 3300 1995 0.22
1006 3300 1996 0.1
1004 3303 1990 0.4
1004 3303 1991 0.2
1004 3303 1992 0.2
1004 3303 1993 -0.5
1004 3303 1994 0.9
1004 3303 1995 0.22
1004 3303 1996 0.1
;
run;
PROC SQL;
CREATE TABLE WANT AS
SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN
FROM HAVE AS A
LEFT JOIN HAVE AS B
ON A.ID=B.ID
AND A.YEAR=B.YEAR
AND A.FIRM_ID NE B.FIRM_ID
GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;
QUIT;
I haven't understood what you meant by "but such mean must exclude ith firm" and haven't seen something like that in your example - so this is not covered in below (untested) SQL.
For the industry_id: You say that this are the first 3 digits of the SIC. If so the I believe it's best to create an industry_id containing exactly that - the first 3 digits of SIC.
proc sql;
create table want as
select
firm_id,
SIC,
Year,
Return,
substrn(SIC,1,3) as industry_id,
avg(return) as mean_ret
from have
group by substrn(SIC,1,3), year
;
quit;
thank you, Patrick!
"but such mean must exclude ith firm" what I am asking here is that when I compute mean return by industry for firm i, we will use on firm i's peers return divided by the industry size minus 1 (i.e. excluding i).
how do I exclude firm i ?
-Lan
alternatively by datastep,
data have;
input firm_id SIC$ Year Return;
id=substr(sic,1,3);
cards;
1001 3100 1990 0.01
1001 3100 1991 0.2
1001 3100 1992 0.3
1001 3100 1993 0.7
1001 3100 1994 -0.5
1001 3100 1995 0.2
1001 3100 1996 0.4
1002 3101 1990 0.7
1002 3101 1991 -0.5
1002 3101 1992 0.2
1002 3101 1993 0.4
1002 3101 1994 0.4
1002 3101 1995 0.2
1002 3101 1996 0.2
1003 3101 1990 0.2
1003 3101 1991 0.4
1003 3101 1992 0.7
1003 3101 1993 -0.5
1003 3101 1994 0.2
1003 3101 1995 0.2
1003 3101 1996 0.2
1006 3300 1990 -0.5
1006 3300 1991 0.9
1006 3300 1992 0.2
1006 3300 1993 0.09
1006 3300 1994 0.09
1006 3300 1995 0.22
1006 3300 1996 0.1
1004 3303 1990 0.4
1004 3303 1991 0.2
1004 3303 1992 0.2
1004 3303 1993 -0.5
1004 3303 1994 0.9
1004 3303 1995 0.22
1004 3303 1996 0.1
;
run;
proc sort data = have;
by id year;
run;
proc means data=have mean nway;
class id year;
var return;
output out=means(drop=_type_ _freq_) mean= /autoname;
run;
data want;
merge means have;
by id year;
run;
proc sort data=want;
by firm_id SIC;
run;
Thanks,
Jagadish
Hi Lan,
i saw your post to Patrick after i submitted the above code, So accordingly i modified the my code. Please try
data have;
input firm_id SIC Year Return;
id=substr(put(sic,4.),1,3);
cards;
1001 3100 1990 0.01
1001 3100 1991 0.2
1001 3100 1992 0.3
1001 3100 1993 0.7
1001 3100 1994 -0.5
1001 3100 1995 0.2
1001 3100 1996 0.4
1002 3101 1990 0.7
1002 3101 1991 -0.5
1002 3101 1992 0.2
1002 3101 1993 0.4
1002 3101 1994 0.4
1002 3101 1995 0.2
1002 3101 1996 0.2
1003 3101 1990 0.2
1003 3101 1991 0.4
1003 3101 1992 0.7
1003 3101 1993 -0.5
1003 3101 1994 0.2
1003 3101 1995 0.2
1003 3101 1996 0.2
1006 3300 1990 -0.5
1006 3300 1991 0.9
1006 3300 1992 0.2
1006 3300 1993 0.09
1006 3300 1994 0.09
1006 3300 1995 0.22
1006 3300 1996 0.1
1004 3303 1990 0.4
1004 3303 1991 0.2
1004 3303 1992 0.2
1004 3303 1993 -0.5
1004 3303 1994 0.9
1004 3303 1995 0.22
1004 3303 1996 0.1
;
run;
proc sort data = have;
by year firm_id ;
run;
data have_;
set have;
by year firm_id ;
if first.year then flag=1;
run;
proc means data=have_(where=(flag^=1)) mean nway;
class id year;
var return;
output out=means(drop=_type_ _freq_) mean= /autoname;
run;
proc sort data = have;
by id year ;
run;
data want;
merge means have;
by id year;
run;
proc sort data=want;
by firm_id SIC;
run;
Thanks,
Jagadish
Jagadish,
thanks so much ! !
I should have provided an expected output, what I need for the mean return is going to be each firm-industry-year (see example below), it is my mistake not to make it clear-that is why your code produces mean return per industry-year
want: i computed 2 of them as illustration
the first one: 0.45 = [0.7 (from firm id 1002, year 1990)+0.2(from firm id 1003, year 1990) ] / 2=.45
firm_id | SIC | Year | Return | Industry_ID | Mean_return |
1001 | 3100 | 1990 | 0.01 | 310 | 0.45 |
1001 | 3100 | 1991 | 0.2 | 310 | -0.05 |
1001 | 3100 | 1992 | 0.3 | 310 | |
1001 | 3100 | 1993 | 0.7 | 310 | |
1001 | 3100 | 1994 | -0.5 | 310 | |
1001 | 3100 | 1995 | 0.2 | 310 | |
1001 | 3100 | 1996 | 0.4 | 310 | |
1002 | 3101 | 1990 | 0.7 | 310 | |
1002 | 3101 | 1991 | -0.5 | 310 | |
1002 | 3101 | 1992 | 0.2 | 310 | |
1002 | 3101 | 1993 | 0.4 | 310 | |
1002 | 3101 | 1994 | 0.4 | 310 | |
1002 | 3101 | 1995 | 0.2 | 310 | |
1002 | 3101 | 1996 | 0.2 | 310 | |
1003 | 3101 | 1990 | 0.2 | 310 | |
1003 | 3101 | 1991 | 0.4 | 310 | |
1003 | 3101 | 1992 | 0.7 | 310 | |
1003 | 3101 | 1993 | -0.5 | 310 | |
1003 | 3101 | 1994 | 0.2 | 310 | |
1003 | 3101 | 1995 | 0.2 | 310 | |
1003 | 3101 | 1996 | 0.2 | 310 | |
1006 | 3300 | 1990 | -0.5 | 330 | |
1006 | 3300 | 1991 | 0.9 | 330 | |
1006 | 3300 | 1992 | 0.2 | 330 | |
1006 | 3300 | 1993 | 0.09 | 330 | |
1006 | 3300 | 1994 | 0.09 | 330 | |
1006 | 3300 | 1995 | 0.22 | 330 | |
1006 | 3300 | 1996 | 0.1 | 330 | |
1004 | 3303 | 1990 | 0.4 | 330 | |
1004 | 3303 | 1991 | 0.2 | 330 | |
1004 | 3303 | 1992 | 0.2 | 330 | |
1004 | 3303 | 1993 | -0.5 | 330 | |
1004 | 3303 | 1994 | 0.9 | 330 | |
1004 | 3303 | 1995 | 0.22 | 330 | |
1004 | 3303 | 1996 | 0.1 | 330 |
could you help modify your code?
-lan
A modified self join should work.
data have;
input firm_id SIC Year Return;
id=substr(put(sic,4.),1,3);
cards;
1001 3100 1990 0.01
1001 3100 1991 0.2
1001 3100 1992 0.3
1001 3100 1993 0.7
1001 3100 1994 -0.5
1001 3100 1995 0.2
1001 3100 1996 0.4
1002 3101 1990 0.7
1002 3101 1991 -0.5
1002 3101 1992 0.2
1002 3101 1993 0.4
1002 3101 1994 0.4
1002 3101 1995 0.2
1002 3101 1996 0.2
1003 3101 1990 0.2
1003 3101 1991 0.4
1003 3101 1992 0.7
1003 3101 1993 -0.5
1003 3101 1994 0.2
1003 3101 1995 0.2
1003 3101 1996 0.2
1006 3300 1990 -0.5
1006 3300 1991 0.9
1006 3300 1992 0.2
1006 3300 1993 0.09
1006 3300 1994 0.09
1006 3300 1995 0.22
1006 3300 1996 0.1
1004 3303 1990 0.4
1004 3303 1991 0.2
1004 3303 1992 0.2
1004 3303 1993 -0.5
1004 3303 1994 0.9
1004 3303 1995 0.22
1004 3303 1996 0.1
;
run;
PROC SQL;
CREATE TABLE WANT AS
SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN
FROM HAVE AS A
LEFT JOIN HAVE AS B
ON A.ID=B.ID
AND A.YEAR=B.YEAR
AND A.FIRM_ID NE B.FIRM_ID
GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;
QUIT;
Many thanks to Reeza, Jagadishkatam, and Patrick !!!
-lan
Hi Lan,
I left my answer incomplete in my earlier post, please add the below code to my previous code and you will get the desired output.
proc sort data=want;
by year firm_id;
run;
data want_;
set want;
by year firm_id;
if first.year then Return_Mean=Return_Mean;
else Return_Mean=.;
run;
Thanks,
Jagadish
Thank you, Jagadish. I will try it.
Reeza,
I used your code for my real data which contains 2717655 observations. I got "ERROR SORT EXECUTION FAILURE" my guess is sas ran out of memory. my thought is to break the data into smaller sets by industry id, then run your code, after that, merge all the data back into a big dataset.
Do you have an efficient way of doing this?
this is the part of the code I got error - my guess is sas ran out of memory, it look over an hour for sas to run.
* note: id=substr(put(sic,4.),1,3);
PROC SQL;
CREATE TABLE OUT AS
SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN
FROM HAVE AS A
LEFT JOIN HAVE AS B
ON A.ID=B.ID
AND A.YEAR=B.YEAR
AND A.FIRM_ID NE B.FIRM_ID
GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;
QUIT;
thanks very much,
-Lan
Isyour input data is not sorted by or index on any of the key variables. What is the data type of SIC? Do you have a lot of extra variable in the input data that you did not mention?
hi guys,
i have a question on this problem.
So if i sort the dataset given above by indt_id year firm_id to get what i have shown here and run the following code i am not getting the right avg
Whats wrong with my code ?
data sorted;
set have;
by indt_id year firm_id;
count=0;
if not first.year then do;
count=count+1;
sum+return;end;
if last.year then do;
avg=sum/count;output; end;
proc print;run;
Obs firm_id SIC Year Return Indt_id
1 1001 3100 1990 0.01 310
2 1002 3101 1990 0.70 310
3 1003 3101 1990 0.20 310
4 1001 3100 1991 0.20 310
5 1002 3101 1991 -0.50 310
6 1003 3101 1991 0.40 310
7 1001 3100 1992 0.30 310
8 1002 3101 1992 0.20 310
9 1003 3101 1992 0.70 310
10 1001 3100 1993 0.70 310
11 1002 3101 1993 0.40 310
12 1003 3101 1993 -0.50 310
13 1001 3100 1994 -0.50 310
14 1002 3101 1994 0.40 310
15 1003 3101 1994 0.20 310
16 1001 3100 1995 0.20 310
17 1002 3101 1995 0.20 310
18 1003 3101 1995 0.20 310
19 1001 3100 1996 0.40 310
20 1002 3101 1996 0.20 310
21 1003 3101 1996 0.20 310
22 1004 3303 1990 0.40 330
23 1006 3300 1990 -0.50 330
24 1004 3303 1991 0.20 330
25 1006 3300 1991 0.90 330
26 1004 3303 1992 0.20 330
27 1006 3300 1992 0.20 330
28 1004 3303 1993 -0.50 330
29 1006 3300 1993 0.09 330
30 1004 3303 1994 0.90 330
31 1006 3300 1994 0.09 330
32 1004 3303 1995 0.22 330
33 1006 3300 1995 0.22 330
34 1004 3303 1996 0.10 330
35 1006 3300 1996 0.10 330
If you can create SIC that groups properly $3 (SIC2) and sort the data and make a reasonable guess at the maximum number of FIRMS per YEAR this data step should work with your actual data. I don't know about performance.
Message was edited by: data _null_ I forgot to add CALL MISSING when I switched to temporary arrays.
Thank you sir.but if i want to use dows without arrays after sorting the set by sic2 year firm_id and running the dow code to get the mean for which by “ sic2 year firm_id” group excluding the 1st record of each group i don’t get the expected results.
Count and sum don’t get to missing when a new “by sic2 year firm_id” group starts, why?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.