BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
LanMin
Fluorite | Level 6

Dear all,

this might be an easy question for those experienced programmers.

thanks in advance for your consideration!

a small sample (real data is huge in size)  is given below. I want to create two variables:  (1) industry_id : if firms share the same SIC (based on first three digits), it belongs to the same industry. in my example, we would have industry_id=1 for sic 3100, 3101 ; industry_id=2 for sic 3300, 3303 ; (2) compute industry mean return based on industry_id and year, but such mean must exclude ith firm,

for example, firm id 1001, 1002, 1003 belong to industry_id=1 , for year 1990, firmid 1001, mean_ret =(.7+.2)/2=.45,

for year 1991, firmid 1001, mean_ret =(-0.5+.4)/2=-.05

please help

firm_idSICYearReturn
1001310019900.01
1001310019910.2
1001310019920.3
1001310019930.7
100131001994-0.5
1001310019950.2
1001310019960.4
1002310119900.7
100231011991-0.5
1002310119920.2
1002310119930.4
1002310119940.4
1002310119950.2
1002310119960.2
1003310119900.2
1003310119910.4
1003310119920.7
100331011993-0.5
1003310119940.2
1003310119950.2
1003310119960.2
100633001990-0.5
1006330019910.9
1006330019920.2
1006330019930.09
1006330019940.09
1006330019950.22
1006330019960.1
1004330319900.4
1004330319910.2
1004330319920.2
100433031993-0.5
1004330319940.9
1004330319950.22
1004330319960.1

-Lan

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

A modified self join should work.


data have;

    input firm_id    SIC    Year    Return;

    id=substr(put(sic,4.),1,3);

cards;

1001    3100    1990    0.01

1001    3100    1991    0.2

1001    3100    1992    0.3

1001    3100    1993    0.7

1001    3100    1994    -0.5

1001    3100    1995    0.2

1001    3100    1996    0.4

1002    3101    1990    0.7

1002    3101    1991    -0.5

1002    3101    1992    0.2

1002    3101    1993    0.4

1002    3101    1994    0.4

1002    3101    1995    0.2

1002    3101    1996    0.2

1003    3101    1990    0.2

1003    3101    1991    0.4

1003    3101    1992    0.7

1003    3101    1993    -0.5

1003    3101    1994    0.2

1003    3101    1995    0.2

1003    3101    1996    0.2

1006    3300    1990    -0.5

1006    3300    1991    0.9

1006    3300    1992    0.2

1006    3300    1993    0.09

1006    3300    1994    0.09

1006    3300    1995    0.22

1006    3300    1996    0.1

1004    3303    1990    0.4

1004    3303    1991    0.2

1004    3303    1992    0.2

1004    3303    1993    -0.5

1004    3303    1994    0.9

1004    3303    1995    0.22

1004    3303    1996    0.1

;

run;

PROC SQL;

  CREATE TABLE WANT AS

  SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN

  FROM HAVE AS A

  LEFT JOIN HAVE AS B

  ON A.ID=B.ID

  AND A.YEAR=B.YEAR

  AND A.FIRM_ID NE B.FIRM_ID

  GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;

QUIT;

View solution in original post

21 REPLIES 21
Patrick
Opal | Level 21

I haven't understood what you meant by "but such mean must exclude ith firm" and haven't seen something like that in your example - so this is not covered in below (untested) SQL.

For the industry_id: You say that this are the first 3 digits of the SIC. If so the I believe it's best to create an industry_id containing exactly that - the first 3 digits of SIC.

proc sql;

create table want as

   select

    firm_id,

    SIC,

    Year,

    Return,

    substrn(SIC,1,3) as industry_id,

    avg(return) as mean_ret

   from have

   group by substrn(SIC,1,3), year

   ;

quit;

LanMin
Fluorite | Level 6

thank you, Patrick!

"but such mean must exclude ith firm" what I am asking here is that when I compute mean return by industry for firm i, we will use on firm i's peers return divided by the industry size minus 1 (i.e. excluding i).

how do I exclude firm i ?

-Lan

Jagadishkatam
Amethyst | Level 16

alternatively by datastep,

data have;

    input firm_id    SIC$    Year    Return;

    id=substr(sic,1,3);

cards;

1001    3100    1990    0.01

1001    3100    1991    0.2

1001    3100    1992    0.3

1001    3100    1993    0.7

1001    3100    1994    -0.5

1001    3100    1995    0.2

1001    3100    1996    0.4

1002    3101    1990    0.7

1002    3101    1991    -0.5

1002    3101    1992    0.2

1002    3101    1993    0.4

1002    3101    1994    0.4

1002    3101    1995    0.2

1002    3101    1996    0.2

1003    3101    1990    0.2

1003    3101    1991    0.4

1003    3101    1992    0.7

1003    3101    1993    -0.5

1003    3101    1994    0.2

1003    3101    1995    0.2

1003    3101    1996    0.2

1006    3300    1990    -0.5

1006    3300    1991    0.9

1006    3300    1992    0.2

1006    3300    1993    0.09

1006    3300    1994    0.09

1006    3300    1995    0.22

1006    3300    1996    0.1

1004    3303    1990    0.4

1004    3303    1991    0.2

1004    3303    1992    0.2

1004    3303    1993    -0.5

1004    3303    1994    0.9

1004    3303    1995    0.22

1004    3303    1996    0.1

;

run;

proc sort data = have;

    by id year;

run;

proc means data=have mean nway;

class id year;

var return;

output out=means(drop=_type_ _freq_) mean= /autoname;

run;

data  want;

    merge means have;

    by id year;

run;

proc sort data=want;

    by firm_id    SIC;

run;

Thanks,

Jagadish

Thanks,
Jag
Jagadishkatam
Amethyst | Level 16

Hi Lan,

i saw your post to Patrick after i submitted the above code, So accordingly i modified the my code. Please try

data have;

    input firm_id    SIC    Year    Return;

    id=substr(put(sic,4.),1,3);

cards;

1001    3100    1990    0.01

1001    3100    1991    0.2

1001    3100    1992    0.3

1001    3100    1993    0.7

1001    3100    1994    -0.5

1001    3100    1995    0.2

1001    3100    1996    0.4

1002    3101    1990    0.7

1002    3101    1991    -0.5

1002    3101    1992    0.2

1002    3101    1993    0.4

1002    3101    1994    0.4

1002    3101    1995    0.2

1002    3101    1996    0.2

1003    3101    1990    0.2

1003    3101    1991    0.4

1003    3101    1992    0.7

1003    3101    1993    -0.5

1003    3101    1994    0.2

1003    3101    1995    0.2

1003    3101    1996    0.2

1006    3300    1990    -0.5

1006    3300    1991    0.9

1006    3300    1992    0.2

1006    3300    1993    0.09

1006    3300    1994    0.09

1006    3300    1995    0.22

1006    3300    1996    0.1

1004    3303    1990    0.4

1004    3303    1991    0.2

1004    3303    1992    0.2

1004    3303    1993    -0.5

1004    3303    1994    0.9

1004    3303    1995    0.22

1004    3303    1996    0.1

;

run;

proc sort data = have;

    by year firm_id ;

run;

data have_;

    set have;

    by     year firm_id ;

    if first.year then flag=1;

run;

proc means data=have_(where=(flag^=1)) mean nway;

class id year;

var return;

output out=means(drop=_type_ _freq_) mean= /autoname;

run;

proc sort data = have;

    by id year  ;

run;

data  want;

    merge means have;

    by id year;

run;

proc sort data=want;

    by firm_id    SIC;

run;

Thanks,

Jagadish

Thanks,
Jag
LanMin
Fluorite | Level 6

Jagadish,

thanks so much ! !

I should have provided an expected output, what I need for the mean return is going to be each firm-industry-year (see example below), it is my mistake not to make it clear-that is why your code produces  mean return per industry-year

want: i computed 2 of them as illustration

the first one: 0.45 = [0.7 (from firm id 1002, year 1990)+0.2(from firm id 1003, year 1990) ] / 2=.45

firm_idSICYearReturnIndustry_IDMean_return
1001310019900.013100.45
1001310019910.2310-0.05
1001310019920.3310
1001310019930.7310
100131001994-0.5310
1001310019950.2310
1001310019960.4310
1002310119900.7310
100231011991-0.5310
1002310119920.2310
1002310119930.4310
1002310119940.4310
1002310119950.2310
1002310119960.2310
1003310119900.2310
1003310119910.4310
1003310119920.7310
100331011993-0.5310
1003310119940.2310
1003310119950.2310
1003310119960.2310
100633001990-0.5330
1006330019910.9330
1006330019920.2330
1006330019930.09330
1006330019940.09330
1006330019950.22330
1006330019960.1330
1004330319900.4330
1004330319910.2330
1004330319920.2330
100433031993-0.5330
1004330319940.9330
1004330319950.22330
1004330319960.1330

could you help modify your code?

-lan

Reeza
Super User

A modified self join should work.


data have;

    input firm_id    SIC    Year    Return;

    id=substr(put(sic,4.),1,3);

cards;

1001    3100    1990    0.01

1001    3100    1991    0.2

1001    3100    1992    0.3

1001    3100    1993    0.7

1001    3100    1994    -0.5

1001    3100    1995    0.2

1001    3100    1996    0.4

1002    3101    1990    0.7

1002    3101    1991    -0.5

1002    3101    1992    0.2

1002    3101    1993    0.4

1002    3101    1994    0.4

1002    3101    1995    0.2

1002    3101    1996    0.2

1003    3101    1990    0.2

1003    3101    1991    0.4

1003    3101    1992    0.7

1003    3101    1993    -0.5

1003    3101    1994    0.2

1003    3101    1995    0.2

1003    3101    1996    0.2

1006    3300    1990    -0.5

1006    3300    1991    0.9

1006    3300    1992    0.2

1006    3300    1993    0.09

1006    3300    1994    0.09

1006    3300    1995    0.22

1006    3300    1996    0.1

1004    3303    1990    0.4

1004    3303    1991    0.2

1004    3303    1992    0.2

1004    3303    1993    -0.5

1004    3303    1994    0.9

1004    3303    1995    0.22

1004    3303    1996    0.1

;

run;

PROC SQL;

  CREATE TABLE WANT AS

  SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN

  FROM HAVE AS A

  LEFT JOIN HAVE AS B

  ON A.ID=B.ID

  AND A.YEAR=B.YEAR

  AND A.FIRM_ID NE B.FIRM_ID

  GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;

QUIT;

LanMin
Fluorite | Level 6

Many thanks to Reeza, Jagadishkatam, and Patrick !!!

-lan

Jagadishkatam
Amethyst | Level 16

Hi Lan,

I left my answer incomplete in my earlier post, please add the below code to my previous code and you will get the desired output.

proc sort data=want;

    by year firm_id;

run;

data want_;

    set want;

    by year firm_id;

    if first.year then Return_Mean=Return_Mean;

    else Return_Mean=.;

run;

Thanks,

Jagadish

Thanks,
Jag
LanMin
Fluorite | Level 6

Thank you, Jagadish. I will try it.

LanMin
Fluorite | Level 6

Reeza,

I used your code for my real data which contains  2717655 observations. I got "ERROR SORT EXECUTION FAILURE" my guess is sas ran out of memory. my thought is to break the data into smaller sets by industry id, then run your code, after that, merge all the data back into a big dataset.

Do you have an efficient way of doing this?

this is  the part of the code I got error - my guess is sas ran out of memory, it look over an hour for sas to run.

* note: id=substr(put(sic,4.),1,3);

PROC SQL;

  CREATE TABLE OUT AS

  SELECT A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN, AVG(B.RETURN) AS AVG_RETURN

  FROM HAVE AS A

  LEFT JOIN HAVE AS B

  ON A.ID=B.ID

  AND A.YEAR=B.YEAR

  AND A.FIRM_ID NE B.FIRM_ID

  GROUP BY A.FIRM_ID, A.SIC, A.ID, A.YEAR, A.RETURN;

QUIT;

thanks very much,

-Lan

data_null__
Jade | Level 19

Isyour input data is not sorted by or index on any of the key variables.  What is the data type of SIC?  Do you have a lot of extra variable in the input data that you did not mention?

Tal
Pyrite | Level 9 Tal
Pyrite | Level 9

hi guys,

i have a question on this problem.

So if i sort the dataset given above  by indt_id year firm_id to get what i have shown here and run the following code i am not getting the right avg

Whats  wrong with my code ? Smiley Happy

data sorted;

set have;

by indt_id year firm_id;

count=0;

if not first.year then do;

count=count+1;

sum+return;end;

if last.year then do;

avg=sum/count;output; end;

proc print;run;

                                            Obs    firm_id    SIC     Year    Return    Indt_id

                                              1     1001      3100    1990      0.01      310

                                              2     1002      3101    1990      0.70      310

                                              3     1003      3101    1990      0.20      310

                                              4     1001      3100    1991      0.20      310

                                              5     1002      3101    1991     -0.50      310

                                              6     1003      3101    1991      0.40      310

                                              7     1001      3100    1992      0.30      310

                                              8     1002      3101    1992      0.20      310

                                              9     1003      3101    1992      0.70      310

                                             10     1001      3100    1993      0.70      310

                                             11     1002      3101    1993      0.40      310

                                             12     1003      3101    1993     -0.50      310

                                             13     1001      3100    1994     -0.50      310

                                             14     1002      3101    1994      0.40      310

                                             15     1003      3101    1994      0.20      310

                                             16     1001      3100    1995      0.20      310

                                             17     1002      3101    1995      0.20      310

                                             18     1003      3101    1995      0.20      310

                                             19     1001      3100    1996      0.40      310

                                             20     1002      3101    1996      0.20      310

                                             21     1003      3101    1996      0.20      310

                                             22     1004      3303    1990      0.40      330

                                             23     1006      3300    1990     -0.50      330

                                             24     1004      3303    1991      0.20      330

                                             25     1006      3300    1991      0.90      330

                                             26     1004      3303    1992      0.20      330

                                             27     1006      3300    1992      0.20      330

                                             28     1004      3303    1993     -0.50      330

                                             29     1006      3300    1993      0.09      330

                                             30     1004      3303    1994      0.90      330

                                             31     1006      3300    1994      0.09      330

                                             32     1004      3303    1995      0.22      330

                                             33     1006      3300    1995      0.22      330

                                             34     1004      3303    1996      0.10      330

                                             35     1006      3300    1996      0.10      330

data_null__
Jade | Level 19

If you can create SIC that groups properly $3 (SIC2) and sort the data and make a reasonable guess at the maximum number of FIRMS per YEAR this data step should work with your actual data.  I don't know about performance.

data firm;
   input firm:$4. SIC:$4. Year Return;
   length sic2 $3;
   sic2 = sic;
  
cards;
1001    3100    1990    0.01
1001    3100    1991    0.2
1001    3100    1992    0.3
1001    3100    1993    0.7
1001    3100    1994    -0.5
1001    3100    1995    0.2
1001    3100    1996    0.4
1002    3101    1990    0.7
1002    3101    1991    -0.5
1002    3101    1992    0.2
1002    3101    1993    0.4
1002    3101    1994    0.4
1002    3101    1995    0.2
1002    3101    1996    0.2
1003    3101    1990    0.2
1003    3101    1991    0.4
1003    3101    1992    0.7
1003    3101    1993    -0.5
1003    3101    1994    0.2
1003    3101    1995    0.2
1003    3101    1996    0.2
1006    3300    1990    -0.5
1006    3300    1991    0.9
1006    3300    1992    0.2
1006    3300    1993    0.09
1006    3300    1994    0.09
1006    3300    1995    0.22
1006    3300    1996    0.1
1004    3303    1990    0.4
1004    3303    1991    0.2
1004    3303    1992    0.2
1004    3303    1993    -0.5
1004    3303    1994    0.9
1004    3303    1995    0.22
1004    3303    1996    0.1
;
run;
proc sort;
  
by sic2 year firm;
   run;
proc print;
  
run;

%let d=5; *max firms per year;
data mean;
   i = 0;
  
do until(last.year);
      set firm;
      by sic2 year;
      array s[&d] $4 _temporary_;
     
array f[&d] $4 _temporary_;
     
array r[&d] 8  _temporary_;
      i +
1;
      s = sic;
      f = firm;
      r = return;
     
end;
  
do i = 1 to i;
      return = r;
      r   = .;
      sic    = s;
      firm   = f;
      mean   = mean(of r
  • );
         
  • output;
          r   = return;
         
    end;
      
    call missing(of s
  • f
  • r
  • );
  •    drop i;
       run;
    proc print;
      
    run;
         

    Message was edited by: data _null_ I forgot to add CALL MISSING when I switched to temporary arrays.

    Tal
    Pyrite | Level 9 Tal
    Pyrite | Level 9

    Thank you sir.but if i want to use dows without arrays after sorting the set by sic2 year firm_id and running the dow code to get the mean for which by “ sic2 year firm_id” group excluding the 1st record of each group i don’t get the expected results.

    Count and sum don’t get to missing when a new “by sic2 year firm_id” group starts, why?

    SAS Innovate 2025: Save the Date

     SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

    Save the date!

    How to Concatenate Values

    Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

    Find more tutorials on the SAS Users YouTube channel.

    SAS Training: Just a Click Away

     Ready to level-up your skills? Choose your own adventure.

    Browse our catalog!

    Discussion stats
    • 21 replies
    • 4498 views
    • 7 likes
    • 8 in conversation