Re: +1 to +3 year average, and -2 to -4 year average

kt_uwa1990 · Posted 11-06-2012 02:20 AM

Hi all

If someone could help me calculate these specific averages I would be most grateful.

I have attached my clean data in a csv file to this post, as you will find the number of observations is not too cumbersome.

Basically my objective is for each firm, identified by the permno, I need to calculate the following equation:

(Average CFO in years +1 to +3) - (Average CFO in years -4 to -2) = Difference

Where difference should be the new column created. The CFO information is provided, I just need to determine how to write code for these averages.

Anyone with the brain power please help me with this = )

Ksharp · Posted 11-06-2012 03:51 AM

If you understand the code I wrote before, it would be easy.

proc import datafile='c:\step2.csv' out=test_vol dbms=csv replace;getnames=yes;run;

proc sql;
create table temp as
 select h.*,
      (select avg(cfo) from test_vol where lpermno = h.lpermno and year between h.year+1 and h.year+3 )-
      (select avg(cfo) from test_vol where lpermno = h.lpermno and year between h.year-4 and h.year-2) as  difference
  from test_vol as h;quit;

Ksharp

kt_uwa1990 · Posted 11-06-2012 05:51 AM

Hi Ksharp

Yes I do understand your code however this takes a very long time to execute for me (7+ hours). Although this could be due to a weak processing power on my PC. May I ask are you able to obtain results using the code above?

If so, could you please post the results in a CSV file and I can have a look?

Many thanks

Ksharp · Posted 11-06-2012 06:04 AM

That is because your data is a lttle big (260,000+ ).

if you can pick up some obs from it, you will see the result .

A fast way is spliting it into lots of sub-table. then use the code for each sub-table respectively.

data test_vol;

set test_vol(obs=10000);

run;

消息编辑者为：xia keshan

kt_uwa1990 · Posted 11-06-2012 06:54 AM

Ok I will try smaller sample periods. You will notice this sample differs to my last post, the dates in the sample above are integers (rather than sas dates). With that in mind, will your code still work?

data_null__ · Posted 11-06-2012 07:15 AM

It doesn't seem like you have all that much data to me. I wrote it using LAGS and LEADS computed as LAGS after descending sort on YEAR and it only takes a second or so including all that sorting.

You could try similar using this to get started

http://support.sas.com/kb/24/694.html

kt_uwa1990 · Posted 11-06-2012 07:28 AM

Hi data_null_

I would like to say cool picture haha, I like cats.

Anyway, to business. Really, your code only took few seconds? Sounds brilliant! Could I please see the code for lags and leads? I apologize for being so brash, but I'm on a rather tight deadline with this analysis.

Katy

data_null__ · Posted 11-06-2012 07:44 AM

I like cats too.

This program uses the nieve method to compute LEADS. Do you have access to SAS/ETS? If you do I think this can all be done with one call to PROC EXPAND. Anyway see if you can get this to run on your computer. The sort at the end may not be necessary depending on what your next step is.

FILENAME FT47F001 'path to csv';

data cfo /view=cfo;
   infile FT74F001 firstobs=2 dsd termstr=crlf;
   input lpermno $ year cfo;
   run;
data lags;
   set cfo;
   by lpermno;
   array _l

lag1-lag4;

   lag1 = lag1(cfo);
   lag2 = lag2(cfo);
   lag3 = lag3(cfo);
   lag4 = lag4(cfo);
   if first.lpermno then c=0;
   c+1;
   do _n_ = c to dim(_l);
      _l[_n_] = .;
      end;
   drop c lag1;
   run;
proc sort data=lags;
   by lpermno descending year;
   run;
data leads;
   set lags;
   by lpermno;
   array _l

lead1-lead3;

   lead1 = lag1(cfo);
   lead2 = lag2(cfo);
   lead3 = lag3(cfo);
   if first.lpermno then c=0;
   c+1;
   do _n_ = c to dim(_l);
      _l[_n_] = .;
      end;
/*(Average CFO in years +1 to +3) - (Average CFO in years -4 to -2) = Difference*/
   difference = mean(of lead1-lead3) - mean(of lag4-lag2);
   drop c;
   run;
proc sort data=leads;
   by lpermno year;
   run;

kt_uwa1990 · Posted 11-06-2012 08:06 AM

Hi data_null

Thank you! The code worked swimmingly! Extremely fast too! The code produced two databases: lags and leads. From my interpretation, the difference column provided in the leads database is what I'm after. Is it safe to delete lags db?

Also, I have provided a link below to a funny blog about a missing cat. I hope you find it as funny as I did!

"yeah thats not what I was looking for at all."

data_null__ · Posted 11-06-2012 08:39 AM

Yes the data set LAGS can be discarded.

Ksharp · Posted 11-06-2012 08:44 AM

One thing you should consider is when there are some unconsecutive year in your data, NULL's code would not work.

1998

2002 <-missing 1999 2000 2001

2003

2005

2006

Haikuo · Posted 11-06-2012 09:23 AM

That case, you can use proc timeseries(if you have ETS) to fill the gap first, then use DN's code. If you don't have ETS, you can always fill it up using datastep in needed.

This is where I see not the SQL's strength.

Haikuo

data_null__ · Posted 11-06-2012 09:23 AM

It is true that if the data should be expanded to work properly then my program will need to be modified to expand the data for each LPERMNO. There are also duplicate years. Messy.

data cfoV / view=cfoV;
   infile FT74F001 firstobs=2 dsd termstr=crlf;
   input lpermno $ year cfo;
   run; 
proc sort out=cfo2 nodupkey; 
   by lpermno year;
   run; 
proc summary data=cfo2;
   by lpermno;
   output out=span min(year)=min max(year)=max; 
   run; 
data frameV / view=framev;
   set span;
   do year = min to max;
      output; 
      end; 
   run; 
data cfo2;
   merge framev cfo2;
   by lpermno year;
   run; 
data lags;
   set cfo2;
   by lpermno;
   array _l lag1-lag4;
   lag1 = lag1(cfo);
   lag2 = lag2(cfo);
   lag3 = lag3(cfo);
   lag4 = lag4(cfo);
   if first.lpermno then c=0; 
   c+1; 
   do _n_ = c to dim(_l);
      _l[_n_] = .; 
      end; 
   drop c lag1;
   run; 
proc sort data=lags;
   by lpermno descending year;
   run; 
data leads;
   set lags;
   by lpermno;
   array _l lead1-lead3;
   lead1 = lag1(cfo);
   lead2 = lag2(cfo);
   lead3 = lag3(cfo);
   if first.lpermno then c=0; 
   c+1; 
   do _n_ = c to dim(_l);
      _l[_n_] = .; 
      end; 
/*(Average CFO in years +1 to +3)  - (Average CFO in years -4 to -2) = Difference*/
   difference = mean(of lead1-lead3) - mean(of lag4-lag2);
   drop c;
   run; 
proc sort data=leads;
   by lpermno year;
   run;

kt_uwa1990 · Posted 11-06-2012 10:41 AM

Thanks everyone.

Just one more step, I have merged the desired data back to my larger database. Mainly I want to compare the differences of individual observations against their industries. Please see attachment.

First, I need to calculate the median 'difference' for each industry. Note the industry classification is represented by a 2-3 digit number in the sicind column.

If the permnos (or individual observation) difference is greater than its industry's median difference (indiff) then a 1 appears in a new column entitled 'Maturity'.

If the permnos (or individual observation) difference is less than its industry's median difference (indfiff) then a 0 appears in a new column entitled 'Maturity'

I was thinking of using something like the code below, what do you guys think?

proc summary data = stage3;

by sicind;

var difference; output out = test median = difference;

run;

Ksharp · Posted 11-06-2012 08:51 AM

If you want fastest speed, I think HashTable can do it. But That will produce horrible long code .I am afraid that I have not so much time to code it.

If I were you ,I will use HashTable splite it into lots of small tables ,then call execute() to get the result and using proc append to combine them all together.

Ksharp

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away