Solved: Re: calculate a new variable using variables * cases

JOLSAS · Posted 08-11-2011 09:37 PM

I have a data set looking like this:

year	shr	opt
1992	2.4	6
1992	2.3	.
1992	.	86
1993	.	6.5
1993	.	.
1993	2.3	0
1993	2.6	0
1993	3.5	8.9

I want to output a new table looking like this:

year	new variable
1992
1993

where the new variable is the average of shr * opt in all the years.

e.g. the new variable for 1992 equals (2.4 * 6 + 2.3 * . + . * 86)/3

and the new variable for 1993 equals ( . * 6.5 + . * . + 2.3 * 0 + 2.6 * 0 + 3.5 * 8.9)/5

I want to output the new variable for each year.

Any idea how to do this?

Thanks.

DBailey · Posted 08-12-2011 04:02 PM

By the way...including null values like you're doing effectively sets the value for that row to zero and includes it. That's why the normal average function excludes null values as they aren't technically the same as zero.

proc sql;

select

gvkey,

year,

sum(shr*opt)/count(*) as NewVar

from work.detail_data

group by gvkey, year;

quit;

View solution in original post

art297 · Posted 08-12-2011 01:03 AM

There are definitely fancier ways of doing this but, hopefully, this will be more useful in showing you how such results are calculated.

data have;

input year shr opt;

cards;

1992 2.4 6

1992 2.3 .

1992 . 86

1993 . 6.5

1993 . .

1993 2.3 0

1993 2.6 0

1993 3.5 8.9

;

run;

data temp (keep=year data);

set have;

if shr > 0 then do;

data=shr;

output;

end;

if opt > 0 then do;

data=opt;

output;

end;

run;

proc means data=temp nway mean;

var data;

class year;

output out=want (drop=_:)

mean=average n=count;

run;

DBailey · Posted 08-12-2011 09:13 AM

I believe this code is pretty straightforward:

data work.detail_data;

input

Year

shr

opt;

datalines;

1992 2.4 6

1992 2.3 .

1992 . 86

1993 . 6.5

1993 2.3 0

1993 2.6 0

1993 3.5 8.9

;

run;

proc sql;

select

year,

avg(shr*opt) as NewVar

from work.detail_data

group by year;

quit;

It produces this output:

Year	NewVar
1992	14.4
1993	10.38333

art297 · Posted 08-12-2011 09:21 AM

I obviously didn't notice the asterisks in the definition. However, you should be able to modify that yourself and now have two methods and can choose depending upon your preferences.

But, now that I know you want to multiply the two values, what effect to you want missing values to have? What if both values are missing? I.e., should missing and/or zero values be included, should they cancel out the results for the entire equation, and/or should they be used or dismissed in the numerator? .. in the denominator?

DBailey · Posted 08-12-2011 09:34 AM

I think you're right. The standard average function doesn't calculate the same way the OP's formula does. To accomplish that, I think he would have to change my formula from avg(shr*opt) to sum(shr*opt)/count(*).

JOLSAS · Posted 08-12-2011 03:18 PM

In the example, ( . * 6.5 + . * . + 2.3 * 0 + 2.6 * 0 + 3.5 * 8.9)/5 would equal to (. + . + 0 + 0 + 3.5 * 8.9)/5 = 3.5 * 8.9 / 5 .

Does this make sense? Basically, if either of the two existing variables is missing, this makes the product of that line a missing value. But I think this wouldn't affect the entire equation. In the case of zeores, the product would be zero. But since I don't know for sure if the equation for each year contains how many missing values/zeroes/"good" values, I have to count all of them.

DBailey · Posted 08-12-2011 03:43 PM

I think the normal SAS average function excludes null values from the denominator as well. So, your calculation would end up being divided by 3 instead of 5. If that's what you want, then you could use it. If not, then you'll have to calculate it as sum(shr*opt)/count(*) which would equate to 3.5*8.9.5

JOLSAS · Posted 08-12-2011 03:57 PM

Thanks. Including null values is what I want. I ran your code, and it gave me very good results.

Now the thing is, a few different companies have these year-data. It;s like this:

gvkey	year	shr	opt
105	1992	2.4	6
105	1992	2.3	.
105	1992	.	86
105	1993	.	.
105	1993	2.3	0
105	1993	3.5	8.9
258	1992	.	3.5
258	1992	3.5	0
258	1993	6	8.9
258	1994	.	.
258	1994	0	2.3

If only group by year, the sas would return me average values for all the company in each year. What if I want average value of each company in each year?

In the case above, the output would be like:

gvkey	year	newvar
105	1992
105	1993
258	1992
258	1993
258	1994

Any ideas? Thank you.

BTW, sometimes sql does lots of things with relatively simple codes! A good idea to use sql in this case!

DBailey · Posted 08-12-2011 04:02 PM

By the way...including null values like you're doing effectively sets the value for that row to zero and includes it. That's why the normal average function excludes null values as they aren't technically the same as zero.

proc sql;

select

gvkey,

year,

sum(shr*opt)/count(*) as NewVar

from work.detail_data

group by gvkey, year;

quit;

JOLSAS · Posted 08-12-2011 04:10 PM

I thought it was something like this! I tried using & but it gave me error.

Thank you. It did give me what I want. Do you have any idea how to output the result in a new table?

Is it something like

proc sql;

create table work.newtable as select * from ........?

Thanks.

art297 · Posted 08-12-2011 04:12 PM

JolSAS,

Yes, sometimes SQL can do things that are difficult in a datastep. Of course, sometimes one can do things in a datastep that are difficult if not impossible to do with SQL. However, my bigger concern if I were you, is what are the two values you are multiplying. With your chosen computational technique you may be throwing some babies out with the bath water. Similarly, one of the values may be meant as a weight. The forum doesn't know what the data represent.

art297 · Posted 08-12-2011 04:14 PM

Yes, you only need to insert the line

create table work.newtable as

before your select statement

If you don't want the output shown on your screen, change the proc sql; line to

proc sql noprint;

JOLSAS · Posted 08-12-2011 04:34 PM

The shr is shares of options owned and the opt is value realized from option exercising.

Yes, shr is a weight and I'm trying to compute the weighted average over the years. The reason that one company, in one year, could have multiple lines, is because one company, in one year, have multiple executives. One executive could have no share of options, while another could have some. Therefore, there are numbers and missing values. As a matter of fact, I cannot be 100% sure on this either. I think the best I can do is to try both sum/count and the average function, and see which produces the better result. In fact, I just asked my professor for his opinion.

SAS Innovate 2025: Call for Content