I have a data set looking like this:
year | shr | opt |
1992 | 2.4 | 6 |
1992 | 2.3 | . |
1992 | . | 86 |
1993 | . | 6.5 |
1993 | . | . |
1993 | 2.3 | 0 |
1993 | 2.6 | 0 |
1993 | 3.5 | 8.9 |
I want to output a new table looking like this:
year | new variable |
1992 | |
1993 |
where the new variable is the average of shr * opt in all the years.
e.g. the new variable for 1992 equals (2.4 * 6 + 2.3 * . + . * 86)/3
and the new variable for 1993 equals ( . * 6.5 + . * . + 2.3 * 0 + 2.6 * 0 + 3.5 * 8.9)/5
I want to output the new variable for each year.
Any idea how to do this?
Thanks.
By the way...including null values like you're doing effectively sets the value for that row to zero and includes it. That's why the normal average function excludes null values as they aren't technically the same as zero.
proc sql;
select
gvkey,
year,
sum(shr*opt)/count(*) as NewVar
from work.detail_data
group by gvkey, year;
quit;
There are definitely fancier ways of doing this but, hopefully, this will be more useful in showing you how such results are calculated.
data have;
input year shr opt;
cards;
1992 2.4 6
1992 2.3 .
1992 . 86
1993 . 6.5
1993 . .
1993 2.3 0
1993 2.6 0
1993 3.5 8.9
;
run;
data temp (keep=year data);
set have;
if shr > 0 then do;
data=shr;
output;
end;
if opt > 0 then do;
data=opt;
output;
end;
run;
proc means data=temp nway mean;
var data;
class year;
output out=want (drop=_:)
mean=average n=count;
run;
I believe this code is pretty straightforward:
data work.detail_data;
input
Year
shr
opt;
datalines;
1992 2.4 6
1992 2.3 .
1992 . 86
1993 . 6.5
1993 2.3 0
1993 2.6 0
1993 3.5 8.9
;
run;
proc sql;
select
year,
avg(shr*opt) as NewVar
from work.detail_data
group by year;
quit;
It produces this output:
Year | NewVar |
---|---|
1992 | 14.4 |
1993 | 10.38333 |
I obviously didn't notice the asterisks in the definition. However, you should be able to modify that yourself and now have two methods and can choose depending upon your preferences.
But, now that I know you want to multiply the two values, what effect to you want missing values to have? What if both values are missing? I.e., should missing and/or zero values be included, should they cancel out the results for the entire equation, and/or should they be used or dismissed in the numerator? .. in the denominator?
I think you're right. The standard average function doesn't calculate the same way the OP's formula does. To accomplish that, I think he would have to change my formula from avg(shr*opt) to sum(shr*opt)/count(*).
In the example, ( . * 6.5 + . * . + 2.3 * 0 + 2.6 * 0 + 3.5 * 8.9)/5 would equal to (. + . + 0 + 0 + 3.5 * 8.9)/5 = 3.5 * 8.9 / 5 .
Does this make sense? Basically, if either of the two existing variables is missing, this makes the product of that line a missing value. But I think this wouldn't affect the entire equation. In the case of zeores, the product would be zero. But since I don't know for sure if the equation for each year contains how many missing values/zeroes/"good" values, I have to count all of them.
I think the normal SAS average function excludes null values from the denominator as well. So, your calculation would end up being divided by 3 instead of 5. If that's what you want, then you could use it. If not, then you'll have to calculate it as sum(shr*opt)/count(*) which would equate to 3.5*8.9.5
Thanks. Including null values is what I want. I ran your code, and it gave me very good results.
Now the thing is, a few different companies have these year-data. It;s like this:
gvkey | year | shr | opt |
105 | 1992 | 2.4 | 6 |
105 | 1992 | 2.3 | . |
105 | 1992 | . | 86 |
105 | 1993 | . | . |
105 | 1993 | 2.3 | 0 |
105 | 1993 | 3.5 | 8.9 |
258 | 1992 | . | 3.5 |
258 | 1992 | 3.5 | 0 |
258 | 1993 | 6 | 8.9 |
258 | 1994 | . | . |
258 | 1994 | 0 | 2.3 |
If only group by year, the sas would return me average values for all the company in each year. What if I want average value of each company in each year?
In the case above, the output would be like:
gvkey | year | newvar |
105 | 1992 | |
105 | 1993 | |
258 | 1992 | |
258 | 1993 | |
258 | 1994 |
Any ideas? Thank you.
BTW, sometimes sql does lots of things with relatively simple codes! A good idea to use sql in this case!
By the way...including null values like you're doing effectively sets the value for that row to zero and includes it. That's why the normal average function excludes null values as they aren't technically the same as zero.
proc sql;
select
gvkey,
year,
sum(shr*opt)/count(*) as NewVar
from work.detail_data
group by gvkey, year;
quit;
I thought it was something like this! I tried using & but it gave me error.
Thank you. It did give me what I want. Do you have any idea how to output the result in a new table?
Is it something like
proc sql;
create table work.newtable as select * from ........?
Thanks.
JolSAS,
Yes, sometimes SQL can do things that are difficult in a datastep. Of course, sometimes one can do things in a datastep that are difficult if not impossible to do with SQL. However, my bigger concern if I were you, is what are the two values you are multiplying. With your chosen computational technique you may be throwing some babies out with the bath water. Similarly, one of the values may be meant as a weight. The forum doesn't know what the data represent.
Yes, you only need to insert the line
create table work.newtable as
before your select statement
If you don't want the output shown on your screen, change the proc sql; line to
proc sql noprint;
The shr is shares of options owned and the opt is value realized from option exercising.
Yes, shr is a weight and I'm trying to compute the weighted average over the years. The reason that one company, in one year, could have multiple lines, is because one company, in one year, have multiple executives. One executive could have no share of options, while another could have some. Therefore, there are numbers and missing values. As a matter of fact, I cannot be 100% sure on this either. I think the best I can do is to try both sum/count and the average function, and see which produces the better result. In fact, I just asked my professor for his opinion.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.