turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- Data Management
- /
- Forum
- /
- calculate a new variable using variables * cases

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-11-2011 09:37 PM

I have a data set looking like this:

year | shr | opt |

1992 | 2.4 | 6 |

1992 | 2.3 | . |

1992 | . | 86 |

1993 | . | 6.5 |

1993 | . | . |

1993 | 2.3 | 0 |

1993 | 2.6 | 0 |

1993 | 3.5 | 8.9 |

I want to output a new table looking like this:

year | new variable |

1992 | |

1993 |

where the new variable is the average of shr * opt in all the years.

e.g. the new variable for 1992 equals (2.4 * 6 + 2.3 * . + . * 86)/3

and the new variable for 1993 equals ( . * 6.5 + . * . + 2.3 * 0 + 2.6 * 0 + 3.5 * 8.9)/5

I want to output the new variable for each year.

Any idea how to do this?

Thanks.

Accepted Solutions

Solution

08-12-2011
04:02 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 04:02 PM

**By the way...including null values like you're doing effectively sets the value for that row to zero and includes it. That's why the normal average function excludes null values as they aren't technically the same as zero.**

** **

** **

**proc** **sql**;

select

gvkey,

year,

sum(shr*opt)/count(*) as NewVar

from work.detail_data

group by gvkey, year;

**quit**;

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 01:03 AM

There are definitely fancier ways of doing this but, hopefully, this will be more useful in showing you how such results are calculated.

data have;

input year shr opt;

cards;

1992 2.4 6

1992 2.3 .

1992 . 86

1993 . 6.5

1993 . .

1993 2.3 0

1993 2.6 0

1993 3.5 8.9

;

run;

data temp (keep=year data);

set have;

if shr > 0 then do;

data=shr;

output;

end;

if opt > 0 then do;

data=opt;

output;

end;

run;

proc means data=temp nway mean;

var data;

class year;

output out=want (drop=_

mean=average n=count;

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 09:13 AM

I believe this code is pretty straightforward:

**data** work.detail_data;

input

Year

shr

opt;

datalines;

1992 2.4 6

1992 2.3 .

1992 . 86

1993 . 6.5

1993 2.3 0

1993 2.6 0

1993 3.5 8.9

;

**run**;

**proc** **sql**;

select

year,

avg(shr*opt) as NewVar

from work.detail_data

group by year;

**quit**;

It produces this output:

Year | NewVar |
---|---|

1992 | 14.4 |

1993 | 10.38333 |

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 09:21 AM

I obviously didn't notice the asterisks in the definition. However, you should be able to modify that yourself and now have two methods and can choose depending upon your preferences.

But, now that I know you want to multiply the two values, what effect to you want missing values to have? What if both values are missing? I.e., should missing and/or zero values be included, should they cancel out the results for the entire equation, and/or should they be used or dismissed in the numerator? .. in the denominator?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 09:34 AM

I think you're right. The standard average function doesn't calculate the same way the OP's formula does. To accomplish that, I think he would have to change my formula from avg(shr*opt) to sum(shr*opt)/count(*).

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 03:18 PM

In the example, ( . * 6.5 + . * . + 2.3 * 0 + 2.6 * 0 + 3.5 * 8.9)/5 would equal to (. + . + 0 + 0 + 3.5 * 8.9)/5 = 3.5 * 8.9 / 5 .

Does this make sense? Basically, if either of the two existing variables is missing, this makes the product of that line a missing value. But I think this wouldn't affect the entire equation. In the case of zeores, the product would be zero. But since I don't know for sure if the equation for each year contains how many missing values/zeroes/"good" values, I have to count all of them.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 03:43 PM

I think the normal SAS average function excludes null values from the denominator as well. So, your calculation would end up being divided by 3 instead of 5. If that's what you want, then you could use it. If not, then you'll have to calculate it as sum(shr*opt)/count(*) which would equate to 3.5*8.9.5

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 03:57 PM

Thanks. Including null values is what I want. I ran your code, and it gave me very good results.

Now the thing is, a few different companies have these year-data. It;s like this:

gvkey | year | shr | opt |

105 | 1992 | 2.4 | 6 |

105 | 1992 | 2.3 | . |

105 | 1992 | . | 86 |

105 | 1993 | . | . |

105 | 1993 | 2.3 | 0 |

105 | 1993 | 3.5 | 8.9 |

258 | 1992 | . | 3.5 |

258 | 1992 | 3.5 | 0 |

258 | 1993 | 6 | 8.9 |

258 | 1994 | . | . |

258 | 1994 | 0 | 2.3 |

If only group by year, the sas would return me average values for all the company in each year. What if I want average value of each company in each year?

In the case above, the output would be like:

gvkey | year | newvar |

105 | 1992 | |

105 | 1993 | |

258 | 1992 | |

258 | 1993 | |

258 | 1994 |

Any ideas? Thank you.

BTW, sometimes sql does lots of things with relatively simple codes! A good idea to use sql in this case!

Solution

08-12-2011
04:02 PM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 04:02 PM

**By the way...including null values like you're doing effectively sets the value for that row to zero and includes it. That's why the normal average function excludes null values as they aren't technically the same as zero.**

** **

** **

**proc** **sql**;

select

gvkey,

year,

sum(shr*opt)/count(*) as NewVar

from work.detail_data

group by gvkey, year;

**quit**;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 04:10 PM

I thought it was something like this! I tried using & but it gave me error.

Thank you. It did give me what I want. Do you have any idea how to output the result in a new table?

Is it something like

proc sql;

create table work.newtable as select * from ........?

Thanks.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 04:12 PM

JolSAS,

Yes, sometimes SQL can do things that are difficult in a datastep. Of course, sometimes one can do things in a datastep that are difficult if not impossible to do with SQL. However, my bigger concern if I were you, is what are the two values you are multiplying. With your chosen computational technique you may be throwing some babies out with the bath water. Similarly, one of the values may be meant as a weight. The forum doesn't know what the data represent.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 04:14 PM

Yes, you only need to insert the line

create table work.newtable as

before your select statement

If you don't want the output shown on your screen, change the proc sql; line to

proc sql noprint;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

08-12-2011 04:34 PM

The shr is shares of options owned and the opt is value realized from option exercising.

Yes, shr is a weight and I'm trying to compute the weighted average over the years. The reason that one company, in one year, could have multiple lines, is because one company, in one year, have multiple executives. One executive could have no share of options, while another could have some. Therefore, there are numbers and missing values. As a matter of fact, I cannot be 100% sure on this either. I think the best I can do is to try both sum/count and the average function, and see which produces the better result. In fact, I just asked my professor for his opinion.