Re: Calculating variance of each ID within a data step

CheerfulChu · Posted 09-26-2016 07:27 PM

Dear Experts

I have a series of data as shown below. I would like to calculate the variance for each ID. However, the last two values for each ID is used for checking that it is valid and hence exclude for calculating the variance. It must be used within a data step because this part of processing is part of a very large data step.

xPrice

ID Price yyyymmdd

AAPL 10.6 20160105

AAPL 10.5 20160104

AAPL 10.3 20160103

AAPL 10.2 20160102

AAPL 10.1 20160101

IBM 11.6 20160105

IBM 11.5 20160104

IBM 11.3 20160103

IBM 11.2 20160102

IBM 11.1 20160101

I have written a short code:

data xPrice1; set xPrice; v = var(price, lag(price)); run;

This one is faulty:

1) I have to specify all the parameters within "var". In actual implementation, each ID will have 100+ values and I cant write 100+ lags to compute the variance and I need the last two values to be used for checking that the ID has existed before computing.

2) If I use proc means, I cant check the last 2 values and I need it to be in data step processing because it acts as an "interface" where a small part of the program (part of a data step) is visible for users while hiding the rest of the code.

Thanks

LL

Reeza · Posted 09-26-2016 07:50 PM

If it's in a data step you need to go back to base definitions and can't use SAS functions.

Astounding · Posted 09-26-2016 08:11 PM

Does it actually help you to compare one observation to the previous one? Don't you need 3 observations in this example (the first 3 of 5, then ignore the last 2)?

In general, there are ways to pursue this. For example, you could start with two loops that each read the data:

data want;

array exist {100} list of 100 names;

array lagged {100};

id_count=0;

do until (last.id);

do _n_=1 to 100;

lagged{_n_} = exist{_n_};

end;

set have;

by id;

id_count + 1;

end;

id_count2 = 0;

do until (last.id);

set have;

by id;

id_count2 + 1;

if id_count2 <= id_count - 2 then do;

**** whatever;

end;

else do;

**** whatever;

end;

run;

CheerfulChu · Posted 09-26-2016 10:08 PM

So there is no way to calculate the variance for (xi-2) where xi = total number of observations for asset i?

The reason for calculating the last two is to check that the asset has existed way before and not just launched. If there are missing values in between the data, when calculating the variance, the output would be ".". So I can ignore these assets

Thank you for your help

LL

Astounding · Posted 09-26-2016 11:16 PM

You can calculate variance for a group of observations, but you haven't written the code to do that.

If you wanted the mean for a group of observations, you would calculate the sum and N for the group, and at the end of that calculate the mean.

SImilarly, if you want the variance for a group of observations, you can calculate the sum, N, and sum of squared values for the group, and at the end of that calculate the variance.

The comment statements (****whatever;) indicate that something has to happen there but I didn't spell out what has to happen. It wasn't clear yet what needed to be calculated based on which observations.

Ksharp · Posted 09-27-2016 05:57 AM

Make a weight variable wt , assign wt=. to the last obs of each group,
assign wt=1 to others.


data have;
input ID $     Price   yyyymmdd $;
cards;
AAPL 10.6    20160105
AAPL 10.5    20160104
AAPL 10.3    20160103
AAPL 10.2    20160102
AAPL 10.1    20160101
IBM    11.6    20160105
IBM    11.5    20160104
IBM    11.3    20160103
IBM    11.2    20160102
IBM    11.1    20160101
;
run;
data temp;
 merge have have(keep=id rename=(id=_id) firstobs=3);
 if id=_id then wt=1;
  else wt=.;
run;
proc means data=temp var;
 by id;
 weight wt;
 var price;
run;

Astounding · Posted 09-27-2016 07:49 AM

OK, starting over then, let's approach this from a different angle. What do you expect the final result to look like?

CheerfulChu · Posted 09-27-2016 03:53 PM

Dear Experts,

I have to write within the data step.

The actual program

data XXX

...

"My program here"

...

run;

If I have only 3 data points and do summation. I will write like the following (kind of interfacing):

"My program here"

%str( 
  priceT = price + lag(price) + lag2(price);
);

Now I have alot of data points and I need to find the variance of the prices and to make sure there are 2 extra data points for making sure the asset exists first:

"My program here" will look something like

%str(

if price( yyyymmdd =20160102) ^= . and price( yyyymmdd =20160101) ^= . then

varPrice = var(price(yyyymmdd =20160105), price(yyyymmdd =20160104), price(yyyymmdd =20160103));

);

This will give var(AAPL[10.6, 10.5, 10.3]) = 0.02333 and var(IBM[11.6, 11.5, 11.3]) = 0.02333.

I cant use proc means within a data-step.

Many Thanks

Astounding · Posted 09-27-2016 04:44 PM

Suppose we actually perform the required calculations, and figure that the variance for AAPL is 0.02333. What happens with that number? Does it become a new variable in your original data set? Does it get populated for every AAPL observation, or just for the first three? (Or perhaps just for the third AAPL observation?)

CheerfulChu · Posted 09-27-2016 05:11 PM

Hi Astounding,

This actual program

data XXX

....

"My program here"

....

...

run;

It will generate a new dataset with each asset having its own variance value. It will read in all the observations for AAPL excluding the last two observations and compute the variance. For this example, there are 5 observations for AAPL. Only the first 3 are used to compute the variance and the last two observations are used as filtering. The filter works like if the last two observations do not exist, variance for APPL will be deleted.

If it is difficult,

maybe you can teach me how to write a loop for the lagN?

Example:

I want to compute variance for 4 observations:

VarP = std(price, lag(price), lag2(price), lag3(price));

The algorithm is something like

{

N= 4;

xString = "price";

for i = 2: N

xString = concatenate( xString, ",", lag"i"(price));

end

VarP = std(xString);

}

Many thanks

Astounding · Posted 09-27-2016 05:17 PM

From the point of view of a programmer, there are many ways to compute variance (all of which get the same numeric result). You don't need LAG, you don't need to hard-code dates, you don't need to know how many records there will be. But the missing ingredient to select a method is to understand what we are trying to achieve here. Assuming we could compute the variance (no matter what the methods would be, and no matter how we identify the proper subset of observations to use), is the goal to add another column to your data, and populate it on every single observation? If that's not it, then what is the goal?

CheerfulChu · Posted 09-27-2016 05:25 PM

You are right.

It will create one more column for the data step.

Since it is doing rolling, there would be "." in the new column. However, the subsequent processing will only keep the latest and only one variance for each asset.

The subsequent processing in the other part of the program will remove "." and get the variance for the latest yyyymmdd and delete the rest. This part is already written.

So I try not to mess up the rest of the program and only modify the "interface" which is a string that is within the data step.

Thanks

Astounding · Posted 09-27-2016 05:35 PM

OK, we're starting to get somewhere now. RIght now, there are 5 observations for AAPL. For those 5 observations, what should the new column be at the end of the DATA step?

(Keep in mind that you have reversed the dates, so they start with the most recent and end with the oldest.)

CheerfulChu · Posted 09-27-2016 06:18 PM

First it will look like this. This is where I will change the code to process the variance.

Asset Price YYYYMMDD				varPrice
AAPL 20.6    20160105				.
AAPL 1.5    20160104				.
AAPL 3.3    20160103				10.54624736
AAPL 4.2    20160102				1.374772708
AAPL 12.1    20160101				4.841831609
IBM    19.6    20160105				.
IBM    7.5    20160104				.
IBM    3.3    20160103				8.463056973
IBM    3.2    20160102				2.454248018
IBM    2.1    20160101				0.665832812

The rest of the data steps will make the above becomes. This is already written.

Asset Price YYYYMMDD				varPrice
AAPL 3.3    20160103				10.54624736
IBM    3.3    20160103				8.463056973

Astounding · Posted 09-27-2016 07:35 PM

It's looking like the variance is computed based on the 3 most recent prices. I trust you have that code and are happy with the calculations. If you're not using the 3 most recent, we can revisit that (but it becomes more complex). Here's what you need to add to get the result in your example.

First, your DATA step must include a BY statement:

by asset;

There can be more variables in the BY statement, but the first one must be asset.

Second, you have to add this early:

if first.asset then asset_counter=1;

else asset_counter + 1;

Then calculate the variance on every observation. It's not that you need to keep the variance on the first two, but you need to make sure that the LAG function executes on every observation or else the calculations will be wrong.

Then reset the first two by adding:

if asset_counter <=2 then varPrice=.;

That should be the full set of pieces to add to your DATA step, possibly adding also:

drop asset_counter;

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away