DATA Step, Macro, Functions and more

Calculating variance of each ID within a data step

Reply
Contributor
Posts: 40

Calculating variance of each ID within a data step

Dear Experts

 

I have a series of data as shown below. I would like to calculate the variance for each ID. However, the last two values for each ID is used for checking that it is valid and hence exclude for calculating the variance. It must be used within a data step because this part of processing is part of a very large data step.

 

xPrice

 

ID      Price   yyyymmdd

AAPL 10.6    20160105

AAPL 10.5    20160104

AAPL 10.3    20160103

AAPL 10.2    20160102

AAPL 10.1    20160101

IBM    11.6    20160105

IBM    11.5    20160104

IBM    11.3    20160103

IBM    11.2    20160102

IBM    11.1    20160101

 

I have written a short code:

data xPrice1; set xPrice; v = var(price, lag(price)); run;

 

 

This one is faulty:

1) I have to specify all the parameters within "var". In actual implementation, each ID will have 100+ values and I cant write 100+ lags to compute the variance and I need the last two values to be used for checking that the ID has existed before computing.

 

2) If I use proc means, I cant check the last 2 values and I need it to be in data step processing because it acts as an "interface" where a small part of the program (part of a data step) is visible for users while hiding the rest of the code.

 

Thanks

LL

Super User
Posts: 19,870

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu

If it's in a data step you need to go back to base definitions and can't use SAS functions. 

 

Super User
Posts: 5,518

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu

Does it actually help you to compare one observation to the previous one?  Don't you need 3 observations in this example (the first 3 of 5, then ignore the last 2)?

 

In general, there are ways to pursue this.  For example, you could start with two loops that each read the data:

 

 

data want;

array exist {100} list of 100 names;

array lagged {100};

id_count=0;

do until (last.id);

   do _n_=1 to 100;

      lagged{_n_} = exist{_n_};

   end;

   set have;

   by id;

   id_count + 1;

end;

id_count2 = 0;

do until (last.id);

   set have;

   by id;

   id_count2 + 1;

   if id_count2 <= id_count - 2 then do;

      **** whatever;

   end;

   else do;

      **** whatever;

   end;

end;

run;

Contributor
Posts: 40

Re: Calculating variance of each ID within a data step

Posted in reply to Astounding

So there is no way to calculate the variance for (xi-2) where xi = total number of observations for asset i? 

 

The reason for calculating the last two is to check that the asset has existed way before and not just launched. If there are missing values in between the data, when calculating the variance, the output would be ".". So I can ignore these assets

 

Thank you for your help

LL

Super User
Posts: 5,518

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu

You can calculate variance for a group of observations, but you haven't written the code to do that.

 

If you wanted the mean for a group of observations, you would calculate the sum and N for the group, and at the end of that calculate the mean.

 

SImilarly, if you want the variance for a group of observations, you can calculate the sum, N, and sum of squared values for the group, and at the end of that calculate the variance.

 

The comment statements (****whateverSmiley Wink indicate that something has to happen there but I didn't spell out what has to happen.  It wasn't clear yet what needed to be calculated based on which observations.

Super User
Posts: 10,046

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu
Make a weight variable wt , assign wt=. to the last obs of each group,
assign wt=1 to others.


data have;
input ID $     Price   yyyymmdd $;
cards;
AAPL 10.6    20160105
AAPL 10.5    20160104
AAPL 10.3    20160103
AAPL 10.2    20160102
AAPL 10.1    20160101
IBM    11.6    20160105
IBM    11.5    20160104
IBM    11.3    20160103
IBM    11.2    20160102
IBM    11.1    20160101
;
run;
data temp;
 merge have have(keep=id rename=(id=_id) firstobs=3);
 if id=_id then wt=1;
  else wt=.;
run;
proc means data=temp var;
 by id;
 weight wt;
 var price;
run;

Super User
Posts: 5,518

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu

OK, starting over then, let's approach this from a different angle.  What do you expect the final result to look like?

Contributor
Posts: 40

Re: Calculating variance of each ID within a data step

[ Edited ]
Posted in reply to Astounding

Dear Experts,

 

I have to write within the data step.

 

The actual program

data XXX

... 

...

...

"My program here"

...

...

...

run;

 

If I have only 3 data points and do summation. I will write like the following (kind of interfacing):

"My program here" 

 

%str( 
  priceT = price + lag(price) + lag2(price);
);

Now I have alot of data points and I need to find the variance of the prices and to make sure there are 2 extra data points for making sure the asset exists first:

"My program here" will look something like

%str(

            if price( yyyymmdd =20160102) ^= . and price( yyyymmdd =20160101)   ^= . then

             varPrice = var(price(yyyymmdd =20160105), price(yyyymmdd =20160104), price(yyyymmdd =20160103));

);

 

This will give var(AAPL[10.6, 10.5, 10.3]) = 0.02333 and var(IBM[11.6, 11.5, 11.3]) = 0.02333. 

 

I cant use proc means within a data-step.

 

Many Thanks

Super User
Posts: 5,518

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu

Suppose we actually perform the required calculations, and figure that the variance for AAPL is 0.02333.  What happens with that number?  Does it become a new variable in your original data set?  Does it get populated for every AAPL observation, or just for the first three?  (Or perhaps just for the third AAPL observation?)

Contributor
Posts: 40

Re: Calculating variance of each ID within a data step

Posted in reply to Astounding

Hi Astounding,

 

This actual program

data XXX

....

....

....

"My program here"

....

...

...

run;

 

It will generate a new dataset with each asset having its own variance value. It will read in all the observations for AAPL excluding the last two observations and compute the variance. For this example, there are 5 observations for AAPL. Only the first 3 are used to compute the variance and the last two observations are used as filtering. The filter works like if the last two observations do not exist, variance for APPL will be deleted. 

 

If it is difficult, 

maybe you can teach me how to write a loop for the lagN?

Example:

I want to compute variance for 4 observations:

VarP = std(price, lag(price), lag2(price), lag3(price));

 

The algorithm is something like

{

N= 4;

xString = "price";

for i = 2: N

   xString = concatenate( xString, ",", lag"i"(price));

end 

 

VarP = std(xString);

}

 

Many thanks

Super User
Posts: 5,518

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu

From the point of view of a programmer, there are many ways to compute variance (all of which get the same numeric result).  You don't need LAG, you don't need to hard-code dates, you don't need to know how many records there will be.  But the missing ingredient to select a method is to understand what we are trying to achieve here.  Assuming we could compute the variance (no matter what the methods would be, and no matter how we identify the proper subset of observations to use), is the goal to add another column to your data, and populate it on every single observation?  If that's not it, then what is the goal?

Contributor
Posts: 40

Re: Calculating variance of each ID within a data step

Posted in reply to Astounding

You are right.

 

It will create one more column for the data step.

 

Since it is doing rolling, there would be "." in the new column. However, the subsequent processing will only keep the latest and only one variance for each asset. 

 

The subsequent processing in the other part of the program will remove "." and get the variance for the latest yyyymmdd and delete the rest. This part is already written.

 

So I try not to mess up the rest of the program and only modify the "interface" which is a string that is within the data step.

 

Thanks

Super User
Posts: 5,518

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu

OK, we're starting to get somewhere now.  RIght now, there are 5 observations for AAPL.  For those 5 observations, what should the new column be at the end of the DATA step?

 

(Keep in mind that you have reversed the dates, so they start with the most recent and end with the oldest.)

Contributor
Posts: 40

Re: Calculating variance of each ID within a data step

[ Edited ]
Posted in reply to Astounding

First it will look like this. This is where I will change the code to process the variance.

 

Asset Price YYYYMMDD				varPrice
AAPL 20.6    20160105				.
AAPL 1.5    20160104				.
AAPL 3.3    20160103				10.54624736
AAPL 4.2    20160102				1.374772708
AAPL 12.1    20160101				4.841831609
IBM    19.6    20160105				.
IBM    7.5    20160104				.
IBM    3.3    20160103				8.463056973
IBM    3.2    20160102				2.454248018
IBM    2.1    20160101				0.665832812

 

The rest of the data steps will make the above becomes. This is already written.

 

Asset Price YYYYMMDD				varPrice
AAPL 3.3    20160103				10.54624736
IBM    3.3    20160103				8.463056973
Super User
Posts: 5,518

Re: Calculating variance of each ID within a data step

Posted in reply to CheerfulChu

It's looking like the variance is computed based on the 3 most recent prices.  I trust you have that code and are happy with the calculations.  If you're not using the 3 most recent, we can revisit that (but it becomes more complex).  Here's what you need to add to get the result in your example.

 

First, your DATA step must include a BY statement:

 

by asset;

 

There can be more variables in the BY statement, but the first one must be asset.

 

Second, you have to add this early:

 

if first.asset then asset_counter=1;

else asset_counter + 1;

 

Then calculate the variance on every observation.  It's not that you need to keep the variance on the first two, but you need to make sure that the LAG function executes on every observation or else the calculations will be wrong.

 

Then reset the first two by adding:

 

if asset_counter <=2 then varPrice=.;

 

That should be the full set of pieces to add to your DATA step, possibly adding also:

 

drop asset_counter;

Ask a Question
Discussion stats
  • 15 replies
  • 1259 views
  • 2 likes
  • 4 in conversation