Dear Experts
I have a series of data as shown below. I would like to calculate the variance for each ID. However, the last two values for each ID is used for checking that it is valid and hence exclude for calculating the variance. It must be used within a data step because this part of processing is part of a very large data step.
xPrice
ID Price yyyymmdd
AAPL 10.6 20160105
AAPL 10.5 20160104
AAPL 10.3 20160103
AAPL 10.2 20160102
AAPL 10.1 20160101
IBM 11.6 20160105
IBM 11.5 20160104
IBM 11.3 20160103
IBM 11.2 20160102
IBM 11.1 20160101
I have written a short code:
data xPrice1; set xPrice; v = var(price, lag(price)); run;
This one is faulty:
1) I have to specify all the parameters within "var". In actual implementation, each ID will have 100+ values and I cant write 100+ lags to compute the variance and I need the last two values to be used for checking that the ID has existed before computing.
2) If I use proc means, I cant check the last 2 values and I need it to be in data step processing because it acts as an "interface" where a small part of the program (part of a data step) is visible for users while hiding the rest of the code.
Thanks
LL
If it's in a data step you need to go back to base definitions and can't use SAS functions.
Does it actually help you to compare one observation to the previous one? Don't you need 3 observations in this example (the first 3 of 5, then ignore the last 2)?
In general, there are ways to pursue this. For example, you could start with two loops that each read the data:
data want;
array exist {100} list of 100 names;
array lagged {100};
id_count=0;
do until (last.id);
do _n_=1 to 100;
lagged{_n_} = exist{_n_};
end;
set have;
by id;
id_count + 1;
end;
id_count2 = 0;
do until (last.id);
set have;
by id;
id_count2 + 1;
if id_count2 <= id_count - 2 then do;
**** whatever;
end;
else do;
**** whatever;
end;
end;
run;
So there is no way to calculate the variance for (xi-2) where xi = total number of observations for asset i?
The reason for calculating the last two is to check that the asset has existed way before and not just launched. If there are missing values in between the data, when calculating the variance, the output would be ".". So I can ignore these assets
Thank you for your help
LL
You can calculate variance for a group of observations, but you haven't written the code to do that.
If you wanted the mean for a group of observations, you would calculate the sum and N for the group, and at the end of that calculate the mean.
SImilarly, if you want the variance for a group of observations, you can calculate the sum, N, and sum of squared values for the group, and at the end of that calculate the variance.
The comment statements (****whatever;) indicate that something has to happen there but I didn't spell out what has to happen. It wasn't clear yet what needed to be calculated based on which observations.
Make a weight variable wt , assign wt=. to the last obs of each group, assign wt=1 to others. data have; input ID $ Price yyyymmdd $; cards; AAPL 10.6 20160105 AAPL 10.5 20160104 AAPL 10.3 20160103 AAPL 10.2 20160102 AAPL 10.1 20160101 IBM 11.6 20160105 IBM 11.5 20160104 IBM 11.3 20160103 IBM 11.2 20160102 IBM 11.1 20160101 ; run; data temp; merge have have(keep=id rename=(id=_id) firstobs=3); if id=_id then wt=1; else wt=.; run; proc means data=temp var; by id; weight wt; var price; run;
OK, starting over then, let's approach this from a different angle. What do you expect the final result to look like?
Dear Experts,
I have to write within the data step.
The actual program
data XXX
...
...
...
"My program here"
...
...
...
run;
If I have only 3 data points and do summation. I will write like the following (kind of interfacing):
"My program here"
%str( priceT = price + lag(price) + lag2(price); );
Now I have alot of data points and I need to find the variance of the prices and to make sure there are 2 extra data points for making sure the asset exists first:
"My program here" will look something like
%str(
if price( yyyymmdd =20160102) ^= . and price( yyyymmdd =20160101) ^= . then
varPrice = var(price(yyyymmdd =20160105), price(yyyymmdd =20160104), price(yyyymmdd =20160103));
);
This will give var(AAPL[10.6, 10.5, 10.3]) = 0.02333 and var(IBM[11.6, 11.5, 11.3]) = 0.02333.
I cant use proc means within a data-step.
Many Thanks
Suppose we actually perform the required calculations, and figure that the variance for AAPL is 0.02333. What happens with that number? Does it become a new variable in your original data set? Does it get populated for every AAPL observation, or just for the first three? (Or perhaps just for the third AAPL observation?)
Hi Astounding,
This actual program
data XXX
....
....
....
"My program here"
....
...
...
run;
It will generate a new dataset with each asset having its own variance value. It will read in all the observations for AAPL excluding the last two observations and compute the variance. For this example, there are 5 observations for AAPL. Only the first 3 are used to compute the variance and the last two observations are used as filtering. The filter works like if the last two observations do not exist, variance for APPL will be deleted.
If it is difficult,
maybe you can teach me how to write a loop for the lagN?
Example:
I want to compute variance for 4 observations:
VarP = std(price, lag(price), lag2(price), lag3(price));
The algorithm is something like
{
N= 4;
xString = "price";
for i = 2: N
xString = concatenate( xString, ",", lag"i"(price));
end
VarP = std(xString);
}
Many thanks
From the point of view of a programmer, there are many ways to compute variance (all of which get the same numeric result). You don't need LAG, you don't need to hard-code dates, you don't need to know how many records there will be. But the missing ingredient to select a method is to understand what we are trying to achieve here. Assuming we could compute the variance (no matter what the methods would be, and no matter how we identify the proper subset of observations to use), is the goal to add another column to your data, and populate it on every single observation? If that's not it, then what is the goal?
You are right.
It will create one more column for the data step.
Since it is doing rolling, there would be "." in the new column. However, the subsequent processing will only keep the latest and only one variance for each asset.
The subsequent processing in the other part of the program will remove "." and get the variance for the latest yyyymmdd and delete the rest. This part is already written.
So I try not to mess up the rest of the program and only modify the "interface" which is a string that is within the data step.
Thanks
OK, we're starting to get somewhere now. RIght now, there are 5 observations for AAPL. For those 5 observations, what should the new column be at the end of the DATA step?
(Keep in mind that you have reversed the dates, so they start with the most recent and end with the oldest.)
First it will look like this. This is where I will change the code to process the variance.
Asset Price YYYYMMDD varPrice AAPL 20.6 20160105 . AAPL 1.5 20160104 . AAPL 3.3 20160103 10.54624736 AAPL 4.2 20160102 1.374772708 AAPL 12.1 20160101 4.841831609 IBM 19.6 20160105 . IBM 7.5 20160104 . IBM 3.3 20160103 8.463056973 IBM 3.2 20160102 2.454248018 IBM 2.1 20160101 0.665832812
The rest of the data steps will make the above becomes. This is already written.
Asset Price YYYYMMDD varPrice AAPL 3.3 20160103 10.54624736 IBM 3.3 20160103 8.463056973
It's looking like the variance is computed based on the 3 most recent prices. I trust you have that code and are happy with the calculations. If you're not using the 3 most recent, we can revisit that (but it becomes more complex). Here's what you need to add to get the result in your example.
First, your DATA step must include a BY statement:
by asset;
There can be more variables in the BY statement, but the first one must be asset.
Second, you have to add this early:
if first.asset then asset_counter=1;
else asset_counter + 1;
Then calculate the variance on every observation. It's not that you need to keep the variance on the first two, but you need to make sure that the LAG function executes on every observation or else the calculations will be wrong.
Then reset the first two by adding:
if asset_counter <=2 then varPrice=.;
That should be the full set of pieces to add to your DATA step, possibly adding also:
drop asset_counter;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.