Hi,
I have a dataset which has missing values in a column. If the missing value is in the first cell of the column it will be equal to the next value available. If the missing value is in 3rd, 4th, 5th cell, then the 3rd value will be equal to (2/3)rd of 2nd val and (1/3)rd of 6th cell, 4th value is 50% of each and so on. can someone help?
thank you.
I had to struggle with your problem description, but here's what I cam up with;
In a series beginning with SEQ=1 through some upper limit you can have missing values for YIELD_CHNG. These missing values, which can be single or consecutive, fall in 3 types of "holes"
Each series begins with SEQ=1, but the series do not have any identifier, so the program below generates a temporary identifier - _seq_series, which can be used as a by-variable in the proc transpose.
Here is code that should work as per my description of your problem:
data have;
input BUILD_DT :date9. PERIOD :$8. SEQ DU_CHNG YIELD_CHNG Expected_Output;
format build_dt date9. ;
/*LOGIC TO FILL THE VALUE*/
datalines;
30-Dec-19 3_MONTH 1 0.134371656 . 3.689629437
30-Dec-19 2_YEAR 2 0.215742831 . 3.689629437
30-Dec-19 20_YEAR 3 1.095695432 3.689629437 3.689629437
30-Dec-19 6_MONTH 1 0.15222889 . 0.65523605
30-Dec-19 1_YEAR 2 0.178037173 0.65523605 0.65523605
30-Dec-19 3_YEAR 3 0.253467586 . 0.934979885
30-Dec-19 4_YEAR 4 0.297898502 1.214723719 1.214723719
30-Dec-19 5_YEAR 5 0.348219383 . 1.476339748
30-Dec-19 7_YEAR 6 0.464638383 1.737955777 1.737955777
30-Dec-19 10_YEAR 7 0.652956401 . 1.737955777
30-Dec-19 3_MONTH 1 -0.638004856 . -0.371114946
30-Dec-19 6_MONTH 2 -0.633071866 . -0.371114946
30-Dec-19 1_YEAR 3 -0.659459184 -0.371114946 -0.371114946
30-Dec-19 2_YEAR 4 -0.70961743 -0.214424417 -0.214424417
30-Dec-19 3_YEAR 5 -0.73634249 -0.110308306 -0.110308306
30-Dec-19 4_YEAR 6 -0.738953144 0.052784926 0.052784926
30-Dec-19 5_YEAR 7 -0.721453755 0.198813346 0.198813346
30-Dec-19 7_YEAR 8 -0.652077083 0.4472606 0.4472606
30-Dec-19 10_YEAR 9 -0.516890161 1.011691712 1.011691712
30-Dec-19 20_YEAR 10 -0.212190089 2.050067817 2.050067817
30-Dec-19 3_MONTH 1 0.086031742 . 0.361956941
30-Dec-19 6_MONTH 2 0.10557385 . 0.361956941
30-Dec-19 1_YEAR 3 0.12990416 0.361956941 0.361956941
run;
data need /view=need;
set have (keep=seq yield_chng);
if seq=1 then _seq_series+1;
run;
proc transpose data=need out=trnspose prefix=_yc ;
by _seq_series;
id seq;
var yield_chng;
run;
data want (drop=_:);
set have ;
if seq=1 then do; /* Read and correct the transposed set of yield_chng values */
set trnspose (keep=_yc:);
array _yc {*} _yc: ;
/* fill in leading missings */
_x=coalesce(of _yc{*}); /* Get first non-missing */
do _s=1 by 1 while (_yc{_s}=.);
_yc{_s}=_x;
end;
do _s=dim(_yc) by -1 while (_yc{_s}=.); /* Get index of last non-missing*/
end;
if _s<dim(_yc) then do _s=_s+1 to dim(_yc); /* Fill in trailing missings */
_yc{_s}=_yc{_s-1};
end;
/* Fill in interior holes - straight line interpolation */
do while (nmiss(of _yc{*})>0);
do _before_s=1 by 1 until(_yc{_before_s+1}=.); /* find SEQ before the hole*/
end;
do _after_s=_before_s+1 by 1 while (_yc{_after_s}=.); /* find SEQ after the hole */
end;
_slope= (_yc{_after_s}-_yc{_before_s}) / (_after_s-_before_s) ;
do _s=_before_s+1 to _after_s-1; /* Traverse the hole */
_yc{_s} = _yc{_before_s} + _slope * (_s - _before_s);
end;
end;
end;
if yield_chng=. then yield_chng=_yc{seq};
run;
The NEED dataset is created just to generate the _seq_group identifier. The proc transpose generates one obs per _seq_group with the variabes named _YC1, _YC2, etc. corresponding to the SEQ values in the pre-transposed data set.
The data want step reads in the original data. When SEQ=1 it reads the transposed values _YC, which are characterized as an array, indexed by SEQ , and corrects missing values per the statement I made above. Then, for the rest of the original data, it's just a matter of copying the appropriate element of the array into YIELD_CHNG.
BUILD_DT | PERIOD | SEQ | DU_CHNG | YIELD_CHNG | Expected Output | LOGIC TO FILL THE VALUE |
30-Dec-19 | 3_MONTH | 1 | 0.134371656 | . | 3.689629437 | same as first available data |
30-Dec-19 | 2_YEAR | 2 | 0.215742831 | . | 3.689629437 | same as first available data |
30-Dec-19 | 20_YEAR | 3 | 1.095695432 | 3.689629437 | 3.689629437 | ALREADY AVAILABLE |
30-Dec-19 | 6_MONTH | 1 | 0.15222889 | . | 0.65523605 | same as first available data |
30-Dec-19 | 1_YEAR | 2 | 0.178037173 | 0.65523605 | 0.65523605 | ALREADY AVAILABLE |
30-Dec-19 | 3_YEAR | 3 | 0.253467586 | . | 0.934979885 | 50%of previous value + 50% of next value |
30-Dec-19 | 4_YEAR | 4 | 0.297898502 | 1.214723719 | 1.214723719 | ALREADY AVAILABLE |
30-Dec-19 | 5_YEAR | 5 | 0.348219383 | . | 1.476339748 | 50%of previous value + 50% of next value |
30-Dec-19 | 7_YEAR | 6 | 0.464638383 | 1.737955777 | 1.737955777 | ALREADY AVAILABLE |
30-Dec-19 | 10_YEAR | 7 | 0.652956401 | . | 1.737955777 | same as the last data |
30-Dec-19 | 3_MONTH | 1 | -0.638004856 | . | -0.371114946 | same as first available data |
30-Dec-19 | 6_MONTH | 2 | -0.633071866 | . | -0.371114946 | same as first available data |
30-Dec-19 | 1_YEAR | 3 | -0.659459184 | -0.371114946 | -0.371114946 | ALREADY AVAILABLE |
30-Dec-19 | 2_YEAR | 4 | -0.70961743 | -0.214424417 | -0.214424417 | ALREADY AVAILABLE |
30-Dec-19 | 3_YEAR | 5 | -0.73634249 | -0.110308306 | -0.110308306 | ALREADY AVAILABLE |
30-Dec-19 | 4_YEAR | 6 | -0.738953144 | 0.052784926 | 0.052784926 | ALREADY AVAILABLE |
30-Dec-19 | 5_YEAR | 7 | -0.721453755 | 0.198813346 | 0.198813346 | ALREADY AVAILABLE |
30-Dec-19 | 7_YEAR | 8 | -0.652077083 | 0.4472606 | 0.4472606 | ALREADY AVAILABLE |
30-Dec-19 | 10_YEAR | 9 | -0.516890161 | 1.011691712 | 1.011691712 | ALREADY AVAILABLE |
30-Dec-19 | 20_YEAR | 10 | -0.212190089 | 2.050067817 | 2.050067817 | ALREADY AVAILABLE |
30-Dec-19 | 3_MONTH | 1 | 0.086031742 | . | 0.361956941 | same as first available data |
30-Dec-19 | 6_MONTH | 2 | 0.10557385 | . | 0.361956941 | same as first available data |
30-Dec-19 | 1_YEAR | 3 | 0.12990416 | 0.361956941 | 0.361956941 | ALREADY AVAILABLE |
It looks like this data should have an ID variable, which would change everytime SEQ=1. Does it?
All the data is for a single date, 31dec 2019. as of now, there is no id variable here.
Is this done for more than one column?
Is it done once for the whole dataset, or once per id group in the dataset?
it is required to be done only for 1 column. the logic should pick the 'seq' column, if the first value in the sequence is missing, it should populated the next available missing value. if the sequence is of 7 numbers and values between 2-7 are missing, then
and so on.
I had to struggle with your problem description, but here's what I cam up with;
In a series beginning with SEQ=1 through some upper limit you can have missing values for YIELD_CHNG. These missing values, which can be single or consecutive, fall in 3 types of "holes"
Each series begins with SEQ=1, but the series do not have any identifier, so the program below generates a temporary identifier - _seq_series, which can be used as a by-variable in the proc transpose.
Here is code that should work as per my description of your problem:
data have;
input BUILD_DT :date9. PERIOD :$8. SEQ DU_CHNG YIELD_CHNG Expected_Output;
format build_dt date9. ;
/*LOGIC TO FILL THE VALUE*/
datalines;
30-Dec-19 3_MONTH 1 0.134371656 . 3.689629437
30-Dec-19 2_YEAR 2 0.215742831 . 3.689629437
30-Dec-19 20_YEAR 3 1.095695432 3.689629437 3.689629437
30-Dec-19 6_MONTH 1 0.15222889 . 0.65523605
30-Dec-19 1_YEAR 2 0.178037173 0.65523605 0.65523605
30-Dec-19 3_YEAR 3 0.253467586 . 0.934979885
30-Dec-19 4_YEAR 4 0.297898502 1.214723719 1.214723719
30-Dec-19 5_YEAR 5 0.348219383 . 1.476339748
30-Dec-19 7_YEAR 6 0.464638383 1.737955777 1.737955777
30-Dec-19 10_YEAR 7 0.652956401 . 1.737955777
30-Dec-19 3_MONTH 1 -0.638004856 . -0.371114946
30-Dec-19 6_MONTH 2 -0.633071866 . -0.371114946
30-Dec-19 1_YEAR 3 -0.659459184 -0.371114946 -0.371114946
30-Dec-19 2_YEAR 4 -0.70961743 -0.214424417 -0.214424417
30-Dec-19 3_YEAR 5 -0.73634249 -0.110308306 -0.110308306
30-Dec-19 4_YEAR 6 -0.738953144 0.052784926 0.052784926
30-Dec-19 5_YEAR 7 -0.721453755 0.198813346 0.198813346
30-Dec-19 7_YEAR 8 -0.652077083 0.4472606 0.4472606
30-Dec-19 10_YEAR 9 -0.516890161 1.011691712 1.011691712
30-Dec-19 20_YEAR 10 -0.212190089 2.050067817 2.050067817
30-Dec-19 3_MONTH 1 0.086031742 . 0.361956941
30-Dec-19 6_MONTH 2 0.10557385 . 0.361956941
30-Dec-19 1_YEAR 3 0.12990416 0.361956941 0.361956941
run;
data need /view=need;
set have (keep=seq yield_chng);
if seq=1 then _seq_series+1;
run;
proc transpose data=need out=trnspose prefix=_yc ;
by _seq_series;
id seq;
var yield_chng;
run;
data want (drop=_:);
set have ;
if seq=1 then do; /* Read and correct the transposed set of yield_chng values */
set trnspose (keep=_yc:);
array _yc {*} _yc: ;
/* fill in leading missings */
_x=coalesce(of _yc{*}); /* Get first non-missing */
do _s=1 by 1 while (_yc{_s}=.);
_yc{_s}=_x;
end;
do _s=dim(_yc) by -1 while (_yc{_s}=.); /* Get index of last non-missing*/
end;
if _s<dim(_yc) then do _s=_s+1 to dim(_yc); /* Fill in trailing missings */
_yc{_s}=_yc{_s-1};
end;
/* Fill in interior holes - straight line interpolation */
do while (nmiss(of _yc{*})>0);
do _before_s=1 by 1 until(_yc{_before_s+1}=.); /* find SEQ before the hole*/
end;
do _after_s=_before_s+1 by 1 while (_yc{_after_s}=.); /* find SEQ after the hole */
end;
_slope= (_yc{_after_s}-_yc{_before_s}) / (_after_s-_before_s) ;
do _s=_before_s+1 to _after_s-1; /* Traverse the hole */
_yc{_s} = _yc{_before_s} + _slope * (_s - _before_s);
end;
end;
end;
if yield_chng=. then yield_chng=_yc{seq};
run;
The NEED dataset is created just to generate the _seq_group identifier. The proc transpose generates one obs per _seq_group with the variabes named _YC1, _YC2, etc. corresponding to the SEQ values in the pre-transposed data set.
The data want step reads in the original data. When SEQ=1 it reads the transposed values _YC, which are characterized as an array, indexed by SEQ , and corrects missing values per the statement I made above. Then, for the rest of the original data, it's just a matter of copying the appropriate element of the array into YIELD_CHNG.
Thank you very much.
I am actually fetching all these values from a database and am looking forward to create a dynamic macro. I believe in my previous question, the leading and trailing holes were clear.
However, the interior hole depends on 2 things:
So, if we have 2 cases, with the interior hole with first case with 2 missing values at position 2 and 3 and second case with missing values from 2 through 6 like below
Case | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
1 | 0.65523605 | 1.214723719 | 1.476339748 | 1.737955777 | 1.737955777 | ||
2 | 0.65523605 | 1.737955777 | 1.737955777 |
then, following will be the logic of filling the values
Case | Logic to fill for case 1 ------> | 2/3 of previous and 1/3 of next available value | 1/3 of previous and 2/3 of next available value | ||||
1 | 0.65523605 | 0.84173194 | 1.028227829 | 1.214723719 | 1.476339748 | 1.737955777 | 1.737955777 |
2 | 0.65523605 | 0.925915982 | 1.088323941 | 1.304867886 | 1.467275845 | 1.737955777 | 1.737955777 |
Logic to fill for case 2 ------> | 3/4 of previous and 1/4 of next available value | 3/5 of previous and 2/5 of next available value | 2/5 of previous and 3/5 of next available value | 1/4 of previous and 3/4 of next available value |
Will the code work on any such case?
@rohitdev_ds wrote:
Thank you very much.
I am actually fetching all these values from a database and am looking forward to create a dynamic macro. I believe in my previous question, the leading and trailing holes were clear.
However, the interior hole depends on 2 things:
- count of empty holes and
- position of the interior hole that is getting filled
So, if we have 2 cases, with the interior hole with first case with 2 missing values at position 2 and 3 and second case with missing values from 2 through 6 like below
Case 1 2 3 4 5 6 7 1 0.65523605 1.214723719 1.476339748 1.737955777 1.737955777 2 0.65523605 1.737955777 1.737955777
then, following will be the logic of filling the values
Case Logic to fill for case 1 ------> 2/3 of previous and 1/3 of next available value 1/3 of previous and 2/3 of next available value 1 0.65523605 0.84173194 1.028227829 1.214723719 1.476339748 1.737955777 1.737955777 2 0.65523605 0.925915982 1.088323941 1.304867886 1.467275845 1.737955777 1.737955777 Logic to fill for case 2 ------> 3/4 of previous and 1/4 of next available value 3/5 of previous and 2/5 of next available value 2/5 of previous and 3/5 of next available value 1/4 of previous and 3/4 of next available value
Will the code work on any such case?
I suspect you were premature in marking my response as a solution.
Yes, i agree. I prematurely agreed to the solution. I need more help here. Trying to raise the concern again.
I have a column 'RATE' with me which has some missing values and i need to populate them. To fill the missing values, I need to extrapolate some values and interpolate some.
1. For a particular 'CURRENCY' the first missing values and last missing values will be exactly the same as the nearest starting or ending value for the same CURRENCY.
2. For middle missing values, I need to calculate 2 different weights.
Weight 1 = (Nearest higher PERIOD_YRS - PERIOD_YRS) / ((Nearest higher PERIOD - (Nearest lower PERIOD_YRS).
Weight 2 = 1 - Weight1
Expected value = (Weight 1 * NEAREST LOWER PERIOD_YRS) + (Weight 2 * NEAREST HIGHER PERIOD_YRS)
Here is the sample data
CURRENCY | SEQ | PERIOD | COUNTER_IN_GROUP | PERIOD_YRS | RATE | EXP RESULT | Weight1 | Weight2 |
AUD | 1 | 3_MONTH | 1 | 0.25 | 3.689629437 | |||
AUD | 1 | 2_YEAR | 2 | 2 | 3.689629437 | |||
AUD | 1 | 20_YEAR | 3 | 20 | 3.689629437 | 3.689629437 | ||
CAD | 2 | 6_MONTH | 1 | 0.5 | 0.65523605 | |||
CAD | 2 | 1_YEAR | 2 | 1 | 0.65523605 | 0.65523605 | ||
CAD | 2 | 3_YEAR | 3 | 3 | 1.028227829 | 0.333333333 | 0.666666667 | |
CAD | 2 | 4_YEAR | 4 | 4 | 1.214723719 | 1.214723719 | ||
CAD | 2 | 5_YEAR | 5 | 5 | 1.389134405 | 0.666666667 | 0.333333333 | |
CAD | 2 | 7_YEAR | 6 | 7 | 1.737955777 | 1.737955777 | ||
CAD | 2 | 10_YEAR | 7 | 10 | 1.737955777 |
Weight 1 = (4YR - 3YR)/(4YR - 1YR)
Weight 2 = (1 - Weight 1)
@rohitdev_ds wrote:
Yes, i agree. I prematurely agreed to the solution. I need more help here. Trying to raise the concern again.
I have a column 'RATE' with me which has some missing values and i need to populate them. To fill the missing values, I need to extrapolate some values and interpolate some.
1. For a particular 'CURRENCY' the first missing values and last missing values will be exactly the same as the nearest starting or ending value for the same CURRENCY.
2. For middle missing values, I need to calculate 2 different weights.
Weight 1 = (Nearest higher PERIOD_YRS - PERIOD_YRS) / ((Nearest higher PERIOD - (Nearest lower PERIOD_YRS).
Weight 2 = 1 - Weight1
Expected value = (Weight 1 * NEAREST LOWER PERIOD_YRS) + (Weight 2 * NEAREST HIGHER PERIOD_YRS)
Here is the sample data
CURRENCY SEQ PERIOD COUNTER_IN_GROUP PERIOD_YRS RATE EXP RESULT Weight1 Weight2 AUD 1 3_MONTH 1 0.25 3.689629437 AUD 1 2_YEAR 2 2 3.689629437 AUD 1 20_YEAR 3 20 3.689629437 3.689629437 CAD 2 6_MONTH 1 0.5 0.65523605 CAD 2 1_YEAR 2 1 0.65523605 0.65523605 CAD 2 3_YEAR 3 3 1.028227829 0.333333333 0.666666667 CAD 2 4_YEAR 4 4 1.214723719 1.214723719 CAD 2 5_YEAR 5 5 1.389134405 0.666666667 0.333333333 CAD 2 7_YEAR 6 7 1.737955777 1.737955777 CAD 2 10_YEAR 7 10 1.737955777
Weight 1 = (4YR - 3YR)/(4YR - 1YR)
Weight 2 = (1 - Weight 1)
Now I have to ask whether you actually examined the straight-line interpolation that I offered previously for internal holes? It exactly matches the rule you describe above.
What it doesn't match is your earlier example of weights for case 2 quoted here (see the red text😞
However, the interior hole depends on 2 things:
- count of empty holes and
- position of the interior hole that is getting filled
So, if we have 2 cases, with the interior hole with first case with 2 missing values at position 2 and 3 and second case with missing values from 2 through 6 like below
Case 1 2 3 4 5 6 7 1 0.65523605 1.214723719 1.476339748 1.737955777 1.737955777 2 0.65523605 ? ? ? ? 1.737955777 1.737955777
then, following will be the logic of filling the values
Case Logic to fill for case 1 ------> 2/3 of previous and 1/3 of next available value 1/3 of previous and 2/3 of next available value 1 0.65523605 0.84173194 1.028227829 1.214723719 1.476339748 1.737955777 1.737955777 2 0.65523605 0.925915982 1.088323941 1.304867886 1.467275845 1.737955777 1.737955777 Logic to fill for case 2 ------> 3/4 of previous and 1/4 of next available value 3/5 of previous and 2/5 of next available value 2/5 of previous and 3/5 of next available value 1/4 of previous and 3/4 of next available value
You have weights of (3/4,1/4),(3/5,2/5),(2/5,3/5),(1/4,3/4). This is absolutely NOT analogous to your sample weight rule statement. Using that rule, your weights would be
weight1 = (6-current_seq)/(6-1). With weight2=1-weight1 your rule would produce this sequence of weight pairs:
(4/5,1/5) for seq2; (3/5,2/5) for seq3; (2/5,3/5) for seq4; and (1/5,4/5) for seq5.
This is becoming frustrating, because I explicitly asked you if the weights in red above were really the weights you wanted, which I guess you didn't assess.
At least you have now provided a rule (which fits my original interpretation of a straight line). But the rule does not generate the weights that you used in the earlier examples. In other words, either you have changed rules in this example, or you didn't apply your own rule to confirm your initial example weights that you wanted respondents to reproduce.
Help us help you - there is usually a good reason for our questions.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.