I have a dataset with many missing values. One remedy is to fill forward the values after checking some variables. Here is an example of the data:
data example;
input Obs:$3. fund:$3. Stock:$3. Date:YYMMDD8. CUSIP:$8. ISIN:$5. number_of_shares: BEST12. ;
datalines;
1 SPY ABC 20190301 12345678 US123 15000
2 SPY ABC 20190302 12345678 US123 15100
3 SPY ABC 20190303 . . 15100
4 SPY ABC 20190304 . . 15000
5 SPY ABC 20190305 . . 15200
6 SPY ABC 20190306 . . 15231
7 SPY ABC 20190307 . . 14912
8 SPY ABC 20190308 . . 14872
9 SPY ABC 20190309 12345678 US123 15340
10 SPY ABC 20190310 12345678 US123 15000
11 SPY XYZ 20190301 88888888 CA888 500
12 SPY XYZ 20190302 88888888 CA888 510
13 SPY XYZ 20190303 . . 503
14 SPY XYZ 20190304 . . 515
15 SPY XYZ 20190305 . . 524
16 SPY XYZ 20190306 . . 552
17 SPY XYZ 20190307 . . .
18 SPY XYZ 20190308 . . .
19 SPY XYZ 20190309 . . .
20 SPY XYZ 20190310 . . 555
21 SPY XYZ 20190311 . . 555
22 SPY XYZ 20190312 . . 532
;
run;
So, as you can see, there are missing value for CUSIP and ISIN. I would like to fill forward only if 2 criteria are met: (i) the next available value for CUSIP (or ISIN) is the same as the previously available value for CUSIP (or ISIN), and (ii) the value of the variable number_of_shares does not change by 10%.
So in the case of stock ABC, as CUSIP in obs 2 is the same as CUSIP in obs 9, and the values in number_of_shares are similar so fill forward CUSIP for obs 3,4,5,6,7,8. same with ISIN.
In the case of stock XYZ, If the next available value for CUSIP is the same as the value in obs 12, then fill forward the value for CUSIP and ISIN. But in this case, fill forward the value of the number_of_shares for obs 17,18,19 too (i.e. they will take 552). However, if the next available value for CUSIP is not the same as the one in obs 12 then don't do these fill-forward.
Edit: So what I want looks like this:
data etf_example2;
input Obs:$3. fund:$3. Stock:$3. Date:YYMMDD8. CUSIP:$8. ISIN:$5. number_of_shares: BEST12. ;
datalines;
1 SPY ABC 20190301 12345678 US123 15000
2 SPY ABC 20190302 12345678 US123 15100
3 SPY ABC 20190303 12345678 US123 15100
4 SPY ABC 20190304 12345678 US123 15000
5 SPY ABC 20190305 12345678 US123 15200
6 SPY ABC 20190306 12345678 US123 15231
7 SPY ABC 20190307 12345678 US123 14912
8 SPY ABC 20190308 12345678 US123 14872
9 SPY ABC 20190309 12345678 US123 15340
10 SPY ABC 20190310 12345678 US123 15000
11 SPY XYZ 20190301 88888888 CA888 500
12 SPY XYZ 20190302 88888888 CA888 510
13 SPY XYZ 20190303 88888888 CA888 503
14 SPY XYZ 20190304 88888888 CA888 515
15 SPY XYZ 20190305 88888888 CA888 524
16 SPY XYZ 20190306 88888888 CA888 552
17 SPY XYZ 20190307 88888888 CA888 552
18 SPY XYZ 20190308 88888888 CA888 552
19 SPY XYZ 20190309 88888888 CA888 552
20 SPY XYZ 20190310 88888888 CA888 555
21 SPY XYZ 20190311 88888888 CA888 555
22 SPY XYZ 20190312 88888888 CA888 532
;
run;
So, given your posted example data (and thank you for that 🙂 ), what does your desired result look like?
Always nice to have something to aim for
I edited the question to add the desired output
This is the most weird logic i have seen for a long time.
You can use the retain-statement (with new variables) to save the values during data-step iteration. You will need to implement look-ahead for the first rule. There are multiple documents describing this pattern, i am sure you find useful paper.
Yes, but most look-ahead options are used in DATA step that read or merge in the same dataset with a specified number of lead. The issue here is that we don't know how many step(obs) to look ahead. Also it might not be optimal for the large dataset (500gb) which I have
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.