Hello,
Using SAS 9.4, I have a data set of 7000 rows and 200 variables. I have an ID and each ID has multiple observations. The data set has the date the data last recorded. Across the 200 variables not all data was updated for each new date last recorded. By the ID, I would like to fill all variables in the last row (sorted to have the most recently updated date last) with missing data (the data could be character or numeric) with the non-missing data from the previous rows. In the code in the first 2 rows in each ID are an example of how the data could look and the 3rd row in the ID is how I would want the finished product to look. My question is if there is a reasonably efficient way to accomplish this while only filling values into cells with missing values? Thank you
data want;
input id sex $ race $ age date_last_completed: mmddyy10.;
format date_last_completed mmddyy10.;
datalines;
1 Male White 50 .
1 . White . 03/02/2024
1 Male White 50 03/02/2024
2 Female . . 03/03/2024
2 . Black 50 .
2 Female Black 50 03/03/2024
;
run;
The easiest way is to use the UPDATE statement. That needs two datasets, current values and transactions. But you can use an empty version of your data as the current values and then treat all of the data as the transactions.
To avoid collapsing to a single observation per BY group just add an OUTPUT statement.
data have;
input id sex $ race $ age date_last_completed :mmddyy10.;
format date_last_completed mmddyy10.;
datalines;
1 Male White 50 .
1 . White . 03/02/2024
1 Male White 50 03/02/2024
2 Female . . 03/03/2024
2 . Black 50 .
2 Female Black 50 03/03/2024
;
proc print;
run;
data want;
update have(obs=0) have;
by id;
output;
run;
proc print;
run;
Without the explicit OUTPUT statement you would get :
A caveat about a process like this where you have AGE and, I suspect in real data, different dates. Age is date dependent in common usage. So if you pull an Age from a previous record you might be off of the actual age for a different date. The longer the interval between the current record and the previous with age populated the more likely to have such occur.
It may be that you don't have much concern over age at this point but it should be considered.
@ballardw wrote:
A caveat about a process like this where you have AGE and, I suspect in real data, different dates. Age is date dependent in common usage. So if you pull an Age from a previous record you might be off of the actual age for a different date. The longer the interval between the current record and the previous with age populated the more likely to have such occur.
It may be that you don't have much concern over age at this point but it should be considered.
In the situation you describe you would exclude AGE from the UPDATE and use SET statement for variables that you don't want to carry forward.
update have(obs=0 keep=ID) have(drop=age);
by ID;
set have(keep=age);
use keep and drop combined with SAS-Variable-Lists as needed to make this more concise.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.