Hello
I have a question on how to duplicate data for subjects where data is missing on certain variables. I have a dataset where some subjects have 2 rows. Most information is unique and populated but some fields such as sex and weight may have missing data for the second record (see table 1). There is no pattern to the missingness with some subjects having sex and/or weight already populated. I am trying to replace the missing values with available values for these subjects. However, if a value exists, I would like to keep the populated values (see table 2).
Table 1. Data with missingness
SUBJID | AGE | SEX | WEIGHT | COHORT |
1 | 37 | F | 65.5 | 1 |
1 |
|
|
| 2 |
2 | 40 | M | 70 | 1 |
2 | 41 | M |
| 2 |
3 |
| M | 80.5 | 1 |
4 | 32 | M | 75.3 | 1 |
4 | 33 |
| 75.2 | 2 |
5 | 50 | F | 68 | 2 |
6 | 45 | F | 75.5 | 1 |
7 | 30 | M | 68.5 | 1 |
7 |
|
|
| 2 |
Table 2. Final table with data updated in red.
SUBJID | AGE | SEX | WEIGHT | COHORT |
1 | 37 | F | 65.5 | 1 |
1 | 37 | F | 65.5 | 2 |
2 | 40 | M | 70 | 1 |
2 | 41 | M | 70 | 2 |
3 |
| M | 80.5 | 1 |
4 | 32 | M | 75.3 | 1 |
4 | 32 | M | 75.2 | 2 |
5 | 50 | F | 68 | 2 |
6 | 45 | F | 75.5 | 1 |
7 | 30 | M | 68.5 | 1 |
7 | 30 | M | 68.5 | 2 |
I have tried to use modify (see below) – which is fine where the duplicate data is the same in the transaction dataset as the Master, but for subjects 2 (age) and 4 (weight) where there is a different value, the values in the transaction dataset update the Master. I have done some reading and it appears a condition is easy when related to the transaction dataset, but I would like the condition to apply to the Master - if data is missing in Master then replace with transaction otherwise keep value in master.
Other options are welcome.
Thanks in advance for your time!
data master; input SUBJID $ AGE SEX $ WEIGHT COHORT; datalines; 1 37 F 65.5 1 1 . . . 2 2 40 M 70 1 2 41 M . 2 3 . M 80.5 1 4 32 M 75.3 1 4 33 . 75.2 2 5 50 F 68 2 6 45 F 75.5 1 7 30 M 68.5 1 7 . . . 2 ; run; data trans; input SUBJID $ AGE SEX $ WEIGHT COHORT; datalines; 1 37 F 65.5 1 2 40 M 70 1 3 . M 80.5 1 4 32 M 75.3 1 6 45 F 75.5 1 7 30 M 68.5 1 ; run;
data trans; set master; keep subjid age sex weight; where COHORT=1; run; data master; modify master trans; by subjid; run;
The UPDATE statement is for apply transactions to an ORIGINAL dataset. When the value in the transaction is MISSING the original value is not replaced.
It is designed to have one observation per BY group in the ORIGINAL dataset.
It is designed to output one observation per BY group
But you can use your single dataset as both the ORIGINAL and the TRANSACTIONS. Use the OBS=0 dataset option to start with an empty ORIGINAL dataset and use the whole dataset as transactions.
To prevent the collapse to a single observation per BY group add an explicit OUTPUT statement.
data want;
update original(obs=0) original;
by subjid;
output;
run;
If there are some variables that did not want to have the values carried forward then you can add a SET statement to re-read those variables.
If there are some cases where the missing values are on the FIRST observation for the group you might need to do more work.
For your example it looks like you want to collapse all of the variables expect COHORT to be the same on every observation for that subject. In which case you might want to first collapse to unique observations and then merge back to get the muliple observations with different values of COHORT.
data want;
update original(obs=0) original;
by subjid;
run;
data want;
merge want(drop=cohort) orgiinal(keep=subjid cohort);
by subjid;
run;
data master;
input SUBJID $ AGE SEX $ WEIGHT COHORT;
datalines;
1 37 F 65.5 1
1 . . . 2
2 40 M 70 1
2 41 M . 2
3 . M 80.5 1
4 32 M 75.3 1
4 33 . 75.2 2
5 50 F 68 2
6 45 F 75.5 1
7 30 M 68.5 1
7 . . . 2
;
run;
data want;
update master(obs=0) master;
by subjid;
output;
run;
Thanks heaps for the suggestion. This is a perfect solution to the question I posted, but I have other variables in the dataset so I do require the additional merge step shown below.
Thanks again
The UPDATE statement is for apply transactions to an ORIGINAL dataset. When the value in the transaction is MISSING the original value is not replaced.
It is designed to have one observation per BY group in the ORIGINAL dataset.
It is designed to output one observation per BY group
But you can use your single dataset as both the ORIGINAL and the TRANSACTIONS. Use the OBS=0 dataset option to start with an empty ORIGINAL dataset and use the whole dataset as transactions.
To prevent the collapse to a single observation per BY group add an explicit OUTPUT statement.
data want;
update original(obs=0) original;
by subjid;
output;
run;
If there are some variables that did not want to have the values carried forward then you can add a SET statement to re-read those variables.
If there are some cases where the missing values are on the FIRST observation for the group you might need to do more work.
For your example it looks like you want to collapse all of the variables expect COHORT to be the same on every observation for that subject. In which case you might want to first collapse to unique observations and then merge back to get the muliple observations with different values of COHORT.
data want;
update original(obs=0) original;
by subjid;
run;
data want;
merge want(drop=cohort) orgiinal(keep=subjid cohort);
by subjid;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.