I have two tables, first and second, that contains data from the American Community Survey (ACS). The first table has missing values (always across 4 variables), whereas the second table contains values I wish to fill in for the missing values in the first table (through matching/joining by PAT_ID).
/* ACS variables at census block level */
data first;
input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@;
datalines;
1 45810 93.82 13.07 9.32
2 35264 54.94 27.81 4.76
3 . . . .
4 . . . .
5 42346 18.62 1.87 1.29
6 42346 41.11 4.76 10.29
7 250001 41.11 3.02 1.29
8 115290 93.82 27.81 1.87
9 215493 21.65 10.29 3.60
10 . . . .
11 215493 15.61 27.81 3.02
12 77521 18.62 7.17 9.32
13 215493 54.94 5.95 27.81
14 . . . .
15 115290 34.35 1.29 9.32
16 30872 34.35 10.29 7.17
17 105039 15.61 4.76 13.07
18 . . . .
19 38964 9.55 4.76 2.45
20 38964 54.94 10.29 10.29
;
/* variables at the zipcode level (INCOME_ZIPCODE EDU_ZIPCODE UNEMPLOY_ZIPCODE HOMEOWNER_ZIPCODE)
but named the same as above to help with merging*/
data second;
input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@;
datalines;
1 50127 8.33 35.00 82.71
2 43578 29.01 44.06 58.68
3 36251 14.93 50.14 62.01
4 136686 14.93 29.44 67.86
5 136686 22.49 33.52 87.99
6 43578 22.49 50.14 87.99
7 32374 10.86 33.52 64.28
8 39755 7.51 56.55 58.68
9 34548 12.48 56.55 80.46
10 34548 14.93 38.84 58.68
11 50127 72.73 31.91 82.71
12 50127 9.22 50.14 87.99
13 37981 14.93 44.06 75.23
14 61270 8.33 41.37 87.99
15 57101 8.33 50.14 80.46
16 36251 4.79 38.84 62.01
17 136686 8.33 48.14 80.46
18 37981 72.73 31.91 67.86
19 50127 9.22 33.52 62.01
20 136686 19.81 41.37 62.01
;
/* values filled in for missing are from zip-code level instead of census block level */
data want;
input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@;
datalines;
1 45810 93.82 13.07 9.32
2 35264 54.94 27.81 4.76
3 36251 14.93 50.14 62.01
4 136686 14.93 29.44 67.86
5 42346 18.62 1.87 1.29
6 42346 41.11 4.76 10.29
7 250001 41.11 3.02 1.29
8 115290 93.82 27.81 1.87
9 215493 21.65 10.29 3.60
10 34548 14.93 38.84 58.68
11 215493 15.61 27.81 3.02
12 77521 18.62 7.17 9.32
13 215493 54.94 5.95 27.81
14 61270 8.33 41.37 87.99
15 115290 34.35 1.29 9.32
16 30872 34.35 10.29 7.17
17 105039 15.61 4.76 13.07
18 37981 72.73 31.91 67.86
19 38964 9.55 4.76 2.45
20 38964 54.94 10.29 10.29
;
Here's the easy way. It assumes that both data sets are sorted, and that your data sets have just one observation per PAT_ID. (Mismatches are OK.)
data want;
update second first;
by pad_id;
run;
For each PAT_ID, it takes data from your second data set, then replaces those values with any non-missing values from the first data set. That's a little convoluted, but sounds like it's exactly what you need.
Assuming you don't have duplicate PAT_IDs in your SECOND dataset, this works, using a LEFT JOIN and COALESCE():
proc sql;
create table want as
select f.pat_id
,coalesce(f.income_block, s.income_block) as income_block
,coalesce(f.edu_block, s.edu_block) as edu_block
,coalesce(f.unemploy_block, s.unemploy_block) as unemploy_block
,coalesce(f.homeowner_block, s.homeowner_block) as homeowner_block
from first f
left join second s
on f.pat_id = s.pat_id;
quit;
What about if I have other variables in the first dataset that aren't in the second dataset (e.g. SEX, AGE, and a dozen more) that I want to include in the want dataset?
Alright, here is the code that I used to solved my question.
PROC SQL;
CREATE TABLE WANT AS
SELECT coalesce(a.PAT_ID, b.PAT_ID) as PAT_ID, coalesce(a.INCOME_BLOCK,b.INCOME_BLOCK) AS INCOME_BLOCK,
coalesce(a.EDU_BLOCK,b.EDU_BLOCK) AS EDU_BLOCK,
coalesce(a.UNEMPLOY_BLOCK,b.UNEMPLOY_BLOCK) AS UNEMPLOY_BLOCK,
coalesce(a.HOMEOWNER_BLOCK,b.HOMEOWNER_BLOCK) AS HOMEOWNER_BLOCK
FROM first a LEFT JOIN second b
ON a.PAT_ID = b.PAT_ID;
QUIT;
However, I'm having trouble doing the same as above but with additional variables in the first dataset such as sex and age.
data third;
input PAT_ID : best2. SEX :$6. AGE : best2. INCOME_BLOCK : best6. EDU_BLOCK : best4. UNEMPLOY_BLOCK : best4. HOMEOWNER_BLOCK : best4. ;
datalines;
1 Female 22 45810 93.82 13.07 9.32
2 Male 37 35264 54.94 27.81 4.76
3 Male 27 . . . .
4 Male 37 . . . .
5 Female 27 42346 18.62 1.87 1.29
6 Male 28 42346 41.11 4.76 10.29
7 Female 23 250001 41.11 3.02 1.29
8 Female 21 115290 93.82 27.81 1.87
9 Female 43 215493 21.65 10.29 3.60
10 Male 35 . . . .
11 Male 30 215493 15.61 27.81 3.02
12 Male 21 77521 18.62 7.17 9.32
13 Male 21 215493 54.94 5.95 27.81
14 Male 28 . . . .
15 Male 34 115290 34.35 1.29 9.32
16 Male 38 30872 34.35 10.29 7.17
17 Female 36 105039 15.61 4.76 13.07
18 Female 23 . . . .
19 Female 34 38964 9.55 4.76 2.45
20 Female 26 38964 54.94 10.29 10.29
;
Here's the easy way. It assumes that both data sets are sorted, and that your data sets have just one observation per PAT_ID. (Mismatches are OK.)
data want;
update second first;
by pad_id;
run;
For each PAT_ID, it takes data from your second data set, then replaces those values with any non-missing values from the first data set. That's a little convoluted, but sounds like it's exactly what you need.
Actually, I should have elaborated.
In reality my second data set has 250,000 patients, whereas the first dataset has only 100,000.
The update statement in the data step that you mentioned gives me a table with 250,000 patients, when I only want the 100,000
A small tweak to the program should take care of that:
data want;
update second first (in=original_100k);
by pat_id;
if original_100k;
run;
What's your source for this? Does SAS have any references to using the update statement in this manner?
Other than this source: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/5184-2020.pdf
And how do I maintain the original column order from the first dataset?
PAT_ID SEX AGE INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK
instead of
PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK SEX AGE
There's no source other than what you have (likely) see already. It's the documentation on what the UPDATE statement does.
To keep the variables in the order of the first data set, here's a trick you can use:
data want;
if 5=4 then set first;
update second first (in=original_100k);
by pat_id;
if original_100k;
run;
Since 5 will never equal 4, the SET statement never executes. But its presence is enough to define all the variables in FIRST at that point in the code.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.