Solved: Re: Lookup table and fill in missing values (ACS data)?

PharmlyDoc · Posted 07-12-2021 01:57 PM

I have two tables, first and second, that contains data from the American Community Survey (ACS). The first table has missing values (always across 4 variables), whereas the second table contains values I wish to fill in for the missing values in the first table (through matching/joining by PAT_ID).

/* ACS variables at census block level */
data first;
input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@;
datalines;
1 45810 93.82 13.07 9.32
2 35264 54.94 27.81 4.76
3 . . . . 
4 . . . . 
5 42346 18.62 1.87 1.29
6 42346 41.11 4.76 10.29
7 250001 41.11 3.02 1.29
8 115290 93.82 27.81 1.87
9 215493 21.65 10.29 3.60
10 . . . . 
11 215493 15.61 27.81 3.02
12 77521 18.62 7.17 9.32
13 215493 54.94 5.95 27.81
14 . . . . 
15 115290 34.35 1.29 9.32
16 30872 34.35 10.29 7.17
17 105039 15.61 4.76 13.07
18 . . . .  
19 38964 9.55 4.76 2.45
20 38964 54.94 10.29 10.29
;

/* variables at the zipcode level (INCOME_ZIPCODE EDU_ZIPCODE UNEMPLOY_ZIPCODE HOMEOWNER_ZIPCODE)
 but named the same as above to help with merging*/ 
data second; 
input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@; 
datalines; 
1 50127 8.33 35.00 82.71 
2 43578 29.01 44.06 58.68 
3 36251 14.93 50.14 62.01 
4 136686 14.93 29.44 67.86 
5 136686 22.49 33.52 87.99 
6 43578 22.49 50.14 87.99 
7 32374 10.86 33.52 64.28 
8 39755 7.51 56.55 58.68 
9 34548 12.48 56.55 80.46 
10 34548 14.93 38.84 58.68 
11 50127 72.73 31.91 82.71 
12 50127 9.22 50.14 87.99 
13 37981 14.93 44.06 75.23 
14 61270 8.33 41.37 87.99 
15 57101 8.33 50.14 80.46 
16 36251 4.79 38.84 62.01 
17 136686 8.33 48.14 80.46 
18 37981 72.73 31.91 67.86 
19 50127 9.22 33.52 62.01 
20 136686 19.81 41.37 62.01 
;

/* values filled in for missing are from zip-code level instead of census block level */

data want;

input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@; datalines;

1 45810 93.82 13.07 9.32
2 35264 54.94 27.81 4.76
3 36251 14.93 50.14 62.01
4 136686 14.93 29.44 67.86
5 42346 18.62 1.87 1.29
6 42346 41.11 4.76 10.29
7 250001 41.11 3.02 1.29
8 115290 93.82 27.81 1.87
9 215493 21.65 10.29 3.60
10 34548 14.93 38.84 58.68
11 215493 15.61 27.81 3.02
12 77521 18.62 7.17 9.32
13 215493 54.94 5.95 27.81
14 61270 8.33 41.37 87.99
15 115290 34.35 1.29 9.32
16 30872 34.35 10.29 7.17
17 105039 15.61 4.76 13.07
18 37981 72.73 31.91 67.86
19 38964 9.55 4.76 2.45
20 38964 54.94 10.29 10.29

;

Astounding · Posted 07-12-2021 07:42 PM

Here's the easy way. It assumes that both data sets are sorted, and that your data sets have just one observation per PAT_ID. (Mismatches are OK.)

data want;
   update second first;
   by pad_id;
run;

For each PAT_ID, it takes data from your second data set, then replaces those values with any non-missing values from the first data set. That's a little convoluted, but sounds like it's exactly what you need.

View solution in original post

mklangley · Posted 07-12-2021 04:46 PM

Assuming you don't have duplicate PAT_IDs in your SECOND dataset, this works, using a LEFT JOIN and COALESCE():

proc sql;
    create table want as
    select f.pat_id
          ,coalesce(f.income_block, s.income_block) as income_block
          ,coalesce(f.edu_block, s.edu_block) as edu_block
          ,coalesce(f.unemploy_block, s.unemploy_block) as unemploy_block
          ,coalesce(f.homeowner_block, s.homeowner_block) as homeowner_block
    from first f
    left join second s
        on f.pat_id = s.pat_id;
quit;

PharmlyDoc · Posted 07-12-2021 05:49 PM

What about if I have other variables in the first dataset that aren't in the second dataset (e.g. SEX, AGE, and a dozen more) that I want to include in the want dataset?

PharmlyDoc · Posted 07-12-2021 05:40 PM

Alright, here is the code that I used to solved my question.

PROC SQL;
    CREATE TABLE WANT AS 
        SELECT coalesce(a.PAT_ID, b.PAT_ID) as PAT_ID, coalesce(a.INCOME_BLOCK,b.INCOME_BLOCK) AS INCOME_BLOCK,
coalesce(a.EDU_BLOCK,b.EDU_BLOCK) AS EDU_BLOCK,
coalesce(a.UNEMPLOY_BLOCK,b.UNEMPLOY_BLOCK) AS UNEMPLOY_BLOCK,
coalesce(a.HOMEOWNER_BLOCK,b.HOMEOWNER_BLOCK) AS HOMEOWNER_BLOCK
            FROM first a LEFT JOIN second b
                ON a.PAT_ID = b.PAT_ID;
QUIT;

However, I'm having trouble doing the same as above but with additional variables in the first dataset such as sex and age.

data third; 
input PAT_ID : best2. SEX :$6. AGE : best2. INCOME_BLOCK : best6. EDU_BLOCK : best4. UNEMPLOY_BLOCK : best4. HOMEOWNER_BLOCK : best4. ;
datalines;
1 Female  22 45810 93.82 13.07 9.32
2 Male   37  35264 54.94 27.81 4.76
3 Male 27     .     .     .    . 
4 Male 37     .     .     .    .
5 Female 27 42346  18.62 1.87 1.29
6 Male   28 42346  41.11 4.76 10.29
7 Female  23 250001 41.11 3.02 1.29
8 Female  21 115290 93.82 27.81 1.87
9 Female  43 215493 21.65 10.29 3.60
10 Male	 35     .     .     .    .
11 Male	 30 215493 15.61 27.81 3.02
12 Male	 21 77521  18.62 7.17  9.32
13 Male	 21 215493 54.94 5.95  27.81
14 Male	 28     .     .     .    .
15 Male	 34 115290 34.35 1.29  9.32
16 Male	 38 30872  34.35 10.29 7.17
17 Female  36 105039 15.61 4.76 13.07
18 Female 23    .     .     .    .
19 Female 34 38964 9.55	 4.76  2.45
20 Female 26 38964 54.94 10.29 10.29
;

Astounding · Posted 07-12-2021 07:42 PM

Here's the easy way. It assumes that both data sets are sorted, and that your data sets have just one observation per PAT_ID. (Mismatches are OK.)

data want;
   update second first;
   by pad_id;
run;

For each PAT_ID, it takes data from your second data set, then replaces those values with any non-missing values from the first data set. That's a little convoluted, but sounds like it's exactly what you need.

PharmlyDoc · Posted 07-12-2021 11:01 PM

Actually, I should have elaborated.

In reality my second data set has 250,000 patients, whereas the first dataset has only 100,000.

The update statement in the data step that you mentioned gives me a table with 250,000 patients, when I only want the 100,000

Astounding · Posted 07-13-2021 04:32 AM

A small tweak to the program should take care of that:

data want;
   update second first (in=original_100k);
   by pat_id;
   if original_100k;
run;

PharmlyDoc · Posted 07-13-2021 11:32 AM

What's your source for this? Does SAS have any references to using the update statement in this manner?

Other than this source: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/5184-2020.pdf

And how do I maintain the original column order from the first dataset?

PAT_ID SEX AGE INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK

instead of

PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK SEX AGE

Astounding · Posted 07-13-2021 02:24 PM

There's no source other than what you have (likely) see already. It's the documentation on what the UPDATE statement does.

To keep the variables in the order of the first data set, here's a trick you can use:

data want;
   if 5=4 then set first;
   update second first (in=original_100k);
   by pat_id;
   if original_100k;
run;

Since 5 will never equal 4, the SET statement never executes. But its presence is enough to define all the variables in FIRST at that point in the code.

Classroom Training Available!