BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
PharmlyDoc
Quartz | Level 8

I have two tables, first and second, that contains data from the American Community Survey (ACS). The first table has missing values (always across 4 variables), whereas the second table contains values I wish to fill in for the missing values in the first table (through matching/joining by PAT_ID).

 

/* ACS variables at census block level */
data first;
input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@;
datalines;
1 45810 93.82 13.07 9.32
2 35264 54.94 27.81 4.76
3 . . . . 
4 . . . . 
5 42346 18.62 1.87 1.29
6 42346 41.11 4.76 10.29
7 250001 41.11 3.02 1.29
8 115290 93.82 27.81 1.87
9 215493 21.65 10.29 3.60
10 . . . . 
11 215493 15.61 27.81 3.02
12 77521 18.62 7.17 9.32
13 215493 54.94 5.95 27.81
14 . . . . 
15 115290 34.35 1.29 9.32
16 30872 34.35 10.29 7.17
17 105039 15.61 4.76 13.07
18 . . . .  
19 38964 9.55 4.76 2.45
20 38964 54.94 10.29 10.29
;

/* variables at the zipcode level (INCOME_ZIPCODE EDU_ZIPCODE UNEMPLOY_ZIPCODE HOMEOWNER_ZIPCODE)
 but named the same as above to help with merging*/ 
data second; 
input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@; 
datalines; 
1 50127 8.33 35.00 82.71 
2 43578 29.01 44.06 58.68 
3 36251 14.93 50.14 62.01 
4 136686 14.93 29.44 67.86 
5 136686 22.49 33.52 87.99 
6 43578 22.49 50.14 87.99 
7 32374 10.86 33.52 64.28 
8 39755 7.51 56.55 58.68 
9 34548 12.48 56.55 80.46 
10 34548 14.93 38.84 58.68 
11 50127 72.73 31.91 82.71 
12 50127 9.22 50.14 87.99 
13 37981 14.93 44.06 75.23 
14 61270 8.33 41.37 87.99 
15 57101 8.33 50.14 80.46 
16 36251 4.79 38.84 62.01 
17 136686 8.33 48.14 80.46 
18 37981 72.73 31.91 67.86 
19 50127 9.22 33.52 62.01 
20 136686 19.81 41.37 62.01 
;

 

/* values filled in for missing are from zip-code level instead of census block level   */

data want;

input PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK @@; datalines;

1 45810 93.82 13.07 9.32
2 35264 54.94 27.81 4.76
3 36251 14.93 50.14 62.01 
4 136686 14.93 29.44 67.86
5 42346 18.62 1.87 1.29
6 42346 41.11 4.76 10.29
7 250001 41.11 3.02 1.29
8 115290 93.82 27.81 1.87
9 215493 21.65 10.29 3.60
10 34548 14.93 38.84 58.68 
11 215493 15.61 27.81 3.02
12 77521 18.62 7.17 9.32
13 215493 54.94 5.95 27.81
14 61270 8.33 41.37 87.99
15 115290 34.35 1.29 9.32
16 30872 34.35 10.29 7.17
17 105039 15.61 4.76 13.07
18 37981 72.73 31.91 67.86  
19 38964 9.55 4.76 2.45
20 38964 54.94 10.29 10.29

;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Here's the easy way.  It assumes that both data sets are sorted, and that your data sets have just one observation per PAT_ID.  (Mismatches are OK.)

 

data want;
   update second first;
   by pad_id;
run;

For each PAT_ID, it takes data from your second data set, then replaces those values with any non-missing values from the first data set.  That's a little convoluted, but sounds like it's exactly what you need.

View solution in original post

8 REPLIES 8
mklangley
Lapis Lazuli | Level 10

Assuming you don't have duplicate PAT_IDs in your SECOND dataset, this works, using a LEFT JOIN and COALESCE():

proc sql;
    create table want as
    select f.pat_id
          ,coalesce(f.income_block, s.income_block) as income_block
          ,coalesce(f.edu_block, s.edu_block) as edu_block
          ,coalesce(f.unemploy_block, s.unemploy_block) as unemploy_block
          ,coalesce(f.homeowner_block, s.homeowner_block) as homeowner_block
    from first f
    left join second s
        on f.pat_id = s.pat_id;
quit; 

 

PharmlyDoc
Quartz | Level 8

What about if I have other variables in the first dataset that aren't in the second dataset (e.g. SEX, AGE, and a dozen more)  that I want to include in the want dataset?

PharmlyDoc
Quartz | Level 8

 

 

Alright, here is the code that I used to solved my question. 

 

PROC SQL;
    CREATE TABLE WANT AS 
        SELECT coalesce(a.PAT_ID, b.PAT_ID) as PAT_ID, coalesce(a.INCOME_BLOCK,b.INCOME_BLOCK) AS INCOME_BLOCK,
coalesce(a.EDU_BLOCK,b.EDU_BLOCK) AS EDU_BLOCK,
coalesce(a.UNEMPLOY_BLOCK,b.UNEMPLOY_BLOCK) AS UNEMPLOY_BLOCK,
coalesce(a.HOMEOWNER_BLOCK,b.HOMEOWNER_BLOCK) AS HOMEOWNER_BLOCK
            FROM first a LEFT JOIN second b
                ON a.PAT_ID = b.PAT_ID;
QUIT;

However, I'm having trouble doing the same as above but with additional variables in the first dataset such as sex and age. 

data third; 
input PAT_ID : best2. SEX :$6. AGE : best2. INCOME_BLOCK : best6. EDU_BLOCK : best4. UNEMPLOY_BLOCK : best4. HOMEOWNER_BLOCK : best4. ;
datalines;
1 Female  22 45810 93.82 13.07 9.32
2 Male   37  35264 54.94 27.81 4.76
3 Male 27     .     .     .    . 
4 Male 37     .     .     .    .
5 Female 27 42346  18.62 1.87 1.29
6 Male   28 42346  41.11 4.76 10.29
7 Female  23 250001 41.11 3.02 1.29
8 Female  21 115290 93.82 27.81 1.87
9 Female  43 215493 21.65 10.29 3.60
10 Male	 35     .     .     .    .
11 Male	 30 215493 15.61 27.81 3.02
12 Male	 21 77521  18.62 7.17  9.32
13 Male	 21 215493 54.94 5.95  27.81
14 Male	 28     .     .     .    .
15 Male	 34 115290 34.35 1.29  9.32
16 Male	 38 30872  34.35 10.29 7.17
17 Female  36 105039 15.61 4.76 13.07
18 Female 23    .     .     .    .
19 Female 34 38964 9.55	 4.76  2.45
20 Female 26 38964 54.94 10.29 10.29
;

 

Astounding
PROC Star

Here's the easy way.  It assumes that both data sets are sorted, and that your data sets have just one observation per PAT_ID.  (Mismatches are OK.)

 

data want;
   update second first;
   by pad_id;
run;

For each PAT_ID, it takes data from your second data set, then replaces those values with any non-missing values from the first data set.  That's a little convoluted, but sounds like it's exactly what you need.

PharmlyDoc
Quartz | Level 8

Actually, I should have elaborated. 

In reality my second data set has 250,000 patients, whereas the first dataset has only 100,000. 

The update statement in the data step that you mentioned gives me a table with 250,000 patients, when I only want the 100,000

Astounding
PROC Star

A small tweak to the program should take care of that:

data want;
   update second first (in=original_100k);
   by pat_id;
   if original_100k;
run;
PharmlyDoc
Quartz | Level 8

What's your source for this? Does SAS have any references to using the update statement in this manner? 

Other than this source: https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2020/5184-2020.pdf

 

And how do I maintain the original column order from the first dataset?

PAT_ID SEX AGE INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK 

instead of 

PAT_ID INCOME_BLOCK EDU_BLOCK UNEMPLOY_BLOCK HOMEOWNER_BLOCK SEX AGE 

Astounding
PROC Star

There's no source other than what you have (likely) see already.  It's the documentation on what the UPDATE statement does.

 

To keep the variables in the order of the first data set, here's a trick you can use:

 

data want;
   if 5=4 then set first;
   update second first (in=original_100k);
   by pat_id;
   if original_100k;
run;

Since 5 will never equal 4, the SET statement never executes.  But its presence is enough to define all the variables in FIRST at that point in the code.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1106 views
  • 3 likes
  • 3 in conversation