Re: Merge/left join onto certain rows

lnicholl9 · Posted 07-19-2017 02:57 PM

Here is the scenario: I have a dataset with hospital identifiers, years, and a cost to charge ratio. For the year 2003, no cost to charge ratio is available. I need to take the ratio from 2004, multipy it by 1.04, and then merge by hospital ID to get cost to charge ratios for 2003. It might make more sense if you look at the code--I want to go from old to new.

data old;

infile datalines;

input Hospid $ Year Ratio;

datalines;

A 2003 .

B 2003 .

C 2003 .

A 2004 .2

B 2004 .24

C 2004 .3

;

run;

and I want to get new by taking the ratio from 2004, multiplying it by 1.04 and then matching it by hospid:

data new;

infile datalines;

input hospid $ year ratio;

datalines;

A 2003 .208

B 2003 .2496

C 2003 .312

A 2004 .2

B 2004 .24

C 2004 .3

;

run;

I have tried two methods:

1. a proc sql left join, in which I made a separate dataset of 2004 observations and then tried to join by hospid. that still gave me missing values

2. also making a separate dataset of 2004 observations, sorting that and the original dataset, then merging by hospid. that mostly worked, except when hospital id appeared more than once in which case i only got a ratio for one of the occurances.

thanks for any advice!

kiranv_ · Posted 07-19-2017 03:36 PM

something like this hsould work for your scenario

proc sql;
create table new as 
select a.hospid, a.year, case when a.hospid=b.hospid and a.ratio is missing 
                               then b.ratio *1.04 
                               else a.ratio
                               end as ratio 
from 
(select hospid, year, ratio
from old)a
left join
(select distinct hospid, ratio from old
where ratio is not missing)b
on a.hospid =b.hospid;
;

lnicholl9 · Posted 07-21-2017 04:40 PM

When I tried this I ended up with more observations than in the original dataset

GreggB · Posted 07-19-2017 03:54 PM

Your only matching criteria is ID but ID is not unique to each observation. Am I reading your data set correctly?

lnicholl9 · Posted 07-19-2017 05:06 PM

I think so, there can be more than one of the same hospital ID within a given year.

Astounding · Posted 07-19-2017 04:37 PM

There are too many cases that you haven't covered here. What should the results be for these combinations?

data old;

infile datalines;

input Hospid $ Year Ratio;

datalines;

A 2003 .

B 2003 .

C 2003 .

A 2004 .2

A 2004 .3

B 2004 .24

B 2004 .26

C 2004 .3

;

Anything is possible. But the rules are incomplete at this point.

lnicholl9 · Posted 07-19-2017 05:04 PM

Sorry, should have been more clear. There would not be a case in which the ratio would be different for two instances of the same hospital.

So this:
A 2004 .3
A 2004 .2

Would never happen. However this could happen:
A 2004 .3
A 2005 .2

But I don't care about years besides 2003 and 2004, I want those years to be unaffected

Astounding · Posted 07-19-2017 05:22 PM

I'm not sure if this would be the fastest way, but it's probably the clearest way.

data just2003 just2004 other_years;

set old;

if year=2003 then output just2003;

else if year = 2004 then output just2004;

else output other_years;

run;

proc sort data=just2003;

by HospID;

run;

proc sort data=just2004 out=ratio2004 nodupkey;

by HospID;

run;

data replace2003;

set ratio2004 (in=in1) just2003 (in=in2);

by HospID;

if first.HospID then do;

if in1 then replacement_ratio = ratio * 1.04;

else replacement_ratio = .;

end;

if in2;

if ratio = . then ratio = replacement_ratio;

drop replacement_ratio;

retain replacement_ratio;

run;

data want;

set replace2003 just2004 other_years;

run;

ChrisNZ · Posted 07-21-2017 12:10 AM

Like this?


proc sql;
   create table WANT as  
     select HOSPID, 2003 as YEAR, RATIO*1.04 as RATIO
     from HAVE
     where YEAR eq 2004
   union all
     select * 
     from HAVE
     where YEAR ne 2003;
quit;

HOSPID	YEAR	RATIO
A	2003	0.208
B	2003	0.2496
C	2003	0.312
A	2004	0.2
B	2004	0.24
C	2004	0.3

High-Performance SAS Coding - Third Edition

lnicholl9 · Posted 07-21-2017 04:49 PM

I tried this code but if there is a hospital in 2003 which is not in the other years, it gets overwritten. see this dataset;

data old;

infile datalines;

input hospid $ year ratio;

datalines;

A 2003 .

B 2003 .

C 2003 .

A 2004 .2

B 2004 .3

D 2004 .4

A 2005 .3

;

run;

output should be this but instead C gets overwriten:

A 2003 .208

B 2003 .312

C 2003 .

A 2004 .2

B 2004 .3

D 2004 .4

A 2005 .3

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away