Hello,
I've created a new table called "UPDATE" which fixes the duplicate svcdate in the "OLD" table (highlighted In red font) -- it is a duplicate because this is the same person (as you can also see by dob). I want to merge these tables, taking the enrolid & svcdates from the "UPDATE" table and merge onto the "OLD" dataset (keeping the other variables from the OLD dataset but ensuring it updates the duplicate svcdate issue).
I have used the following merge statement to do this, but it is not working. Is there something else that I need to do? Thanks in advance!
data PERSISTENCE_3M_12M2;
MERGE sum persistence_3M_12M;
by enrolid;
run;
OLD
enrolid | ndcnum | svcdate | daysupp | copay | coins | deduct | dob |
A | 21 | 4/25/14 | 8 | 0 | 0 | 0 | x |
A | 21 | 4/25/14 | 8 | 45 | 0 | 0 | x |
A | 22 | 5/2/14 | 30 | 45 | 0 | 0 | z |
A | 23 | 7/22/14 | 30 | 45 | 0 | 0 | q |
UPDATE
enrolid | ndcnum | svcdate | daysupp | copay |
A | 21 | 4/25/14 | 8 | 45 |
A | 22 | 5/2/14 | 30 | 45 |
A | 23 | 7/22/14 | 30 | 45 |
What I want:
enrolid | ndcnum | svcdate | daysupp | copay | coins | deduct | dob |
A | 21 | 4/25/14 | 8 | 45 | 0 | 0 | x |
A | 22 | 5/2/14 | 30 | 45 | 0 | 0 | z |
A | 23 | 7/22/14 | 30 | 45 | 0 | 0 | q |
It appears that your persistence_3m_12M is meant only to identify which case among duplicates to choose. It does not add new information. If so, then:
data want;
set sum;
if _n_=1 then do;
declare hash h (dataset:'persistence_3m_12m');
h.definekey(all:'Y');
h.definedone();
end;
if h.check()=0;
run;
The data do not need to be sorted for this to work, but this program does assume that the dataset persistence_3m_12m is complete - i.e. it identifies every case from sum that you want.
The program create a hash object h, with all the content from persistence_3m_12m. The object is keyed (i.e. it is "indexed" on each combination encountered of all the variable found in persistence_3m_12m). The subsetting if statement merely tests whether the record in sum is found in the hash object. (success here means the h.check() method returns a zero).
Editted note, per your revision of the problem.
If the persistence_3m_12m has other variables to be added to the result, then a minor modification to the hash object definition, and changing h.check() to h.find() will do:
data want;
set sum persistence_3m_12m (obs=0);
if _n_=1 then do;
declare hash h (dataset:'persistence_3m_12m');
h.definekey('enrolid','ndcnum','svcdate','daysupp','copay');
h.definedata(all:'Y');
h.definedone();
end;
if h.find()=0;
run;
In this version, only the variables in common between persistence_3m_12m and sum are used in the definekey statement, while all the variables are in the definedata statement. Unlke the check() method, which only confirms the existence of an appropriate match in hash object h, the find method actually retrieves all the data upon successful matching. So this will bring in the other variables from persistence_3m_12m.
The purpose of the (obs=0) applied to the persistence_3m_12m dataset is to obligate the sas compiler to make provision for any variables in the datasets that are not already in the sum dataset, but without reading in any actual data. The data from that file are read in the declare hash statement.
It appears that your persistence_3m_12M is meant only to identify which case among duplicates to choose. It does not add new information. If so, then:
data want;
set sum;
if _n_=1 then do;
declare hash h (dataset:'persistence_3m_12m');
h.definekey(all:'Y');
h.definedone();
end;
if h.check()=0;
run;
The data do not need to be sorted for this to work, but this program does assume that the dataset persistence_3m_12m is complete - i.e. it identifies every case from sum that you want.
The program create a hash object h, with all the content from persistence_3m_12m. The object is keyed (i.e. it is "indexed" on each combination encountered of all the variable found in persistence_3m_12m). The subsetting if statement merely tests whether the record in sum is found in the hash object. (success here means the h.check() method returns a zero).
Editted note, per your revision of the problem.
If the persistence_3m_12m has other variables to be added to the result, then a minor modification to the hash object definition, and changing h.check() to h.find() will do:
data want;
set sum persistence_3m_12m (obs=0);
if _n_=1 then do;
declare hash h (dataset:'persistence_3m_12m');
h.definekey('enrolid','ndcnum','svcdate','daysupp','copay');
h.definedata(all:'Y');
h.definedone();
end;
if h.find()=0;
run;
In this version, only the variables in common between persistence_3m_12m and sum are used in the definekey statement, while all the variables are in the definedata statement. Unlke the check() method, which only confirms the existence of an appropriate match in hash object h, the find method actually retrieves all the data upon successful matching. So this will bring in the other variables from persistence_3m_12m.
The purpose of the (obs=0) applied to the persistence_3m_12m dataset is to obligate the sas compiler to make provision for any variables in the datasets that are not already in the sum dataset, but without reading in any actual data. The data from that file are read in the declare hash statement.
@mkeintz -- while your solution worked for this particular example, I realized that I did not give you a case in which new information is in fact added. For example, for the "sum" below, "daysupp" changes to 8, since I've added the 1+7 columns.
OLD (PERSISTENCE_3M_12M)
Enrolid | ndcnum | coins | copay | deduct | cob | svcdate | daysupp | DOB |
1 | 21 | 8.88 | 0 | 0 | 0 | 4/23/13 | 1 | a |
1 | 21 | 0 | 20 | 0 | 0 | 4/23/13 | 7 | a |
1 | 22 | 0 | 20 | 0 | 0 | 4/30/13 | 14 | a |
SUM
Enrolid | ndcnum | coins | copay | deduct | cob | svcdate | daysupp |
1 | 21 | 8.88 | 20 | 0 | 0 | 4/23/13 | 8 |
WHAT I WANT (PERSISTENCE_3M_12M2)
Enrolid | ndcnum | coins | copay | deduct | cob | svcdate | daysupp | DOB |
1 | 21 | 8.88 | 20 | 0 | 0 | 4/23/13 | 8 | a |
1 | 22 | 0 | 20 | 0 | 0 | 4/30/13 | 14 | a |
When I used your code below, it deleted the svcdate 4/23/13 observation altogether & only included the 4/30/13 service date.
data PERSISTENCE_3M_12M2;
set PERSISTENCE_3M_12M;
if _n_=1 then do;
declare hash h (dataset:'sum');
h.definekey(all:'Y');
h.definedone();
end;
if h.check()=0;
run;
Thanks so much, would really appreciate any further advice that you have!
I believe the solution I proposed already accommodates actual updates. Have you taken a look at it? Because it uses COALESCE, you only have to code the data values you wish to update. The keys all have to be coded in order to match updates, but only the data that changes need be coded. The remaining columns can be missing.
Jim
@jimbarbour yes thank you!! I actually ended up using your example since it accommodates updates. Thank you very much!
Below is how I would code what I indicated in my original reply. @mkeintz's looks good. My method would just be different. Some people like to use SQL. Some people like to use Hash tables. Take your pick.
Jim
DATA OLD;
INFILE Datalines DSD DLM='09'X;
INPUT EnrolId $
NdcNum $
SvcDate : ANYDTDTE8.
DaySupp
CoPay
CoIns
Deduct
DoB $
;
FORMAT SvcDate MMDDYYS10.;
DATALINES;
A 21 4/25/14 8 0 0 0 x
A 21 4/25/14 8 45 0 0 x
A 22 5/2/14 30 45 0 0 z
A 23 7/22/14 30 45 0 0 q
RUN;
DATA UPDATES;
INFILE Datalines DSD DLM='09'X;
INPUT EnrolId $
NdcNum $
SvcDate : ANYDTDTE8.
DaySupp
CoPay
;
DATALINES;
A 21 4/25/14 8 45
A 22 5/2/14 30 45
A 23 7/22/14 30 45
RUN;
PROC SORT DATA=OLD NODUPKEY;
BY EnrolId NdcNum SvcDate;
RUN;
PROC SQL;
CREATE TABLE What_I_want AS
SELECT Old.EnrolId
,Old.NdcNum
,Old.SvcDate
,COALESCE(Upd.DaySupp, Old.DaySupp) AS DaySupp
,COALESCE(Upd.CoPay, Old.CoPay) AS CoPay
,Old.CoIns
,Old.Deduct
,Old.DoB
FROM Old AS Old
INNER JOIN Updates AS Upd
ON Old.EnrolId = Upd.EnrolId
AND Old.NdcNum = Upd.NdcNum
AND Old.SvcDate = Upd.SvcDate
;
QUIT;
Thank you both @mkeintz & @jimbarbour ! Really appreciate it!!
You're welcome.
The one possible advantage to the SQL approach that I laid out is that if you use COALESCE (which takes the first non-missing value), you only have to code the columns you actually want to update. Take a look at a more fully fleshed out version of the code, below.
Notice that in the UPDATES Data step, all the values are missing except the one I actually intend to update. There's no need to code the rest of the values. If a value is missing in the Updates data, then the COALESCE will just keep the non-missing value in the old data. In fact, in the Updates data, since all the values are missing for the 2nd and 3rd rows, I can omit those two rows completely.
Be careful with a period as a missing value for character data. For format $, SAS will interpret a period as a missing value, but for formats $CHAR and $VARYING, SAS will interpret a period as a period.
Jim
DATA OLD;
INFILE Datalines DSD DLM='09'X;
INPUT EnrolId $
NdcNum $
SvcDate : ANYDTDTE8.
DaySupp
CoPay
CoIns
Deduct
DoB $
;
FORMAT SvcDate MMDDYYS10.;
DATALINES;
A 21 4/25/14 8 0 0 0 x
A 21 4/25/14 8 45 0 0 x
A 22 5/2/14 30 45 0 0 z
A 23 7/22/14 30 45 0 0 q
RUN;
DATA UPDATES;
INFILE Datalines DSD DLM='09'X;
INPUT EnrolId $
NdcNum $
SvcDate : ANYDTDTE8.
DaySupp
CoPay
CoIns
Deduct
DoB $
;
DATALINES;
A 21 4/25/14 . 45 . . .
A 22 5/2/14 . . . . .
A 23 7/22/14 . . . . .
RUN;
PROC SORT DATA=OLD NODUPKEY;
BY EnrolId NdcNum SvcDate;
RUN;
PROC SQL;
CREATE TABLE What_I_want AS
SELECT Old.EnrolId
,Old.NdcNum
,Old.SvcDate
,COALESCE(Upd.DaySupp, Old.DaySupp) AS DaySupp
,COALESCE(Upd.CoPay, Old.CoPay) AS CoPay
,COALESCE(Upd.CoIns, Old.CoIns) AS CoIns
,COALESCE(Upd.Deduct, Old.Deduct) AS Deduct
,COALESCE(Upd.DoB, Old.DoB) AS DoB
FROM Old AS Old
LEFT JOIN Updates AS Upd
ON Old.EnrolId = Upd.EnrolId
AND Old.NdcNum = Upd.NdcNum
AND Old.SvcDate = Upd.SvcDate
;
QUIT;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.