Solved: Re: Merging two datasets & updating the rows in the target table

alaxman · Posted 10-24-2020 09:14 PM

Hello,

I've created a new table called "UPDATE" which fixes the duplicate svcdate in the "OLD" table (highlighted In red font) -- it is a duplicate because this is the same person (as you can also see by dob). I want to merge these tables, taking the enrolid & svcdates from the "UPDATE" table and merge onto the "OLD" dataset (keeping the other variables from the OLD dataset but ensuring it updates the duplicate svcdate issue).

I have used the following merge statement to do this, but it is not working. Is there something else that I need to do? Thanks in advance!

data PERSISTENCE_3M_12M2;

MERGE sum persistence_3M_12M;

by enrolid;

run;

OLD

enrolid	ndcnum	svcdate	daysupp	copay	coins	deduct	dob
A	21	4/25/14	8	0	0	0	x
A	21	4/25/14	8	45	0	0	x
A	22	5/2/14	30	45	0	0	z
A	23	7/22/14	30	45	0	0	q

UPDATE

enrolid	ndcnum	svcdate	daysupp	copay
A	21	4/25/14	8	45
A	22	5/2/14	30	45
A	23	7/22/14	30	45

What I want:

enrolid	ndcnum	svcdate	daysupp	copay	coins	deduct	dob
A	21	4/25/14	8	45	0	0	x
A	22	5/2/14	30	45	0	0	z
A	23	7/22/14	30	45	0	0	q

mkeintz · Posted 10-24-2020 10:31 PM

It appears that your persistence_3m_12M is meant only to identify which case among duplicates to choose. It does not add new information. If so, then:

data want;
  set sum;
  if _n_=1 then do;
    declare hash h (dataset:'persistence_3m_12m');
	  h.definekey(all:'Y');
	  h.definedone();
  end;
  if h.check()=0;
run;

The data do not need to be sorted for this to work, but this program does assume that the dataset persistence_3m_12m is complete - i.e. it identifies every case from sum that you want.

The program create a hash object h, with all the content from persistence_3m_12m. The object is keyed (i.e. it is "indexed" on each combination encountered of all the variable found in persistence_3m_12m). The subsetting if statement merely tests whether the record in sum is found in the hash object. (success here means the h.check() method returns a zero).

Editted note, per your revision of the problem.

If the persistence_3m_12m has other variables to be added to the result, then a minor modification to the hash object definition, and changing h.check() to h.find() will do:

data want;
  set sum persistence_3m_12m (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'persistence_3m_12m');
      h.definekey('enrolid','ndcnum','svcdate','daysupp','copay');
      h.definedata(all:'Y');
	  h.definedone();
  end;
  if h.find()=0;
run;

In this version, only the variables in common between persistence_3m_12m and sum are used in the definekey statement, while all the variables are in the definedata statement. Unlke the check() method, which only confirms the existence of an appropriate match in hash object h, the find method actually retrieves all the data upon successful matching. So this will bring in the other variables from persistence_3m_12m.

The purpose of the (obs=0) applied to the persistence_3m_12m dataset is to obligate the sas compiler to make provision for any variables in the datasets that are not already in the sum dataset, but without reading in any actual data. The data from that file are read in the declare hash statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

jimbarbour · Posted 10-24-2020 09:57 PM

Well, I'd probably do a Sort on the OLD dataset with a NODUPKEY option with Enrolid, ndcnum, and SvcDate as my BY variables before I tried to merge the two. The NODUPKEY will reduce the two duplicate rows to one.

Jim

mkeintz · Posted 10-24-2020 10:31 PM

It appears that your persistence_3m_12M is meant only to identify which case among duplicates to choose. It does not add new information. If so, then:

data want;
  set sum;
  if _n_=1 then do;
    declare hash h (dataset:'persistence_3m_12m');
	  h.definekey(all:'Y');
	  h.definedone();
  end;
  if h.check()=0;
run;

The data do not need to be sorted for this to work, but this program does assume that the dataset persistence_3m_12m is complete - i.e. it identifies every case from sum that you want.

The program create a hash object h, with all the content from persistence_3m_12m. The object is keyed (i.e. it is "indexed" on each combination encountered of all the variable found in persistence_3m_12m). The subsetting if statement merely tests whether the record in sum is found in the hash object. (success here means the h.check() method returns a zero).

Editted note, per your revision of the problem.

If the persistence_3m_12m has other variables to be added to the result, then a minor modification to the hash object definition, and changing h.check() to h.find() will do:

data want;
  set sum persistence_3m_12m (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'persistence_3m_12m');
      h.definekey('enrolid','ndcnum','svcdate','daysupp','copay');
      h.definedata(all:'Y');
	  h.definedone();
  end;
  if h.find()=0;
run;

In this version, only the variables in common between persistence_3m_12m and sum are used in the definekey statement, while all the variables are in the definedata statement. Unlke the check() method, which only confirms the existence of an appropriate match in hash object h, the find method actually retrieves all the data upon successful matching. So this will bring in the other variables from persistence_3m_12m.

The purpose of the (obs=0) applied to the persistence_3m_12m dataset is to obligate the sas compiler to make provision for any variables in the datasets that are not already in the sum dataset, but without reading in any actual data. The data from that file are read in the declare hash statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

alaxman · Posted 10-25-2020 11:28 PM

@mkeintz -- while your solution worked for this particular example, I realized that I did not give you a case in which new information is in fact added. For example, for the "sum" below, "daysupp" changes to 8, since I've added the 1+7 columns.

OLD (PERSISTENCE_3M_12M)

Enrolid	ndcnum	coins	copay	deduct	cob	svcdate	daysupp	DOB
1	21	8.88	0	0	0	4/23/13	1	a
1	21	0	20	0	0	4/23/13	7	a
1	22	0	20	0	0	4/30/13	14	a

SUM

Enrolid	ndcnum	coins	copay	deduct	cob	svcdate	daysupp
1	21	8.88	20	0	0	4/23/13	8

WHAT I WANT (PERSISTENCE_3M_12M2)

Enrolid	ndcnum	coins	copay	deduct	cob	svcdate	daysupp	DOB
1	21	8.88	20	0	0	4/23/13	8	a
1	22	0	20	0	0	4/30/13	14	a

When I used your code below, it deleted the svcdate 4/23/13 observation altogether & only included the 4/30/13 service date.

data PERSISTENCE_3M_12M2;

set PERSISTENCE_3M_12M;

if _n_=1 then do;

declare hash h (dataset:'sum');

h.definekey(all:'Y');

h.definedone();

end;

if h.check()=0;

run;

Thanks so much, would really appreciate any further advice that you have!

jimbarbour · Posted 10-25-2020 11:33 PM

@alaxman,

I believe the solution I proposed already accommodates actual updates. Have you taken a look at it? Because it uses COALESCE, you only have to code the data values you wish to update. The keys all have to be coded in order to match updates, but only the data that changes need be coded. The remaining columns can be missing.

Jim

alaxman · Posted 10-26-2020 07:41 PM

@jimbarbour yes thank you!! I actually ended up using your example since it accommodates updates. Thank you very much!

jimbarbour · Posted 10-24-2020 11:52 PM

@alaxman,

Below is how I would code what I indicated in my original reply. @mkeintz's looks good. My method would just be different. Some people like to use SQL. Some people like to use Hash tables. Take your pick.

Jim

DATA	OLD;
	INFILE	Datalines	DSD	DLM='09'X;
	INPUT	EnrolId	$
			NdcNum	$
			SvcDate	:	ANYDTDTE8.
			DaySupp	
			CoPay
			CoIns
			Deduct
			DoB		$
			;
	FORMAT	SvcDate	MMDDYYS10.;
DATALINES;
A	21	4/25/14	8	0	0	0	x
A	21	4/25/14	8	45	0	0	x
A	22	5/2/14	30	45	0	0	z
A	23	7/22/14	30	45	0	0	q
RUN; 

DATA	UPDATES;
	INFILE	Datalines	DSD	DLM='09'X;
	INPUT	EnrolId	$
			NdcNum	$
			SvcDate	:	ANYDTDTE8.
			DaySupp	
			CoPay
			;
DATALINES;
A	21	4/25/14	8	45
A	22	5/2/14	30	45
A	23	7/22/14	30	45
RUN;

PROC	SORT	DATA=OLD	NODUPKEY;
	BY	EnrolId	NdcNum	SvcDate;
RUN;

PROC	SQL;
	CREATE	TABLE	What_I_want		AS
		SELECT	Old.EnrolId
				,Old.NdcNum
				,Old.SvcDate
				,COALESCE(Upd.DaySupp,	Old.DaySupp)	AS	DaySupp
				,COALESCE(Upd.CoPay,	Old.CoPay)		AS	CoPay
				,Old.CoIns
				,Old.Deduct
				,Old.DoB
			FROM			Old		AS	Old
			INNER	JOIN	Updates	AS	Upd
				ON	Old.EnrolId	=	Upd.EnrolId
				AND	Old.NdcNum	=	Upd.NdcNum
				AND	Old.SvcDate	=	Upd.SvcDate
				;
QUIT;

alaxman · Posted 10-25-2020 12:39 AM

Thank you both @mkeintz & @jimbarbour ! Really appreciate it!!

jimbarbour · Posted 10-25-2020 01:02 AM

You're welcome.

The one possible advantage to the SQL approach that I laid out is that if you use COALESCE (which takes the first non-missing value), you only have to code the columns you actually want to update. Take a look at a more fully fleshed out version of the code, below.

Notice that in the UPDATES Data step, all the values are missing except the one I actually intend to update. There's no need to code the rest of the values. If a value is missing in the Updates data, then the COALESCE will just keep the non-missing value in the old data. In fact, in the Updates data, since all the values are missing for the 2nd and 3rd rows, I can omit those two rows completely.

Be careful with a period as a missing value for character data. For format $, SAS will interpret a period as a missing value, but for formats $CHAR and $VARYING, SAS will interpret a period as a period.

Jim

DATA	OLD;
	INFILE	Datalines	DSD	DLM='09'X;
	INPUT	EnrolId	$
			NdcNum	$
			SvcDate	:	ANYDTDTE8.
			DaySupp	
			CoPay
			CoIns
			Deduct
			DoB		$
			;
	FORMAT	SvcDate	MMDDYYS10.;
DATALINES;
A	21	4/25/14	8	0	0	0	x
A	21	4/25/14	8	45	0	0	x
A	22	5/2/14	30	45	0	0	z
A	23	7/22/14	30	45	0	0	q
RUN; 

DATA	UPDATES;
	INFILE	Datalines	DSD	DLM='09'X;
	INPUT	EnrolId	$
			NdcNum	$
			SvcDate	:	ANYDTDTE8.
			DaySupp	
			CoPay
			CoIns
			Deduct
			DoB		$
			;
DATALINES;
A	21	4/25/14	.	45	.	.	.
A	22	5/2/14	.	.	.	.	.
A	23	7/22/14	.	.	.	.	.
RUN;

PROC	SORT	DATA=OLD	NODUPKEY;
	BY	EnrolId	NdcNum	SvcDate;
RUN;

PROC	SQL;
	CREATE	TABLE	What_I_want		AS
		SELECT	Old.EnrolId
				,Old.NdcNum
				,Old.SvcDate
				,COALESCE(Upd.DaySupp,	Old.DaySupp)	AS	DaySupp
				,COALESCE(Upd.CoPay,	Old.CoPay)		AS	CoPay
				,COALESCE(Upd.CoIns,	Old.CoIns)		AS	CoIns
				,COALESCE(Upd.Deduct,	Old.Deduct)		AS	Deduct
				,COALESCE(Upd.DoB,		Old.DoB)		AS	DoB
			FROM			Old		AS	Old
			LEFT	JOIN	Updates	AS	Upd
				ON	Old.EnrolId	=	Upd.EnrolId
				AND	Old.NdcNum	=	Upd.NdcNum
				AND	Old.SvcDate	=	Upd.SvcDate
				;
QUIT;

Classroom Training Available!