BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
alaxman
Obsidian | Level 7

Hello, 

 

I've created a new table called "UPDATE" which fixes the duplicate svcdate in the "OLD" table (highlighted In red font) -- it is a duplicate because this is the same person (as you can also see by dob). I want to merge these tables, taking the enrolid & svcdates from the "UPDATE"  table and merge onto the "OLD" dataset (keeping the other variables from the OLD dataset but ensuring it updates the duplicate svcdate issue). 

 

I have used the following merge statement to do this, but it is not working. Is there something else that I need to do? Thanks in advance!

 

data PERSISTENCE_3M_12M2;

MERGE sum persistence_3M_12M;

by enrolid;

run;

 

 

OLD

enrolid ndcnum svcdate daysupp copay coins deduct dob
A 21 4/25/14 8 0 0 0 x
A 21 4/25/14 8 45 0 0 x
A 22 5/2/14 30 45 0 0 z
A 23 7/22/14 30 45 0 0 q

 

UPDATE

enrolid ndcnum svcdate daysupp copay
A 21 4/25/14 8 45
A 22 5/2/14 30 45
A 23 7/22/14 30 45

 

What I want: 

enrolid ndcnum svcdate daysupp copay coins deduct dob
A 21 4/25/14 8 45 0 0 x
A 22 5/2/14 30 45 0 0 z
A 23 7/22/14 30 45 0 0 q

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

It appears that your persistence_3m_12M is meant only to identify which case among duplicates to choose.  It does not add new information.  If so, then:

data want;
  set sum;
  if _n_=1 then do;
    declare hash h (dataset:'persistence_3m_12m');
	  h.definekey(all:'Y');
	  h.definedone();
  end;
  if h.check()=0;
run;

The data do not need to be sorted for this to work, but this program does assume that the dataset persistence_3m_12m is complete - i.e. it identifies every case from sum that you want.

 

The program create a hash object h, with all the content from persistence_3m_12m.  The object is keyed (i.e. it is "indexed" on each combination encountered of all the variable found in persistence_3m_12m).  The subsetting if statement merely tests whether the record in sum is found in the hash object.  (success here means the h.check() method returns a zero).

 

Editted note, per your revision of the problem.

 

If the persistence_3m_12m has other variables to be added to the result, then a minor modification to the hash object definition, and changing h.check() to h.find() will do:

 

data want;
  set sum persistence_3m_12m (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'persistence_3m_12m');
      h.definekey('enrolid','ndcnum','svcdate','daysupp','copay');
      h.definedata(all:'Y');
	  h.definedone();
  end;
  if h.find()=0;
run;

In this version, only the variables in common between persistence_3m_12m and sum are used in the definekey statement, while all the variables are in the definedata statement.  Unlke the check() method, which only confirms the existence of an appropriate match in hash object h, the find method actually retrieves all the data upon successful matching.  So this will bring in the other variables from persistence_3m_12m.

 

The purpose of the (obs=0) applied to the persistence_3m_12m dataset is to obligate the sas compiler to make provision for any variables in the datasets that are not already in the sum dataset, but without reading in any actual data.  The data from that file are read in the declare hash statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

8 REPLIES 8
jimbarbour
Meteorite | Level 14
Well, I'd probably do a Sort on the OLD dataset with a NODUPKEY option with Enrolid, ndcnum, and SvcDate as my BY variables before I tried to merge the two. The NODUPKEY will reduce the two duplicate rows to one.

Jim
mkeintz
PROC Star

It appears that your persistence_3m_12M is meant only to identify which case among duplicates to choose.  It does not add new information.  If so, then:

data want;
  set sum;
  if _n_=1 then do;
    declare hash h (dataset:'persistence_3m_12m');
	  h.definekey(all:'Y');
	  h.definedone();
  end;
  if h.check()=0;
run;

The data do not need to be sorted for this to work, but this program does assume that the dataset persistence_3m_12m is complete - i.e. it identifies every case from sum that you want.

 

The program create a hash object h, with all the content from persistence_3m_12m.  The object is keyed (i.e. it is "indexed" on each combination encountered of all the variable found in persistence_3m_12m).  The subsetting if statement merely tests whether the record in sum is found in the hash object.  (success here means the h.check() method returns a zero).

 

Editted note, per your revision of the problem.

 

If the persistence_3m_12m has other variables to be added to the result, then a minor modification to the hash object definition, and changing h.check() to h.find() will do:

 

data want;
  set sum persistence_3m_12m (obs=0);
  if _n_=1 then do;
    declare hash h (dataset:'persistence_3m_12m');
      h.definekey('enrolid','ndcnum','svcdate','daysupp','copay');
      h.definedata(all:'Y');
	  h.definedone();
  end;
  if h.find()=0;
run;

In this version, only the variables in common between persistence_3m_12m and sum are used in the definekey statement, while all the variables are in the definedata statement.  Unlke the check() method, which only confirms the existence of an appropriate match in hash object h, the find method actually retrieves all the data upon successful matching.  So this will bring in the other variables from persistence_3m_12m.

 

The purpose of the (obs=0) applied to the persistence_3m_12m dataset is to obligate the sas compiler to make provision for any variables in the datasets that are not already in the sum dataset, but without reading in any actual data.  The data from that file are read in the declare hash statement.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
alaxman
Obsidian | Level 7

@mkeintz -- while your solution worked for this particular example, I realized that I did not give you a case in which new information is in fact added. For example, for the "sum" below, "daysupp" changes to 8, since I've added the 1+7 columns. 

 

OLD (PERSISTENCE_3M_12M)

Enrolid ndcnum coins copay deduct cob svcdate daysupp DOB
1 21 8.88 0 0 0 4/23/13 1 a
1 21 0 20 0 0 4/23/13 7 a
1 22 0 20 0 0 4/30/13 14 a

 

SUM

Enrolid ndcnum coins copay deduct cob svcdate daysupp
1 21 8.88 20 0 0 4/23/13 8

 

WHAT I WANT (PERSISTENCE_3M_12M2)

Enrolid ndcnum coins copay deduct cob svcdate daysupp DOB
1 21 8.88 20 0 0 4/23/13 8 a
1 22 0 20 0 0 4/30/13 14 a

 

When I used your code below,  it deleted the svcdate 4/23/13 observation altogether & only included the 4/30/13 service date. 

 

data PERSISTENCE_3M_12M2;

  set PERSISTENCE_3M_12M;

  if _n_=1 then do;

    declare hash h (dataset:'sum');

  h.definekey(all:'Y');

  h.definedone();

  end;

  if h.check()=0;

run;

 

Thanks so much, would really appreciate any further advice that you have! 

 

jimbarbour
Meteorite | Level 14

@alaxman,

 

I believe the solution I proposed already accommodates actual updates.  Have you taken a look at it?  Because it uses COALESCE, you only have to code the data values you wish to update.  The keys all have to be coded in order to match updates, but only the data that changes need be coded.  The remaining columns can be missing.

 

Jim

alaxman
Obsidian | Level 7

@jimbarbour yes thank you!! I actually ended up using your example since it accommodates updates. Thank you very much! 

jimbarbour
Meteorite | Level 14

@alaxman,

 

Below is how I would code what I indicated in my original reply.  @mkeintz's looks good.  My method would just be different.  Some people like to use SQL. Some people like to use Hash tables.  Take your pick.

 

Jim

 

DATA	OLD;
	INFILE	Datalines	DSD	DLM='09'X;
	INPUT	EnrolId	$
			NdcNum	$
			SvcDate	:	ANYDTDTE8.
			DaySupp	
			CoPay
			CoIns
			Deduct
			DoB		$
			;
	FORMAT	SvcDate	MMDDYYS10.;
DATALINES;
A	21	4/25/14	8	0	0	0	x
A	21	4/25/14	8	45	0	0	x
A	22	5/2/14	30	45	0	0	z
A	23	7/22/14	30	45	0	0	q
RUN; 

DATA	UPDATES;
	INFILE	Datalines	DSD	DLM='09'X;
	INPUT	EnrolId	$
			NdcNum	$
			SvcDate	:	ANYDTDTE8.
			DaySupp	
			CoPay
			;
DATALINES;
A	21	4/25/14	8	45
A	22	5/2/14	30	45
A	23	7/22/14	30	45
RUN;

PROC	SORT	DATA=OLD	NODUPKEY;
	BY	EnrolId	NdcNum	SvcDate;
RUN;

PROC	SQL;
	CREATE	TABLE	What_I_want		AS
		SELECT	Old.EnrolId
				,Old.NdcNum
				,Old.SvcDate
				,COALESCE(Upd.DaySupp,	Old.DaySupp)	AS	DaySupp
				,COALESCE(Upd.CoPay,	Old.CoPay)		AS	CoPay
				,Old.CoIns
				,Old.Deduct
				,Old.DoB
			FROM			Old		AS	Old
			INNER	JOIN	Updates	AS	Upd
				ON	Old.EnrolId	=	Upd.EnrolId
				AND	Old.NdcNum	=	Upd.NdcNum
				AND	Old.SvcDate	=	Upd.SvcDate
				;
QUIT;
alaxman
Obsidian | Level 7

Thank you both @mkeintz & @jimbarbour ! Really appreciate it!! 

jimbarbour
Meteorite | Level 14

You're welcome.

 

The one possible advantage to the SQL approach that I laid out is that if you use COALESCE (which takes the first non-missing value), you only have to code the columns you actually want to update.  Take a look at a more fully fleshed out version of the code, below.

 

Notice that in the UPDATES Data step, all the values are missing except the one I actually intend to update.  There's no need to code the rest of the values.  If a value is missing in the Updates data, then the COALESCE will just keep the non-missing value in the old data.  In fact, in the Updates data, since all the values are missing for the 2nd and 3rd rows, I can omit those two rows completely.

 

Be careful with a period as a missing value for character data.  For format $, SAS will interpret a period as a missing value, but for formats $CHAR and $VARYING, SAS will interpret a period as a period.

 

Jim

DATA	OLD;
	INFILE	Datalines	DSD	DLM='09'X;
	INPUT	EnrolId	$
			NdcNum	$
			SvcDate	:	ANYDTDTE8.
			DaySupp	
			CoPay
			CoIns
			Deduct
			DoB		$
			;
	FORMAT	SvcDate	MMDDYYS10.;
DATALINES;
A	21	4/25/14	8	0	0	0	x
A	21	4/25/14	8	45	0	0	x
A	22	5/2/14	30	45	0	0	z
A	23	7/22/14	30	45	0	0	q
RUN; 

DATA	UPDATES;
	INFILE	Datalines	DSD	DLM='09'X;
	INPUT	EnrolId	$
			NdcNum	$
			SvcDate	:	ANYDTDTE8.
			DaySupp	
			CoPay
			CoIns
			Deduct
			DoB		$
			;
DATALINES;
A	21	4/25/14	.	45	.	.	.
A	22	5/2/14	.	.	.	.	.
A	23	7/22/14	.	.	.	.	.
RUN;

PROC	SORT	DATA=OLD	NODUPKEY;
	BY	EnrolId	NdcNum	SvcDate;
RUN;

PROC	SQL;
	CREATE	TABLE	What_I_want		AS
		SELECT	Old.EnrolId
				,Old.NdcNum
				,Old.SvcDate
				,COALESCE(Upd.DaySupp,	Old.DaySupp)	AS	DaySupp
				,COALESCE(Upd.CoPay,	Old.CoPay)		AS	CoPay
				,COALESCE(Upd.CoIns,	Old.CoIns)		AS	CoIns
				,COALESCE(Upd.Deduct,	Old.Deduct)		AS	Deduct
				,COALESCE(Upd.DoB,		Old.DoB)		AS	DoB
			FROM			Old		AS	Old
			LEFT	JOIN	Updates	AS	Upd
				ON	Old.EnrolId	=	Upd.EnrolId
				AND	Old.NdcNum	=	Upd.NdcNum
				AND	Old.SvcDate	=	Upd.SvcDate
				;
QUIT;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 2941 views
  • 7 likes
  • 3 in conversation