Re: Correcting a merge dataset

wheddingsjr · Posted 08-03-2020 11:32 PM

Hi all. I am trying to duplicate SAS logic and get the same results that some else did previously merging two datasets. The first dataset (HPHC_EOL_FY20), has 85k lines and the second (EOL_COHORT), has 135 lines. In the original results that I am trying to duplicate the final result has 55k lines, when I run the below code I get 130k. I am no sure what I am doing wrong. Any advise would be appreciated.

PROC SQL;
create table HPHC_FY20_EOL_combined_v2 as
	select med.*,eol.*
		from Dataset1 as med
				left join
			 Dataset2 as eol
			 	on med.member_id=eol.INSUREDSIDNUMBER
		;
Quit;

Jagadishkatam · Posted 08-03-2020 11:58 PM

Without looking at the data, it is difficult to predict. However i assume the previous merge was done if both datasets had med.member_id=eol.INSUREDSIDNUMBER, considering this i would try inner join instead if left join. please try and let me know if it works

PROC SQL;
create table HPHC_FY20_EOL_combined_v2 as
	select med.*,eol.*
		from Dataset1 as med
				inner join
			 Dataset2 as eol
			 	on med.member_id=eol.INSUREDSIDNUMBER
		;
Quit;

Thanks,
Jag

wheddingsjr · Posted 08-04-2020 01:15 AM

Thanks for the response Jag. I tried the inner join and got the same results. I think I have to delete dupes, but not sure how to do that.

Kurt_Bremser · Posted 08-04-2020 02:13 AM

@wheddingsjr wrote:
Thanks for the response Jag. I tried the inner join and got the same results. I think I have to delete dupes, but not sure how to do that.

proc sort
  data=dataset2
  out=dedup
  nodupkey
;
by INSUREDSIDNUMBER;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

wheddingsjr · Posted 08-04-2020 06:47 AM

Thanks Kurt for the response, but that did not work

Kurt_Bremser · Posted 08-04-2020 07:08 AM

Please describe in detail what happened. "Did not work" is not helpful at all, and worthy of the proverbial blonde secretary.

how many observations are in the first dataset?
how many are in the second dataset before and after deduplicating (read the log)?
how many are in the result?

Mind that a left join will ALWAYS have at least the number of observations of the "left" dataset. To build the intersection set, you have to use an inner join.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

wheddingsjr · Posted 08-04-2020 08:37 AM

Hi Kurt

This is what I included in my original post:

Hi all. I am trying to duplicate SAS logic and get the same results that some else did previously merging two datasets. The first dataset (HPHC_EOL_FY20), has 85k lines and the second (EOL_COHORT), has 135 lines. In the original results that I am trying to duplicate the final result has 55k lines, when I run the below code I get 130k. I am no sure what I am doing wrong. Any advise would be appreciated.

Kurt_Bremser · Posted 08-04-2020 08:39 AM

how many are in the second dataset before and after deduplicating (read the log)?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

wheddingsjr · Posted 08-04-2020 09:28 AM

Hi Kurt

This the log:
NOTE: There were 136 observations read from the data set WORK.dataset2.
NOTE: 33 observations with duplicate key values were deleted.
NOTE: The data set WORK.DEDUP has 103 observations and 17 variables.
NOTE: PROCEDURE SORT used (Total process time):

Kurt_Bremser · Posted 08-04-2020 09:45 AM

And how many observations resulted from the join of the large table with your deduped dataset?

Do you need the deduped dataset to create a subset of the large table, or should all observations from the large table be kept?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

wheddingsjr · Posted 08-04-2020 10:07 AM

There were still 130k lines..What I believe needs to happen is to delete all the dupes from the 130k

Kurt_Bremser · Posted 08-04-2020 11:07 AM

Did you change from a left join to an inner join? Please post the complete log of your current join step.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

wheddingsjr · Posted 08-04-2020 11:43 AM

138 PROC SQL;
139 create table HPHC_FY20_EOL_combined_v2 as
140 select med.*,eol.*
141 from HPHC_EOL_FY20 as med
142 inner join
143 EOL_Cohort as eol
144 on med.member_id = eol.INSUREDSIDNUMBER
145 ;
NOTE: Table WORK.HPHC_FY20_EOL_COMBINED_V2 created, with 130170 rows and 35 columns.

Kurt_Bremser · Posted 08-04-2020 11:52 AM

So that simply means your large table consists mostly of member_id's that have a match in the small table.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Registration is open

SAS Training: Just a Click Away