BookmarkSubscribeRSS Feed
wheddingsjr
Pyrite | Level 9

Hi all. I am trying to duplicate SAS logic and get the same results that some else did previously merging two datasets. The first dataset (HPHC_EOL_FY20), has 85k lines and the second (EOL_COHORT), has 135 lines. In the original results that I am trying to duplicate the final result has 55k lines, when I run the below code I get 130k. I am no sure what I am doing wrong. Any advise would be appreciated.

PROC SQL;
create table HPHC_FY20_EOL_combined_v2 as
	select med.*,eol.*
		from Dataset1 as med
				left join
			 Dataset2 as eol
			 	on med.member_id=eol.INSUREDSIDNUMBER
		;
Quit;
13 REPLIES 13
Jagadishkatam
Amethyst | Level 16

Without looking at the data, it is difficult to predict. However i assume the previous merge was done if both datasets had med.member_id=eol.INSUREDSIDNUMBER, considering this i would try inner join instead if left join. please try and let me know if it works

 

PROC SQL;
create table HPHC_FY20_EOL_combined_v2 as
	select med.*,eol.*
		from Dataset1 as med
				inner join
			 Dataset2 as eol
			 	on med.member_id=eol.INSUREDSIDNUMBER
		;
Quit;
Thanks,
Jag
wheddingsjr
Pyrite | Level 9
Thanks for the response Jag. I tried the inner join and got the same results. I think I have to delete dupes, but not sure how to do that.
Kurt_Bremser
Super User

@wheddingsjr wrote:
Thanks for the response Jag. I tried the inner join and got the same results. I think I have to delete dupes, but not sure how to do that.
proc sort
  data=dataset2
  out=dedup
  nodupkey
;
by INSUREDSIDNUMBER;
run;
wheddingsjr
Pyrite | Level 9
Thanks Kurt for the response, but that did not work
Kurt_Bremser
Super User

Please describe in detail what happened. "Did not work" is not helpful at all, and worthy of the proverbial blonde secretary.

  • how many observations are in the first dataset?
  • how many are in the second dataset before and after deduplicating (read the log)?
  • how many are in the result?

Mind that a left join will ALWAYS have at least the number of observations of the "left" dataset. To build the intersection set, you have to use an inner join.

wheddingsjr
Pyrite | Level 9
Hi Kurt

This is what I included in my original post:

Hi all. I am trying to duplicate SAS logic and get the same results that some else did previously merging two datasets. The first dataset (HPHC_EOL_FY20), has 85k lines and the second (EOL_COHORT), has 135 lines. In the original results that I am trying to duplicate the final result has 55k lines, when I run the below code I get 130k. I am no sure what I am doing wrong. Any advise would be appreciated.
wheddingsjr
Pyrite | Level 9
Hi Kurt

This the log:
NOTE: There were 136 observations read from the data set WORK.dataset2.
NOTE: 33 observations with duplicate key values were deleted.
NOTE: The data set WORK.DEDUP has 103 observations and 17 variables.
NOTE: PROCEDURE SORT used (Total process time):
Kurt_Bremser
Super User

And how many observations resulted from the join of the large table with your deduped dataset?

Do you need the deduped dataset to create a subset of the large table, or should all observations from the large table be kept?

wheddingsjr
Pyrite | Level 9
There were still 130k lines..What I believe needs to happen is to delete all the dupes from the 130k
wheddingsjr
Pyrite | Level 9
138 PROC SQL;
139 create table HPHC_FY20_EOL_combined_v2 as
140 select med.*,eol.*
141 from HPHC_EOL_FY20 as med
142 inner join
143 EOL_Cohort as eol
144 on med.member_id = eol.INSUREDSIDNUMBER
145 ;
NOTE: Table WORK.HPHC_FY20_EOL_COMBINED_V2 created, with 130170 rows and 35 columns.
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 2410 views
  • 0 likes
  • 3 in conversation