Solved: Re: Join produces duplicates

rapt1 · Posted 07-05-2021 09:26 AM

Hi everyone,

I am trying to join tables again using the code below. userlist has 300,000 observations while utilization has about 1,000,000 observations:

proc sql;
create table WORK.MERGED as
select t1.*,
	t2.usage,
	t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;

However, MERGED had around 300,100 observations in the output. I found that there were duplicates in the MERGED table by using another code such as this, to eliminate them. The output did show the 100 dupes:

proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;

Now, this works fine for now. Here are my questions:

1. I made sure that there are no duplicates in both the left and right table. So why is it still producing duplicates?

2. When I ran code block 1 with another set of table, I still had duplicates. When I ran code block 2, it did show the number of duplicates but I am still not getting the same amount of observations in the original table and the output table.

Any insights on this problem?

Reeza · Posted 07-05-2021 01:03 PM

You're checking for duplicates across all your variables not the variables you're joining on so how you're checking for duplicates is wrong.

You're checking for full row duplicates ie
ClientID Date Name
1 2020/01/01 Dave
1 2021/01/01 Dave

These would not be duplicates since you're checking for _all_ variables in terms of duplicates. Then if you join by ClientID you'll still get duplicates because you have duplicate IDs.

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by clientID;
run;

View solution in original post

Astounding · Posted 07-05-2021 09:44 AM

First step: check again. It sure looks like you have duplicates in the utilization table.

rapt1 · Posted 07-05-2021 09:49 AM

I have checked and there were no duplicates. I also checked the userlist and there were none.

I forgot to mention that I was using left join because I had to capture those users in userlist whether they had usage/price values in utilization. So I was only expecting 300,000 users in the merged table.

Tom · Posted 07-05-2021 09:59 AM

How did you test for duplicates? You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable.

rapt1 · Posted 07-05-2021 10:15 AM

I used the same one for checking for dupes:

proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;

"You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."

That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.

Tom · Posted 07-05-2021 11:10 AM

Change the BY statement to include just the client_id variable.

Reeza · Posted 07-05-2021 11:12 AM

Show your full code. Maybe you referred to the unsorted data set somewhere accidentally? Or maybe you're using the wrong variable or specification somewhere.

If you get multiples you MUST have duplicates somewhere, SAS doesn't generate matches out of thin air so its in your data somewhere. We can't see your data or code so we can't tell you where you may be going wrong.

@rapt1 wrote:

I used the same one for checking for dupes:
proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;
"You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."

That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.

rapt1 · Posted 07-05-2021 12:22 PM

Sharing the full code. I am using EG and I try to run these by code block,

First, I made copies of the tables and checked for dupes at the same time, One for the userlist, which has 300,000 rows:

proc sql;
create table WORK.userlist as
select *
from filelocation1;
quit;

proc sort data=WORK.userlist nodupkey dupout=userdups_checking;
by _all_;
run;

There are 0 observations that came out of userdups_checking. In another process flow, I made a table for the utilization (has 1 million rows). There are also no dupes as a result:

proc sql;
create table WORK.utilization as
select *
from filelocation2;
quit;

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by _all_;
run;

Then I merged the table in another process flow:

proc sql;
create table WORK.MERGED as
select t1.*,
	t2.usage,
	t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;

proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;

This is where I am getting the duplicates at this point (300,100).

Let me know if there's anything else I need to share

Reeza · Posted 07-05-2021 01:03 PM

You're checking for duplicates across all your variables not the variables you're joining on so how you're checking for duplicates is wrong.

You're checking for full row duplicates ie
ClientID Date Name
1 2020/01/01 Dave
1 2021/01/01 Dave

These would not be duplicates since you're checking for _all_ variables in terms of duplicates. Then if you join by ClientID you'll still get duplicates because you have duplicate IDs.

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by clientID;
run;

rapt1 · Posted 07-06-2021 06:18 AM

This makes sense. I tried it and I am getting the output I am looking for. Thank you so much guys!

SAS Innovate 2025: Save the Date