topic Re: Join produces duplicates in New SAS User

Join produces duplicates

rapt1 — Mon, 05 Jul 2021 13:26:49 GMT

Hi everyone,

I am trying to join tables again using the code below. userlist has 300,000 observations while utilization has about 1,000,000 observations:

proc sql;
create table WORK.MERGED as
select t1.*,
	t2.usage,
	t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;

However, MERGED had around 300,100 observations in the output. I found that there were duplicates in the MERGED table by using another code such as this, to eliminate them. The output did show the 100 dupes:

proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;

Now, this works fine for now. Here are my questions:

1. I made sure that there are no duplicates in both the left and right table. So why is it still producing duplicates?

2. When I ran code block 1 with another set of table, I still had duplicates. When I ran code block 2, it did show the number of duplicates but I am still not getting the same amount of observations in the original table and the output table.

Any insights on this problem?

Re: Join produces duplicates

Astounding — Mon, 05 Jul 2021 13:44:14 GMT

First step: check again. It sure looks like you have duplicates in the utilization table.

Re: Join produces duplicates

rapt1 — Mon, 05 Jul 2021 13:49:59 GMT

I have checked and there were no duplicates. I also checked the userlist and there were none.

I forgot to mention that I was using left join because I had to capture those users in userlist whether they had usage/price values in utilization. So I was only expecting 300,000 users in the merged table.

Re: Join produces duplicates

Tom — Mon, 05 Jul 2021 13:59:35 GMT

How did you test for duplicates? You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable.

Re: Join produces duplicates

rapt1 — Mon, 05 Jul 2021 14:15:09 GMT

I used the same one for checking for dupes:

proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;

"You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."

That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.

Re: Join produces duplicates

Tom — Mon, 05 Jul 2021 15:17:42 GMT

Change the BY statement to include just the client_id variable.

Re: Join produces duplicates

Reeza — Mon, 05 Jul 2021 15:12:19 GMT

Show your full code. Maybe you referred to the unsorted data set somewhere accidentally? Or maybe you're using the wrong variable or specification somewhere.

If you get multiples you MUST have duplicates somewhere, SAS doesn't generate matches out of thin air so its in your data somewhere. We can't see your data or code so we can't tell you where you may be going wrong.

@rapt1 wrote:

I used the same one for checking for dupes:
proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;
"You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."

That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.

Re: Join produces duplicates

rapt1 — Tue, 06 Jul 2021 10:18:58 GMT

Sharing the full code. I am using EG and I try to run these by code block,

First, I made copies of the tables and checked for dupes at the same time, One for the userlist, which has 300,000 rows:

proc sql;
create table WORK.userlist as
select *
from filelocation1;
quit;

proc sort data=WORK.userlist nodupkey dupout=userdups_checking;
by _all_;
run;

There are 0 observations that came out of userdups_checking. In another process flow, I made a table for the utilization (has 1 million rows). There are also no dupes as a result:

proc sql;
create table WORK.utilization as
select *
from filelocation2;
quit;

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by _all_;
run;

Then I merged the table in another process flow:

proc sql;
create table WORK.MERGED as
select t1.*,
	t2.usage,
	t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;

proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;

This is where I am getting the duplicates at this point (300,100).

Let me know if there's anything else I need to share

Re: Join produces duplicates

Reeza — Mon, 05 Jul 2021 17:03:30 GMT

You're checking for duplicates across all your variables not the variables you're joining on so how you're checking for duplicates is wrong.

You're checking for full row duplicates ie
ClientID Date Name
1 2020/01/01 Dave
1 2021/01/01 Dave

These would not be duplicates since you're checking for _all_ variables in terms of duplicates. Then if you join by ClientID you'll still get duplicates because you have duplicate IDs.

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by clientID;
run;

Re: Join produces duplicates

rapt1 — Tue, 06 Jul 2021 10:18:24 GMT

This makes sense. I tried it and I am getting the output I am looking for. Thank you so much guys!