BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
rapt1
Obsidian | Level 7

Hi everyone,

 

I am trying to join tables again using the code below. userlist has 300,000 observations while utilization has about 1,000,000 observations:

 

proc sql;
create table WORK.MERGED as
select t1.*,
	t2.usage,
	t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;

However, MERGED had around 300,100 observations in the output. I found that there were duplicates in the MERGED table by using another code such as this, to eliminate them. The output did show the 100 dupes:

 

proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;

Now, this works fine for now. Here are my questions:

 

1. I made sure that there are no duplicates in both the left and right table. So why is it still producing duplicates?

2. When I ran code block 1 with another set of table, I still had duplicates. When I ran code block 2, it did show the number of duplicates but I am still not getting the same amount of observations in the original table and the output table.

 

Any insights on this problem?

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
You're checking for duplicates across all your variables not the variables you're joining on so how you're checking for duplicates is wrong.

You're checking for full row duplicates ie
ClientID Date Name
1 2020/01/01 Dave
1 2021/01/01 Dave

These would not be duplicates since you're checking for _all_ variables in terms of duplicates. Then if you join by ClientID you'll still get duplicates because you have duplicate IDs.

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by clientID;
run;

View solution in original post

9 REPLIES 9
Astounding
PROC Star

First step:  check again.  It sure looks like you have duplicates in the utilization table.

rapt1
Obsidian | Level 7

I have checked and there were no duplicates. I also checked the userlist and there were none. 

 

I forgot to mention that I was using left join because I had to capture those users in userlist whether they had usage/price values in utilization. So I was only expecting 300,000 users in the merged table.

Tom
Super User Tom
Super User

How did you test for duplicates?  You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable.

rapt1
Obsidian | Level 7

I used the same one for checking for dupes:

 

proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;

"You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."

 

That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.

Tom
Super User Tom
Super User

Change the BY statement to include just the client_id variable.

Reeza
Super User

Show your full code. Maybe you referred to the unsorted data set somewhere accidentally? Or maybe you're using the wrong variable or specification somewhere. 

 

If you get multiples you MUST have duplicates somewhere, SAS doesn't generate matches out of thin air so its in your data somewhere. We can't see your data or code so we can't tell you where you may be going wrong.

 


@rapt1 wrote:

I used the same one for checking for dupes:

 

proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;

"You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."

 

That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.


 

rapt1
Obsidian | Level 7

Sharing the full code. I am using EG and I try to run these by code block, 

 

First, I made copies of the tables and checked for dupes at the same time, One for the userlist, which has 300,000 rows:

proc sql;
create table WORK.userlist as
select *
from filelocation1;
quit;

proc sort data=WORK.userlist nodupkey dupout=userdups_checking;
by _all_;
run;

There are 0 observations that came out of userdups_checking. In another process flow, I made a table for the utilization (has 1 million rows). There are also no dupes as a result:

proc sql;
create table WORK.utilization as
select *
from filelocation2;
quit;

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by _all_;
run;

 

Then I merged the table in another process flow:

proc sql;
create table WORK.MERGED as
select t1.*,
	t2.usage,
	t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;

proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;

This is where I am getting the duplicates at this point (300,100). 

 

Let me know if there's anything else I need to share

 

 

Reeza
Super User
You're checking for duplicates across all your variables not the variables you're joining on so how you're checking for duplicates is wrong.

You're checking for full row duplicates ie
ClientID Date Name
1 2020/01/01 Dave
1 2021/01/01 Dave

These would not be duplicates since you're checking for _all_ variables in terms of duplicates. Then if you join by ClientID you'll still get duplicates because you have duplicate IDs.

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by clientID;
run;
rapt1
Obsidian | Level 7

This makes sense. I tried it and I am getting the output I am looking for. Thank you so much guys!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 1017 views
  • 4 likes
  • 4 in conversation