Hi everyone,
I am trying to join tables again using the code below. userlist has 300,000 observations while utilization has about 1,000,000 observations:
proc sql;
create table WORK.MERGED as
select t1.*,
t2.usage,
t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;
However, MERGED had around 300,100 observations in the output. I found that there were duplicates in the MERGED table by using another code such as this, to eliminate them. The output did show the 100 dupes:
proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;
Now, this works fine for now. Here are my questions:
1. I made sure that there are no duplicates in both the left and right table. So why is it still producing duplicates?
2. When I ran code block 1 with another set of table, I still had duplicates. When I ran code block 2, it did show the number of duplicates but I am still not getting the same amount of observations in the original table and the output table.
Any insights on this problem?
First step: check again. It sure looks like you have duplicates in the utilization table.
I have checked and there were no duplicates. I also checked the userlist and there were none.
I forgot to mention that I was using left join because I had to capture those users in userlist whether they had usage/price values in utilization. So I was only expecting 300,000 users in the merged table.
How did you test for duplicates? You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable.
I used the same one for checking for dupes:
proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;
"You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."
That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.
Change the BY statement to include just the client_id variable.
Show your full code. Maybe you referred to the unsorted data set somewhere accidentally? Or maybe you're using the wrong variable or specification somewhere.
If you get multiples you MUST have duplicates somewhere, SAS doesn't generate matches out of thin air so its in your data somewhere. We can't see your data or code so we can't tell you where you may be going wrong.
@rapt1 wrote:
I used the same one for checking for dupes:
proc sort data=WORK.utilization nodupkey dupout=dups_checking; by _all_; run;
"You could have distinct observations in UTILIZATION but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."
That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.
Sharing the full code. I am using EG and I try to run these by code block,
First, I made copies of the tables and checked for dupes at the same time, One for the userlist, which has 300,000 rows:
proc sql;
create table WORK.userlist as
select *
from filelocation1;
quit;
proc sort data=WORK.userlist nodupkey dupout=userdups_checking;
by _all_;
run;
There are 0 observations that came out of userdups_checking. In another process flow, I made a table for the utilization (has 1 million rows). There are also no dupes as a result:
proc sql;
create table WORK.utilization as
select *
from filelocation2;
quit;
proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by _all_;
run;
Then I merged the table in another process flow:
proc sql;
create table WORK.MERGED as
select t1.*,
t2.usage,
t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;
proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;
This is where I am getting the duplicates at this point (300,100).
Let me know if there's anything else I need to share
This makes sense. I tried it and I am getting the output I am looking for. Thank you so much guys!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.