<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Join produces duplicates in New SAS User</title>
    <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752101#M29739</link>
    <description>&lt;P&gt;Change the BY statement to include just the client_id variable.&lt;/P&gt;</description>
    <pubDate>Mon, 05 Jul 2021 15:17:42 GMT</pubDate>
    <dc:creator>Tom</dc:creator>
    <dc:date>2021-07-05T15:17:42Z</dc:date>
    <item>
      <title>Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752086#M29734</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am trying to join tables again using the code below. userlist has 300,000 observations while utilization has about 1,000,000 observations:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
create table WORK.MERGED as
select t1.*,
	t2.usage,
	t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, MERGED had around 300,100 observations in the output. I found that there were duplicates in the MERGED table by using another code such as this, to eliminate them. The output did show the 100 dupes:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Now, this works fine for now. Here are my questions:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. I made sure that there are no duplicates in both the left and right table. So why is it still producing duplicates?&lt;/P&gt;&lt;P&gt;2. When I ran code block 1 with another set of table, I still had duplicates. When I ran code block 2, it did show the number of duplicates but I am still not getting the same amount of observations in the original table and the output table.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any insights on this problem?&lt;/P&gt;</description>
      <pubDate>Mon, 05 Jul 2021 13:26:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752086#M29734</guid>
      <dc:creator>rapt1</dc:creator>
      <dc:date>2021-07-05T13:26:49Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752089#M29735</link>
      <description>&lt;P&gt;First step:&amp;nbsp; check again.&amp;nbsp; It sure looks like you have duplicates in the utilization table.&lt;/P&gt;</description>
      <pubDate>Mon, 05 Jul 2021 13:44:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752089#M29735</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2021-07-05T13:44:14Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752091#M29736</link>
      <description>&lt;P&gt;I have checked and there were no duplicates. I also checked the userlist and there were none.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I forgot to mention that I was using left join because I had to capture those users in userlist whether they had usage/price values in utilization. So I was only expecting 300,000 users in the merged table.&lt;/P&gt;</description>
      <pubDate>Mon, 05 Jul 2021 13:49:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752091#M29736</guid>
      <dc:creator>rapt1</dc:creator>
      <dc:date>2021-07-05T13:49:59Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752092#M29737</link>
      <description>&lt;P&gt;How did you test for duplicates?&amp;nbsp; You could have distinct observations in UTILIZATION&amp;nbsp;but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable.&lt;/P&gt;</description>
      <pubDate>Mon, 05 Jul 2021 13:59:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752092#M29737</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2021-07-05T13:59:35Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752094#M29738</link>
      <description>&lt;P&gt;I used the same one for checking for dupes:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;SPAN&gt;"You could have distinct observations in UTILIZATION&amp;nbsp;but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 05 Jul 2021 14:15:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752094#M29738</guid>
      <dc:creator>rapt1</dc:creator>
      <dc:date>2021-07-05T14:15:09Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752101#M29739</link>
      <description>&lt;P&gt;Change the BY statement to include just the client_id variable.&lt;/P&gt;</description>
      <pubDate>Mon, 05 Jul 2021 15:17:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752101#M29739</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2021-07-05T15:17:42Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752103#M29740</link>
      <description>&lt;P&gt;Show your full code. Maybe you referred to the unsorted data set somewhere accidentally? Or maybe you're using the wrong variable or specification somewhere.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you get multiples you MUST have duplicates somewhere, SAS doesn't generate matches out of thin air so its in your data somewhere. We can't see your data or code so we can't tell you where you may be going wrong.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/379057"&gt;@rapt1&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;I used the same one for checking for dupes:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=WORK.utilization nodupkey dupout=dups_checking;
by _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;"You could have distinct observations in UTILIZATION&amp;nbsp;but it could still contain multiple observations for some of the values of CLIENTID because the observations differ in some other variable."&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;That might be the case. If that's so, how can we check for multiple CLIENTIDs? So I can see which variables they differ.&lt;/SPAN&gt;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 05 Jul 2021 15:12:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752103#M29740</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2021-07-05T15:12:19Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752115#M29741</link>
      <description>&lt;P&gt;Sharing the full code. I am using EG and I try to run these by code block,&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;First, I made copies of the tables and checked for dupes at the same time, One for the userlist, which has 300,000 rows:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
create table WORK.userlist as
select *
from filelocation1;
quit;

proc sort data=WORK.userlist nodupkey dupout=userdups_checking;
by _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;There are 0 observations that came out of userdups_checking. In another process flow, I made a table for the utilization (has 1 million rows). There are also no dupes as a result:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
create table WORK.utilization as
select *
from filelocation2;
quit;

proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;
by _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then I merged the table in another process flow:&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
create table WORK.MERGED as
select t1.*,
	t2.usage,
	t2.price
from WORK.userlist t1
left join WORK.utilization t2
on t1.ID=t2.clientID;
quit;

proc sort data=WORK.MERGED nodupkey dupout=dups_checking;
by _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This is where I am getting the duplicates at this point (300,100).&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Let me know if there's anything else I need to share&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 06 Jul 2021 10:18:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752115#M29741</guid>
      <dc:creator>rapt1</dc:creator>
      <dc:date>2021-07-06T10:18:58Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752127#M29742</link>
      <description>You're checking for duplicates across all your variables not the variables you're joining on so how you're checking for duplicates is wrong.&lt;BR /&gt;&lt;BR /&gt;You're checking for full row duplicates ie&lt;BR /&gt;ClientID Date Name&lt;BR /&gt;1 2020/01/01 Dave&lt;BR /&gt;1 2021/01/01 Dave&lt;BR /&gt;&lt;BR /&gt;These would not be duplicates since you're checking for _all_ variables in terms of duplicates. Then if you join by ClientID you'll still get duplicates because you have duplicate IDs.&lt;BR /&gt;&lt;BR /&gt;proc sort data=WORK.utilization nodupkey dupout=ussagedups_checking;&lt;BR /&gt;by clientID;&lt;BR /&gt;run;</description>
      <pubDate>Mon, 05 Jul 2021 17:03:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752127#M29742</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2021-07-05T17:03:30Z</dc:date>
    </item>
    <item>
      <title>Re: Join produces duplicates</title>
      <link>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752254#M29749</link>
      <description>&lt;P&gt;This makes sense. I tried it and I am getting the output I am looking for. Thank you so much guys!&lt;/P&gt;</description>
      <pubDate>Tue, 06 Jul 2021 10:18:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/New-SAS-User/Join-produces-duplicates/m-p/752254#M29749</guid>
      <dc:creator>rapt1</dc:creator>
      <dc:date>2021-07-06T10:18:24Z</dc:date>
    </item>
  </channel>
</rss>

