<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Finding the variable causing duplication in my dataset in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692037#M210726</link>
    <description>&lt;P&gt;A short program can do this, as long as you occasionally have two observations per ARRANGEMENT_KEY.&amp;nbsp; It's trickier if you have three or more.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;First, split the data.&amp;nbsp; Remove all observations where ARRANGEMENT_KEY is unique.&amp;nbsp; Then split the observations where there are duplicates.&amp;nbsp; For example:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have;
   by arrangement_key;
run;
data dup1 dup2;
   set have;
   by arrangement_key;
   if first.arrangement_key and last.arrangement_key then delete;
   if first.arrangement_key then output dup1;
   else output dup2;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;So all original observations with a unique ARRANGEMENT_KEY have been deleted by the DELETE statement.&amp;nbsp; Of those that remain, the first observation for ARRANGEMENT_KEY is in DUP1 and the second (hopefully there are only two) are in DUP2.&amp;nbsp; Then just run PROC COMPARE to compare the two data sets.&amp;nbsp; The software will do the work for you.&lt;/P&gt;</description>
    <pubDate>Fri, 16 Oct 2020 03:08:06 GMT</pubDate>
    <dc:creator>Astounding</dc:creator>
    <dc:date>2020-10-16T03:08:06Z</dc:date>
    <item>
      <title>Finding the variable causing duplication in my dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/691997#M210711</link>
      <description>&lt;P&gt;I have a dataset that is supposed to be unique by my variable called arrangement_key, however on occasion its not unique, that's easy to identify the records causing duplication, but scanning all the variables in those duplicate records trying to find the culprit variable(s) is a pain, my dataset has 500 columns, is there an easy way to spit out a dataset that contains the culprit variable(s) that are causing the duplication? thank you in advance for your help.&lt;/P&gt;</description>
      <pubDate>Thu, 15 Oct 2020 23:57:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/691997#M210711</guid>
      <dc:creator>jimbobob</dc:creator>
      <dc:date>2020-10-15T23:57:25Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692000#M210713</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'm confused, if arrangement_key is supposed to be unique, but sometimes it isn't, what do yo mean by scanning all the other variables to find the culprit.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If your data are sorted by arrangement_key, you could split out the duplicates with a step like:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data
  want   /* no duplicates */ 
  dups   /* all duplicates */
  ;
  set have ;
  by arrangement_key ;
  if NOT (first.arrangement_key and last.arrangement_key)  then output dups ;
  else output want ;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;After identifying the duplicate records, what do you mean about scanning variables to find culprits?&lt;/P&gt;</description>
      <pubDate>Fri, 16 Oct 2020 00:18:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692000#M210713</guid>
      <dc:creator>Quentin</dc:creator>
      <dc:date>2020-10-16T00:18:47Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692037#M210726</link>
      <description>&lt;P&gt;A short program can do this, as long as you occasionally have two observations per ARRANGEMENT_KEY.&amp;nbsp; It's trickier if you have three or more.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;First, split the data.&amp;nbsp; Remove all observations where ARRANGEMENT_KEY is unique.&amp;nbsp; Then split the observations where there are duplicates.&amp;nbsp; For example:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=have;
   by arrangement_key;
run;
data dup1 dup2;
   set have;
   by arrangement_key;
   if first.arrangement_key and last.arrangement_key then delete;
   if first.arrangement_key then output dup1;
   else output dup2;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;So all original observations with a unique ARRANGEMENT_KEY have been deleted by the DELETE statement.&amp;nbsp; Of those that remain, the first observation for ARRANGEMENT_KEY is in DUP1 and the second (hopefully there are only two) are in DUP2.&amp;nbsp; Then just run PROC COMPARE to compare the two data sets.&amp;nbsp; The software will do the work for you.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Oct 2020 03:08:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692037#M210726</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2020-10-16T03:08:06Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692047#M210728</link>
      <description>&lt;P&gt;I'm assuming that you want to identify the culprit variables that's causing the duplication &lt;EM&gt;so that you can fix them&lt;/EM&gt;. Below code will create the dataset you wanted, but it assumes there is max 1 additional duplicate. You may have to run this multiple times, or macrotize it however you want, if your data contains more than 1 duplicate observations for each&amp;nbsp;&lt;SPAN&gt;arrangement_key.&lt;/SPAN&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data a1 a2;
  set a;
  by &lt;SPAN&gt;arrangement_key&lt;/SPAN&gt;;
  if first.&lt;SPAN&gt;arrangement_key &lt;/SPAN&gt;ne last.&lt;SPAN&gt;arrangement_key&lt;/SPAN&gt;;
  if first.&lt;SPAN&gt;arrangement_key &lt;/SPAN&gt;then output a1;
  else output a2;
run;

proc compare b=a1 c=a2 outstats=diffs noprint;
  id &lt;SPAN&gt;arrangement_key&lt;/SPAN&gt;;
run;

data diffs (keep=_VAR_);
  set diffs;
  where _TYPE_='NDIF' and _BASE_ &amp;gt; 0;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 16 Oct 2020 05:35:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692047#M210728</guid>
      <dc:creator>vellad</dc:creator>
      <dc:date>2020-10-16T05:35:21Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my data set</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692049#M210729</link>
      <description>&lt;P&gt;Like this?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data HAVE;
  array A[500] A1-A500;
  do ID=1 to 10;
    if ID=5 then do; A55 = 2; output; end;
    output;
  end;
run;

data _null_;
  set HAVE;
  by ID;
  array ARR A1--A500;                        * index variables into an array;
  if first.ID &amp;amp; ^last.ID then do I=1 to 500; * duplicate found, fill lag values;
    ARR[I]=lag(ARR[I]);
  end;
  if ^first.ID;                              * keep repeated observation;
  do I=1 to dim(ARR);                        * seek different value(s)  ;
    if ARR[I] ne lag(ARR[I]) then putlog _N_= ID= ARR[I]=;
  end;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;_N_=6 ID=5 A55=2&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Replace&lt;/P&gt;
&lt;P&gt;&lt;CODE class=" language-sas"&gt;  array ARR A1--A500;
&lt;/CODE&gt;&lt;/P&gt;
&lt;P&gt;with&lt;/P&gt;
&lt;P&gt;&lt;CODE class=" language-sas"&gt;  array ARR &lt;EM&gt;2 variable names from your table (the double dash makes that easy, look it up)&lt;/EM&gt; ;
&lt;/CODE&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 16 Oct 2020 05:52:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692049#M210729</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-16T05:52:45Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692284#M210863</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13255"&gt;@jimbobob&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;I have a dataset that is supposed to be unique by my variable called arrangement_key, however on occasion its not unique, that's easy to identify the records causing duplication, but scanning all the variables in those duplicate records trying to find the culprit variable(s) is a pain, my dataset has 500 columns, is there an easy way to spit out a dataset that contains the culprit variable(s) that are causing the duplication? thank you in advance for your help.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;You don't say what process is causing the duplicate rows.&amp;nbsp; Is it a single table you're concerned about or the result of an SQL join (undesired Cartesian Product) across two or more tables?&amp;nbsp; If you KNOW that your table has non-unique keys, and is participating in a join, you should address this first.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here is SQL code that will identify duplicate keys, although it is functionally equivalent to SAS code previously posted:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
   create table want as
   select key1, key2, key3, count(0) as count
   from have
   group by key1, key2, key3
   having count(0) &amp;gt; 1;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You can filter on the count column, looking for rows with &amp;gt; 2, &amp;gt; 3, &amp;gt; 4, etc. duplicates.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;At this point I usually just "eyeball" the data, scrolling right and skimming for satellite (non-key) columns where the data changes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But, again, your first step is to ensure your source table keys are, in fact, keys.&amp;nbsp;&amp;nbsp;&lt;A href="https://www.google.com/search?q=what+is+a+key+column" target="_blank"&gt;https://www.google.com/search?q=what+is+a+key+column&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 17 Oct 2020 07:11:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692284#M210863</guid>
      <dc:creator>ScottBass</dc:creator>
      <dc:date>2020-10-17T07:11:31Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692287#M210866</link>
      <description>&lt;P&gt;You only need to scan for duplicates of the key variable(s), the others are irrelevant for the join.&lt;/P&gt;
&lt;P&gt;Please post the complete log of the code that causes the duplicates; use the &amp;lt;/&amp;gt; button to post it.&lt;/P&gt;</description>
      <pubDate>Sat, 17 Oct 2020 08:25:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692287#M210866</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-17T08:25:15Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my data set</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692402#M210935</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13255"&gt;@jimbobob&lt;/a&gt;&amp;nbsp;&amp;nbsp;Has this worked?&lt;/P&gt;</description>
      <pubDate>Sun, 18 Oct 2020 21:01:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692402#M210935</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-18T21:01:39Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692646#M211032</link>
      <description>&lt;P&gt;Yes it is supposed to be unique, I had to create the key from concatenation of other fields from our data warehouse, however there are occasions due to human error, where we end up getting two records that get pulled into our sas dataset has the same arrangement key,&amp;nbsp; we then have to scan all the other fields seeing why we have two records, sometimes it our split rated indicator, show a split rating when there is not a split rating on the record, so we manually remove the records showing "Y". Other times it due miscoded zip code, or reg rating or pd, or any number of issues. Unfortunately, I'm not in charge of our data warehouse and how they pull the data together, but, we identifies these issues and pass the info on to IT. But we still need to remove and remedy the column causing the duplicates.&lt;/P&gt;</description>
      <pubDate>Mon, 19 Oct 2020 18:01:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692646#M211032</guid>
      <dc:creator>jimbobob</dc:creator>
      <dc:date>2020-10-19T18:01:09Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my data set</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692654#M211036</link>
      <description>&lt;P&gt;Perfect, thanks Chris&lt;/P&gt;</description>
      <pubDate>Mon, 19 Oct 2020 18:13:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/692654#M211036</guid>
      <dc:creator>jimbobob</dc:creator>
      <dc:date>2020-10-19T18:13:25Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my data set</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/693054#M211262</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/16961"&gt;@ChrisNZ&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Like this?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data HAVE;
  array A[500] A1-A500;
  do ID=1 to 10;
    if ID=5 then do; A55 = 2; output; end;
    output;
  end;
run;

data _null_;
  set HAVE;
  by ID;
  array ARR A1--A500;                        * index variables into an array;
  if first.ID &amp;amp; ^last.ID then do I=1 to 500; * duplicate found, fill lag values;
    ARR[I]=lag(ARR[I]);
  end;
  if ^first.ID;                              * keep repeated observation;
  do I=1 to dim(ARR);                        * seek different value(s)  ;
    if ARR[I] ne lag(ARR[I]) then putlog _N_= ID= ARR[I]=;
  end;
run;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;_N_=6 ID=5 A55=2&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Replace&lt;/P&gt;
&lt;P&gt;&lt;CODE class=" language-sas"&gt;  array ARR A1--A500;
&lt;/CODE&gt;&lt;/P&gt;
&lt;P&gt;with&lt;/P&gt;
&lt;P&gt;&lt;CODE class=" language-sas"&gt;  array ARR &lt;EM&gt;2 variable names from your table (the double dash makes that easy, look it up)&lt;/EM&gt; ;
&lt;/CODE&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Untested, just a "thought experiment"...&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Would:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sort data=foo out=bar dupout=dups noduprecs;
   by key1 key2 key3 _all_;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;also work?&lt;/P&gt;</description>
      <pubDate>Wed, 21 Oct 2020 02:20:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/693054#M211262</guid>
      <dc:creator>ScottBass</dc:creator>
      <dc:date>2020-10-21T02:20:46Z</dc:date>
    </item>
    <item>
      <title>Re: Finding the variable causing duplication in my data set</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/693071#M211275</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/15043"&gt;@ScottBass&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It should work, but proc sort is expensive, and the question is specifically to find which variables cause the duplication, since there are hundreds to review.&lt;/P&gt;
&lt;P&gt;PS sorry about not replying, will reply soon. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 21 Oct 2020 05:04:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Finding-the-variable-causing-duplication-in-my-dataset/m-p/693071#M211275</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-21T05:04:01Z</dc:date>
    </item>
  </channel>
</rss>

