<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Need help to create a duplicate record report in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954839#M372920</link>
    <description>&lt;P&gt;Do you need the actual cardinality of each column, or do you just need to determine if there are ANY duplicates for each column?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If it's the latter, then you can program a data step with a hash for each column.&amp;nbsp; When a duplicate is found, you can stop profiling that column and delete the corresponding hash object, freeing up memory for use on the other columns.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Of course, if there are very few columns with a duplicate, you'll probably end up exhausting memory before processing the entire data set.&amp;nbsp; Then you'd have to loop over subsets of columns that don't exhaust memory.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 31 Dec 2024 22:36:09 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2024-12-31T22:36:09Z</dc:date>
    <item>
      <title>Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954819#M372903</link>
      <description>&lt;P&gt;I need to create a dup record report. I have a code, but it's not working as the data is huge (more than 100mil). I have a dataset with more than 200 columns and 100+ mil records. This is what I'm trying to do -&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;column_names&lt;/TD&gt;&lt;TD&gt;duplicate_values&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;a&lt;/TD&gt;&lt;TD&gt;No&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;b&lt;/TD&gt;&lt;TD&gt;Yes&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;c&amp;nbsp;&lt;/TD&gt;&lt;TD&gt;Yes&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;d&lt;/TD&gt;&lt;TD&gt;N/A&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;e&lt;/TD&gt;&lt;TD&gt;No&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;f&lt;/TD&gt;&lt;TD&gt;No&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;g&lt;/TD&gt;&lt;TD&gt;No&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;h&lt;/TD&gt;&lt;TD&gt;Yes&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;i&lt;/TD&gt;&lt;TD&gt;N/A&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;j&lt;/TD&gt;&lt;TD&gt;Yes&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here, it there are duplicate values in a column, then it's "Yes". If there are no duplicate values at all, then "No", if the value is unique (one value for all the records), or the values are 100% missing for that column, then "N/A".&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;As I said, I have a code to do the job, but it only works if the record count is less than a million. This is the step where it runs out of memory -&lt;/P&gt;&lt;P class="lia-align-left lia-indent-padding-left-30px"&gt;&lt;EM&gt;&lt;STRONG&gt;proc freq data = dup nlevels;&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P class="lia-align-left lia-indent-padding-left-30px"&gt;&lt;EM&gt;&lt;STRONG&gt;ods output nlevels = duplevels;&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P class="lia-align-left lia-indent-padding-left-30px"&gt;&lt;EM&gt;&lt;STRONG&gt;run;&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any help would appreciated.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 15:53:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954819#M372903</guid>
      <dc:creator>rodelabrishti</dc:creator>
      <dc:date>2024-12-31T15:53:31Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954820#M372904</link>
      <description>&lt;P&gt;Is it running out of memory because there are too many distinct levels?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Or is it running out of memory because you told it to print so much information? To prevent it printing the frequency tables for every variable add a tables statement with the noprint option.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;tables _all_ / noprint;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 31 Dec 2024 16:07:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954820#M372904</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2024-12-31T16:07:03Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954823#M372907</link>
      <description>It says the result is too big and will take time and resources. even after adding noprint.</description>
      <pubDate>Tue, 31 Dec 2024 16:40:42 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954823#M372907</guid>
      <dc:creator>rodelabrishti</dc:creator>
      <dc:date>2024-12-31T16:40:42Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954826#M372910</link>
      <description>&lt;P&gt;Please tell us the EXACT word-for-word unedited error message. "Too big" is too vague.&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 16:51:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954826#M372910</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2024-12-31T16:51:39Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954828#M372912</link>
      <description>&lt;P&gt;How should this cases be handled?&lt;/P&gt;
&lt;P&gt;Some missing values, but all non-missing values are identical&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 18:42:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954828#M372912</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2024-12-31T18:42:54Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954834#M372917</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/31077"&gt;@rodelabrishti&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;It says the result is too big and will take time and resources. even after adding noprint.&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;That sounds like the message you get when the LOG or RESULTS output is large.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Make sure you are not doing something else in the same block of code that would produce a lot of output.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you are unsure you might try running with system option OBS set to a small number so you can actually see the results and figure out where they are coming from in the code.&amp;nbsp; But don't do that if the code is overwriting any existing datasets as that would cause the loss of data.&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 20:07:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954834#M372917</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2024-12-31T20:07:57Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954835#M372918</link>
      <description>&lt;P&gt;If you run PROC FREQ just on one variable, will it succeed?&amp;nbsp; If so, then you could write a macro to iterate over the list of variables, calculating the cardinality of each variable, and then append the results.&lt;BR /&gt;&lt;BR /&gt;If you have a LOT of memory, you could try using a hash table to look for duplicates.&amp;nbsp; You might have to do it one variable at a time.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you&amp;nbsp; have Viya, you could use PROC CARDINALITY.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'd be curious, if you take a random sample of, say, 1 million rows, how many of the 200 columns have no duplicate values?&amp;nbsp; Hopefully it would be a small number.&amp;nbsp; So you could take the sample, run PROC FREQ NLEVELS on the sample, and then hopefully be left with a small number of variables that you have to brute force (running proc freq one at a time, or hash table one at a time, or even heck proc sort nodupkey one at a time...)&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 20:20:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954835#M372918</guid>
      <dc:creator>Quentin</dc:creator>
      <dc:date>2024-12-31T20:20:15Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954839#M372920</link>
      <description>&lt;P&gt;Do you need the actual cardinality of each column, or do you just need to determine if there are ANY duplicates for each column?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If it's the latter, then you can program a data step with a hash for each column.&amp;nbsp; When a duplicate is found, you can stop profiling that column and delete the corresponding hash object, freeing up memory for use on the other columns.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Of course, if there are very few columns with a duplicate, you'll probably end up exhausting memory before processing the entire data set.&amp;nbsp; Then you'd have to loop over subsets of columns that don't exhaust memory.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 31 Dec 2024 22:36:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954839#M372920</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2024-12-31T22:36:09Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954853#M372927</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have;
 set sashelp.heart;
 call missing(height);
 weight=2;
 x=rand('uniform');
 keep status sex height weight x;
run;



proc transpose data=have(obs=0) out=temp;
var _all_;
run;
filename x temp;
data _null_;
 set temp end=last;
 file x;
 if _n_=1 then put 'proc sql;create table want as';
 put 'select "' _name_ '" as column_names,
case when count(distinct ' _name_ ') in (0 1) then "N/A" 
     when count(distinct ' _name_ ') = count(' _name_ ') then "No" 
     else "Yes" end as duplicate_values from have';
 if last then put ';quit;';
  else put 'union';
run;
%include x/source2;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;P.S. You need to take into account of many scenarios (like: Mixed with missing and non-missing value).&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 01 Jan 2025 03:19:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954853#M372927</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2025-01-01T03:19:45Z</dc:date>
    </item>
    <item>
      <title>Re: Need help to create a duplicate record report</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954989#M372971</link>
      <description>&lt;P&gt;The below generates a dataset, one row per original columns, with the variable name, and a "Yes" or "No " value for DUPE_FOUND.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It makes a hash table for each column, which allows checking for duplicates.&amp;nbsp; But once a duplicate is found for a given column, three things happen: (1) DUPE_FOUND goes from "No " to "Yes", (2) that column is no longer examined, and (3) its hash object is deleted, freeing up memory.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want (keep=varname dupe_found);
  set have  end=end_of_have;

  array cols_to_check {*}  _numeric_;

  declare hash h ;   /*Reserve the name H for hash objects, not a variable name*/
  length varname $32   dupe_found $3;

  if _n_=1 then do;
    declare hash dupecheck (ordered:'A');
      dupecheck.definekey('varname');
      dupecheck.definedata ('varname','dupe_found','h');
      dupecheck.definedone();
    declare hiter hd ('dupecheck');   

    dupe_found='No ';

    do v=1 to dim(cols_to_check);   /*Make a hash for each column*/
      varname=vname(cols_to_check{v});
      h=_new_ hash ();
        h.definekey(varname);
        h.definedata(varname);
        h.definedone();
      dupecheck.add();       /* Add this hash &amp;amp; varname as dataitem to dupecheck*/
    end;
  end;    

  do while(hd.next()=0);
    if dupe_found='Yes' then continue;
    if h.check()^=0 then h.add();
    else do;
      dupe_found='Yes';
      dupecheck.replace();
      h.delete();
    end;
  end;

  if end_of_have then do while(hd.next()=0);
    output;
  end;
run;
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 03 Jan 2025 03:53:03 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Need-help-to-create-a-duplicate-record-report/m-p/954989#M372971</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2025-01-03T03:53:03Z</dc:date>
    </item>
  </channel>
</rss>

