I need to create a duplicate-record report. I have code, but it's not working because the data is huge. I have a dataset with more than 200 columns and 100+ million records. This is what I'm trying to do -
column_names | duplicate_values
a            | No
b            | Yes
c            | Yes
d            | N/A
e            | No
f            | No
g            | No
h            | Yes
i            | N/A
j            | Yes
Here, if there are duplicate values in a column, then it's "Yes". If there are no duplicate values at all, then "No". If the column is single-valued (one value for all the records), or the values are 100% missing for that column, then "N/A".
As I said, I have a code to do the job, but it only works if the record count is less than a million. This is the step where it runs out of memory -
proc freq data=dup nlevels;
   ods output nlevels=duplevels;
run;
Any help would be appreciated.
Thanks!
Is it running out of memory because there are too many distinct levels?
Or is it running out of memory because you told it to print so much information? To prevent it from printing the frequency tables for every variable, add a TABLES statement with the NOPRINT option.
tables _all_ / noprint;
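Putting that together with your original step, a minimal sketch (assuming your dataset is named DUP and you want the summary in DUPLEVELS, as in your post):

proc freq data=dup nlevels;
   tables _all_ / noprint;          /* suppress the per-variable frequency tables */
   ods output nlevels=duplevels;    /* keep only the NLevels summary */
run;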
Please tell us the EXACT word-for-word unedited error message. "Too big" is too vague.
@rodelabrishti wrote:
It says the result is too big and will take time and resources, even after adding NOPRINT.
That sounds like the message you get when the LOG or RESULTS output is large.
Make sure you are not doing something else in the same block of code that would produce a lot of output.
If you are unsure, you might try running with the system option OBS set to a small number so you can actually see the results and figure out where in the code they are coming from. But don't do that if the code overwrites any existing datasets, as that would cause data loss.
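For example, a minimal sketch of that approach (the step in the middle is a placeholder):

options obs=1000;    /* read at most 1000 observations from each input */
/* ... rerun the step that produced the message ... */
options obs=max;     /* restore full processing afterwards */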
How should this case be handled?
Some missing values, but all non-missing values are identical
If you run PROC FREQ on just one variable, will it succeed? If so, you could write a macro that iterates over the list of variables, calculates the cardinality of each, and appends the results.
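A rough sketch of such a macro (untested; the macro name, the DS=/OUT= parameters, and the _VARS/_ONE work datasets are invented here):

%macro nlevels_by_var(ds=, out=duplevels);
   /* get the list of variable names in &ds */
   proc contents data=&ds out=_vars(keep=name) noprint;
   run;
   data _null_;
      set _vars end=last;
      call symputx(cats('v', _n_), name);
      if last then call symputx('nvars', _n_);
   run;
   /* start fresh (harmless log message if &out does not exist yet) */
   proc datasets lib=work nolist;
      delete &out;
   quit;
   /* one small PROC FREQ per variable, appending each NLEVELS row */
   %do i=1 %to &nvars;
      proc freq data=&ds nlevels;
         tables &&v&i / noprint;
         ods output nlevels=_one;
      run;
      proc append base=&out data=_one force;
      run;
   %end;
%mend nlevels_by_var;

%nlevels_by_var(ds=dup, out=duplevels)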
If you have a LOT of memory, you could try using a hash table to look for duplicates. You might have to do it one variable at a time.
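For a single column, that could look something like this (a sketch assuming a numeric column X in a dataset named HAVE; it stops at the first duplicate, so only duplicate-free columns require a full pass):

data _null_;
   set have end=done;
   if _n_=1 then do;
      declare hash h();
      h.definekey('x');
      h.definedone();
   end;
   if h.check()=0 then do;   /* key already stored: duplicate found */
      call symputx('dup_x', 'Yes');
      stop;
   end;
   h.add();                  /* first occurrence: remember it */
   if done then call symputx('dup_x', 'No');
run;

Note this treats missing values as keys too, and it doesn't handle the N/A cases (all-missing or single-valued columns); those would need an extra check.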
If you have Viya, you could use PROC CARDINALITY.
I'd be curious: if you take a random sample of, say, 1 million rows, how many of the 200 columns have no duplicate values? Hopefully it would be a small number. Any column with duplicates in the sample certainly has duplicates in the full table. So you could take the sample, run PROC FREQ NLEVELS on the sample, and then hopefully be left with a small number of variables that you have to brute-force (running PROC FREQ one at a time, or a hash table one at a time, or even PROC SORT NODUPKEY one at a time...).
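A sketch of that sampling pass (assuming the big table is named HAVE; the sample size and seed are arbitrary):

proc surveyselect data=have out=sample method=srs sampsize=1000000 seed=42;
run;

proc freq data=sample nlevels;
   tables _all_ / noprint;
   ods output nlevels=samplevels;
run;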
Do you need the actual cardinality of each column, or do you just need to determine if there are ANY duplicates for each column?
If it's the latter, then you can program a data step with a hash for each column. When a duplicate is found, you can stop profiling that column and delete the corresponding hash object, freeing up memory for use on the other columns.
Of course, if there are very few columns with a duplicate, you'll probably end up exhausting memory before processing the entire data set. Then you'd have to loop over subsets of columns that don't exhaust memory.
data have;
   set sashelp.heart;
   call missing(height);   /* Height: 100% missing       -> N/A */
   weight=2;               /* Weight: one constant value -> N/A */
   x=rand('uniform');      /* x: essentially unique      -> No  */
   keep status sex height weight x;
run;
proc transpose data=have(obs=0) out=temp;   /* OBS=0: transpose the structure only, giving one row per variable name in _NAME_ */
   var _all_;
run;
filename x temp;   /* temporary file to hold the generated program */
data _null_;
   set temp end=last;
   file x;
   if _n_=1 then put 'proc sql;create table want as';
   /* one SELECT block per column: N/A for 0 or 1 distinct values,
      No when every non-missing value is distinct, else Yes */
   put 'select "' _name_ '" as column_names,
   case when count(distinct ' _name_ ') in (0 1) then "N/A"
   when count(distinct ' _name_ ') = count(' _name_ ') then "No"
   else "Yes" end as duplicate_values from have';
   if last then put ';quit;';
   else put 'union';
run;
%include x/source2;   /* compile and run the generated PROC SQL */
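To make the code-generation step concrete, the generated program looks roughly like this (first and last columns shown; the real file has one SELECT/UNION block per column):

proc sql;create table want as
select "Status" as column_names,
   case when count(distinct Status) in (0 1) then "N/A"
        when count(distinct Status) = count(Status) then "No"
        else "Yes" end as duplicate_values from have
union
/* ...one SELECT/UNION block for each remaining column... */
select "x" as column_names,
   case when count(distinct x) in (0 1) then "N/A"
        when count(distinct x) = count(x) then "No"
        else "Yes" end as duplicate_values from have
;quit;

For this example the full version comes back Yes for Status and Sex, N/A for Height and Weight, and (almost certainly) No for x.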
P.S. You need to take many scenarios into account (like a mix of missing and non-missing values).
The code below generates a dataset with one row per original column (the numeric columns, as written), containing the variable name and a "Yes" or "No " value for DUPE_FOUND.
It makes a hash table for each column, which allows checking for duplicates. Once a duplicate is found for a given column, three things happen: (1) DUPE_FOUND goes from "No " to "Yes", (2) that column is no longer examined, and (3) its hash object is deleted, freeing up memory.
data want (keep=varname dupe_found);
   set have end=end_of_have;
   array cols_to_check {*} _numeric_;   /* profile the numeric columns */
   declare hash h;   /* Reserve the name H for hash objects, not a variable name */
   length varname $32 dupe_found $3;
   if _n_=1 then do;
      /* Master hash: one item per column, holding its name, its status, and its own hash */
      declare hash dupecheck (ordered:'A');
      dupecheck.definekey('varname');
      dupecheck.definedata('varname','dupe_found','h');
      dupecheck.definedone();
      declare hiter hd ('dupecheck');
      dupe_found='No ';
      do v=1 to dim(cols_to_check);   /* Make a hash for each column */
         varname=vname(cols_to_check{v});
         h=_new_ hash ();
         h.definekey(varname);
         h.definedata(varname);
         h.definedone();
         dupecheck.add();   /* Add this hash & varname as data item to dupecheck */
      end;
   end;
   /* For each column not yet flagged, test the current row's value */
   do while(hd.next()=0);
      if dupe_found='Yes' then continue;   /* already flagged: skip this column */
      if h.check()^=0 then h.add();        /* new value: remember it */
      else do;                             /* value seen before: duplicate */
         dupe_found='Yes';
         dupecheck.replace();
         h.delete();                       /* free this column's hash memory */
      end;
   end;
   if end_of_have then do while(hd.next()=0);   /* one output row per column */
      output;
   end;
run;