BookmarkSubscribeRSS Feed
rodelabrishti
Calcite | Level 5

I need to create a dup record report. I have a code, but it's not working as the data is huge (more than 100mil). I have a dataset with more than 200 columns and 100+ mil records. This is what I'm trying to do - 

 

 

column_namesduplicate_values
aNo
bYes
Yes
dN/A
eNo
fNo
gNo
hYes
iN/A
jYes

 

Here, it there are duplicate values in a column, then it's "Yes". If there are no duplicate values at all, then "No", if the value is unique (one value for all the records), or the values are 100% missing for that column, then "N/A". 

 

As I said, I have a code to do the job, but it only works if the record count is less than a million. This is the step where it runs out of memory -

proc freq data = dup nlevels;

ods output nlevels = duplevels;

run;

 

Any help would appreciated. 

 

Thanks!

 

 

9 REPLIES 9
Tom
Super User Tom
Super User

Is it running out of memory because there are too many distinct levels?

 

Or is it running out of memory because you told it to print so much information? To prevent it printing the frequency tables for every variable add a tables statement with the noprint option.

tables _all_ / noprint;
rodelabrishti
Calcite | Level 5
It says the result is too big and will take time and resources. even after adding noprint.
PaigeMiller
Diamond | Level 26

Please tell us the EXACT word-for-word unedited error message. "Too big" is too vague.

--
Paige Miller
Tom
Super User Tom
Super User

@rodelabrishti wrote:
It says the result is too big and will take time and resources. even after adding noprint.

That sounds like the message you get when the LOG or RESULTS output is large.  

Make sure you are not doing something else in the same block of code that would produce a lot of output.

 

If you are unsure you might try running with system option OBS set to a small number so you can actually see the results and figure out where they are coming from in the code.  But don't do that if the code is overwriting any existing datasets as that would cause the loss of data.

Astounding
PROC Star

How should this cases be handled?

Some missing values, but all non-missing values are identical

Quentin
Super User

If you run PROC FREQ just on one variable, will it succeed?  If so, then you could write a macro to iterate over the list of variables, calculating the cardinality of each variable, and then append the results.

If you have a LOT of memory, you could try using a hash table to look for duplicates.  You might have to do it one variable at a time.

 

If you  have Viya, you could use PROC CARDINALITY.

 

I'd be curious, if you take a random sample of, say, 1 million rows, how many of the 200 columns have no duplicate values?  Hopefully it would be a small number.  So you could take the sample, run PROC FREQ NLEVELS on the sample, and then hopefully be left with a small number of variables that you have to brute force (running proc freq one at a time, or hash table one at a time, or even heck proc sort nodupkey one at a time...)

The Boston Area SAS Users Group (BASUG) is hosting an in person Meeting & Training on June 27!
Full details and registration info at https://www.basug.org/events.
mkeintz
PROC Star

Do you need the actual cardinality of each column, or do you just need to determine if there are ANY duplicates for each column?

 

If it's the latter, then you can program a data step with a hash for each column.  When a duplicate is found, you can stop profiling that column and delete the corresponding hash object, freeing up memory for use on the other columns.

 

Of course, if there are very few columns with a duplicate, you'll probably end up exhausting memory before processing the entire data set.  Then you'd have to loop over subsets of columns that don't exhaust memory.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
data have;
 set sashelp.heart;
 call missing(height);
 weight=2;
 x=rand('uniform');
 keep status sex height weight x;
run;



proc transpose data=have(obs=0) out=temp;
var _all_;
run;
filename x temp;
data _null_;
 set temp end=last;
 file x;
 if _n_=1 then put 'proc sql;create table want as';
 put 'select "' _name_ '" as column_names,
case when count(distinct ' _name_ ') in (0 1) then "N/A" 
     when count(distinct ' _name_ ') = count(' _name_ ') then "No" 
     else "Yes" end as duplicate_values from have';
 if last then put ';quit;';
  else put 'union';
run;
%include x/source2;

P.S. You need to take into account of many scenarios (like: Mixed with missing and non-missing value). 

mkeintz
PROC Star

The below generates a dataset, one row per original columns, with the variable name, and a "Yes" or "No " value for DUPE_FOUND.

 

It makes a hash table for each column, which allows checking for duplicates.  But once a duplicate is found for a given column, three things happen: (1) DUPE_FOUND goes from "No " to "Yes", (2) that column is no longer examined, and (3) its hash object is deleted, freeing up memory.

 

data want (keep=varname dupe_found);
  set have  end=end_of_have;

  array cols_to_check {*}  _numeric_;

  declare hash h ;   /*Reserve the name H for hash objects, not a variable name*/
  length varname $32   dupe_found $3;

  if _n_=1 then do;
    declare hash dupecheck (ordered:'A');
      dupecheck.definekey('varname');
      dupecheck.definedata ('varname','dupe_found','h');
      dupecheck.definedone();
    declare hiter hd ('dupecheck');   

    dupe_found='No ';

    do v=1 to dim(cols_to_check);   /*Make a hash for each column*/
      varname=vname(cols_to_check{v});
      h=_new_ hash ();
        h.definekey(varname);
        h.definedata(varname);
        h.definedone();
      dupecheck.add();       /* Add this hash & varname as dataitem to dupecheck*/
    end;
  end;    

  do while(hd.next()=0);
    if dupe_found='Yes' then continue;
    if h.check()^=0 then h.add();
    else do;
      dupe_found='Yes';
      dupecheck.replace();
      h.delete();
    end;
  end;

  if end_of_have then do while(hd.next()=0);
    output;
  end;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon Kicks Off on June 11!

Watch the live Hackathon Kickoff to get all the essential information about the SAS Hackathon—including how to join, how to participate, and expert tips for success.

YouTube LinkedIn

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1674 views
  • 3 likes
  • 7 in conversation