I need to create a duplicate-record report. I have code, but it's not working because the data is huge. I have a dataset with more than 200 columns and 100+ million records. This is what I'm trying to do -
column_names | duplicate_values
a            | No
b            | Yes
c            | Yes
d            | N/A
e            | No
f            | No
g            | No
h            | Yes
i            | N/A
j            | Yes
Here, if there are duplicate values in a column, then it's "Yes". If there are no duplicate values at all, then "No". If the column is single-valued (one value for all the records), or the values are 100% missing for that column, then "N/A".
As I said, I have a code to do the job, but it only works if the record count is less than a million. This is the step where it runs out of memory -
proc freq data=dup nlevels;
   ods output nlevels=duplevels;
run;
Any help would be appreciated.
Thanks!
Is it running out of memory because there are too many distinct levels?
Or is it running out of memory because you told it to print so much information? To prevent it from printing the frequency tables for every variable, add a TABLES statement with the NOPRINT option.
tables _all_ / noprint;
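Putting that together with your original step, a minimal sketch (assuming your dataset is named DUP and you want the summary in DUPLEVELS, as in your post):

proc freq data=dup nlevels;
   tables _all_ / noprint;          /* suppress the per-variable frequency tables */
   ods output nlevels=duplevels;    /* keep only the NLevels summary */
run;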
Please tell us the EXACT word-for-word unedited error message. "Too big" is too vague.
@rodelabrishti wrote:
It says the result is too big and will take time and resources, even after adding NOPRINT.
That sounds like the message you get when the LOG or RESULTS output is large.
Make sure you are not doing something else in the same block of code that would produce a lot of output.
If you are unsure, you might try running with the system option OBS set to a small number so you can actually see the results and figure out where in the code they are coming from. But don't do that if the code overwrites any existing datasets, as that would cause data loss.
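For example, a minimal sketch of that approach (the step in the middle is a placeholder):

options obs=1000;    /* read at most 1000 observations from each input */
/* ... rerun the step that produced the message ... */
options obs=max;     /* restore full processing afterwards */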
How should this case be handled?
Some missing values, but all non-missing values are identical
If you run PROC FREQ on just one variable, will it succeed? If so, you could write a macro that iterates over the list of variables, calculates the cardinality of each, and appends the results.
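A rough sketch of such a macro (untested; the macro name, the DS=/OUT= parameters, and the _VARS/_ONE work datasets are invented here):

%macro nlevels_by_var(ds=, out=duplevels);
   /* get the list of variable names in &ds */
   proc contents data=&ds out=_vars(keep=name) noprint;
   run;
   data _null_;
      set _vars end=last;
      call symputx(cats('v', _n_), name);
      if last then call symputx('nvars', _n_);
   run;
   /* start fresh (harmless log message if &out does not exist yet) */
   proc datasets lib=work nolist;
      delete &out;
   quit;
   /* one small PROC FREQ per variable, appending each NLEVELS row */
   %do i=1 %to &nvars;
      proc freq data=&ds nlevels;
         tables &&v&i / noprint;
         ods output nlevels=_one;
      run;
      proc append base=&out data=_one force;
      run;
   %end;
%mend nlevels_by_var;

%nlevels_by_var(ds=dup, out=duplevels)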
If you have a LOT of memory, you could try using a hash table to look for duplicates. You might have to do it one variable at a time.
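For a single column, that could look something like this (a sketch assuming a numeric column X in a dataset named HAVE; it stops at the first duplicate, so only duplicate-free columns require a full pass):

data _null_;
   set have end=done;
   if _n_=1 then do;
      declare hash h();
      h.definekey('x');
      h.definedone();
   end;
   if h.check()=0 then do;   /* key already stored: duplicate found */
      call symputx('dup_x', 'Yes');
      stop;
   end;
   h.add();                  /* first occurrence: remember it */
   if done then call symputx('dup_x', 'No');
run;

Note this treats missing values as keys too, and it doesn't handle the N/A cases (all-missing or single-valued columns); those would need an extra check.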
If you have Viya, you could use PROC CARDINALITY.
I'd be curious: if you take a random sample of, say, 1 million rows, how many of the 200 columns have no duplicate values? Hopefully it would be a small number. Any column with duplicates in the sample certainly has duplicates in the full table. So you could take the sample, run PROC FREQ NLEVELS on the sample, and then hopefully be left with a small number of variables that you have to brute-force (running PROC FREQ one at a time, or a hash table one at a time, or even PROC SORT NODUPKEY one at a time...).
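A sketch of that sampling pass (assuming the big table is named HAVE; the sample size and seed are arbitrary):

proc surveyselect data=have out=sample method=srs sampsize=1000000 seed=42;
run;

proc freq data=sample nlevels;
   tables _all_ / noprint;
   ods output nlevels=samplevels;
run;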
Do you need the actual cardinality of each column, or do you just need to determine if there are ANY duplicates for each column?
If it's the latter, then you can program a data step with a hash for each column. When a duplicate is found, you can stop profiling that column and delete the corresponding hash object, freeing up memory for use on the other columns.
Of course, if there are very few columns with a duplicate, you'll probably end up exhausting memory before processing the entire data set. Then you'd have to loop over subsets of columns that don't exhaust memory.
data have;
   set sashelp.heart;
   call missing(height);   /* Height: 100% missing       -> N/A */
   weight=2;               /* Weight: one constant value -> N/A */
   x=rand('uniform');      /* x: essentially unique      -> No  */
   keep status sex height weight x;
run;
proc transpose data=have(obs=0) out=temp;   /* OBS=0: transpose the structure only, giving one row per variable name in _NAME_ */
   var _all_;
run;
filename x temp;   /* temporary file to hold the generated program */
data _null_;
   set temp end=last;
   file x;
   if _n_=1 then put 'proc sql;create table want as';
   /* one SELECT block per column: N/A for 0 or 1 distinct values,
      No when every non-missing value is distinct, else Yes */
   put 'select "' _name_ '" as column_names,
   case when count(distinct ' _name_ ') in (0 1) then "N/A"
   when count(distinct ' _name_ ') = count(' _name_ ') then "No"
   else "Yes" end as duplicate_values from have';
   if last then put ';quit;';
   else put 'union';
run;
%include x/source2;   /* compile and run the generated PROC SQL */
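To make the code-generation step concrete, the generated program looks roughly like this (first and last columns shown; the real file has one SELECT/UNION block per column):

proc sql;create table want as
select "Status" as column_names,
   case when count(distinct Status) in (0 1) then "N/A"
        when count(distinct Status) = count(Status) then "No"
        else "Yes" end as duplicate_values from have
union
/* ...one SELECT/UNION block for each remaining column... */
select "x" as column_names,
   case when count(distinct x) in (0 1) then "N/A"
        when count(distinct x) = count(x) then "No"
        else "Yes" end as duplicate_values from have
;quit;

For this example the full version comes back Yes for Status and Sex, N/A for Height and Weight, and (almost certainly) No for x.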
P.S. You need to take many scenarios into account (like a mix of missing and non-missing values).
The code below generates a dataset with one row per original column (the numeric columns, as written), containing the variable name and a "Yes" or "No " value for DUPE_FOUND.
It makes a hash table for each column, which allows checking for duplicates. Once a duplicate is found for a given column, three things happen: (1) DUPE_FOUND goes from "No " to "Yes", (2) that column is no longer examined, and (3) its hash object is deleted, freeing up memory.
data want (keep=varname dupe_found);
   set have end=end_of_have;
   array cols_to_check {*} _numeric_;   /* profile the numeric columns */
   declare hash h;   /* Reserve the name H for hash objects, not a variable name */
   length varname $32 dupe_found $3;
   if _n_=1 then do;
      /* Master hash: one item per column, holding its name, its status, and its own hash */
      declare hash dupecheck (ordered:'A');
      dupecheck.definekey('varname');
      dupecheck.definedata('varname','dupe_found','h');
      dupecheck.definedone();
      declare hiter hd ('dupecheck');
      dupe_found='No ';
      do v=1 to dim(cols_to_check);   /* Make a hash for each column */
         varname=vname(cols_to_check{v});
         h=_new_ hash ();
         h.definekey(varname);
         h.definedata(varname);
         h.definedone();
         dupecheck.add();   /* Add this hash & varname as data item to dupecheck */
      end;
   end;
   /* For each column not yet flagged, test the current row's value */
   do while(hd.next()=0);
      if dupe_found='Yes' then continue;   /* already flagged: skip this column */
      if h.check()^=0 then h.add();        /* new value: remember it */
      else do;                             /* value seen before: duplicate */
         dupe_found='Yes';
         dupecheck.replace();
         h.delete();                       /* free this column's hash memory */
      end;
   end;
   if end_of_have then do while(hd.next()=0);   /* one output row per column */
      output;
   end;
run;