Hi,
I have the following code wich enable me to see the records that have duplicates and how many duplicates there is :
Hi,
I have the following code :
data h1;
input (Name Info1 Info2 Info3) (:$10.);
cards;
John a . .
John . bb cc
George aaa . ccc
Adam aaaa bbbb .
Brian a . cccc
Brian a . cccc
Brian bb . cccc
;
run;
proc freq data=h1;
tables name / out = Form1_DUPIDS (keep = name Count where = (Count > 1)); run;
The output of the code is :
John 2
Brian 3
How do I write the code so the output is the total number of duplicates records. In this example the result would be 3; 1 duplicate for John and 2 duplicates for Brian.
Also, what is the way the count the number of observations (meaning the whole line and not having to specify we want to analyze the names only) that are duplicates?
Thank you for your help and time.
There are many ways to identify and count duplicates. One very convenient tool is the DUPOUT, NODUPKEY, UNIQUEOUT, NODUPRECS options in PROC SORT.
PG
Thank you for the quick reply.
My current code is the following :
data h1;
input (Name Info1 Info2 Info3) (:$10.);
cards;
John a . .
John . bb cc
George aaa . ccc
Adam aaaa bbbb .
Brian a . cccc
Brian a . cccc
Brian bb . cccc
;
run;
proc freq data=h1;
tables name / out = dups_table (keep = name Count where = (Count > 1)); run;
data no_of_dups_table;
set dups_table;
no_of_dups = count -1;
run;
proc means data = no_of_dups_table sum;
var no_of_dups;
run;
My current code gives me the following info in the no_of_dups_table :
Name COUNT no_of_dups
Brian 3 2
John 2 1
First of all, I would like the get the sum of the variable no_of_dups (i.e. 3)
What is the code to get that result and how do I store it in a variable?
Thank you for your help and time.
Given the fact that I would only want to count the number of duplicate observations (not variable), I found the following code :
proc sort data=h1 noduprecs;
by _all_;
run;
Is there a way to store the number of observation that where deleted in a variable?
Thank you for your help and time.
hi ... using your original data ...
proc sql;
create table dups as
select name, count(*)-1 as dups
from h1
group by name
having dups gt 0;
quit;
name dups
Brian 2
John 1
Thank you MikeZdeb for the reply.
In fact, I am halfway there. What I am really trying to do is to have the sum of the dups (how many duplicate observation do I have in a given dataset)
I then want to incorporate this information in a macro that you have written for me before ( see post here : https://communities.sas.com/thread/37110).
Your original macro gave this output
table name missing not_applicable missing total
table1 name 2 1 1 4
table1 age 4 . . 4
z.table2 color 1 . . 1
z.table2 height 1 . . 1
sashelp.class Name 19 . . 19
sashelp.class Sex 19 . . 19
sashelp.class Age 19 . . 19
sashelp.class Height 19 . . 19
sashelp.class Weight 19 . . 19
sashelp.heart Status 5209 . . 5209
sashelp.heart DeathCause 1991 . 3218 5209
sashelp.heart AgeCHDdiag 1449 . 3760 5209
I would like to have the same output with an additional column that would provide the number of duplicate observations.
Could you please help me tweak the code in the corresponding post in order to have the output I require?
Your help and time are really appreciated.
Regards,
nicnad
Hi nicnad,
Seems what you need to do is deduct 1 from each frequency then add up the new variable as follows.
DATA NUM_OF_DUPS;
SET Form1_DUPIDS;
DUP_VALUES = COUNT - 1;
run;
Proc Means Data = Num_of_dups SUM;
var DUP_VALUES;
run;
that should give you the number of duplicate values.
Cheers,
Mohammad
Hi nicnad
do you like this?
data h1;
input (Name Info1 Info2 Info3) (:$10.);
cards;
John a . .
John . bb cc
George aaa . ccc
Adam aaaa bbbb .
Brian a . cccc
Brian a . cccc
Brian bb . cccc
;
run;
;
proc sort data=h1;
by name;
run;
data h1_count;
set h1;
by name;
retain count;
if first.name then count=1;
else count+1;
run;
/* to keep only 1st record*/
data nodup;
set h1_count;
where count=1;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.