Re: Help with freq and duplicates

nicnad · Posted 08-23-2012 11:52 AM

Hi,

I have the following code wich enable me to see the records that have duplicates and how many duplicates there is :

Hi,

I have the following code :

data h1;
input (Name Info1 Info2 Info3) (:$10.);
cards;

John a . .

John . bb cc

George aaa . ccc

Adam aaaa bbbb .

Brian a . cccc

Brian bb . cccc

;
run;

proc freq data=h1;
tables name / out = Form1_DUPIDS (keep = name Count where = (Count > 1)); run;

The output of the code is :

John 2

Brian 3

How do I write the code so the output is the total number of duplicates records. In this example the result would be 3; 1 duplicate for John and 2 duplicates for Brian.

Also, what is the way the count the number of observations (meaning the whole line and not having to specify we want to analyze the names only) that are duplicates?

Thank you for your help and time.

PGStats · Posted 08-23-2012 12:29 PM

There are many ways to identify and count duplicates. One very convenient tool is the DUPOUT, NODUPKEY, UNIQUEOUT, NODUPRECS options in PROC SORT.

PG

nicnad · Posted 08-23-2012 02:12 PM

Thank you for the quick reply.

My current code is the following :

data h1;
input (Name Info1 Info2 Info3) (:$10.);
cards;

John a . .

John . bb cc

George aaa . ccc

Adam aaaa bbbb .

Brian a . cccc

Brian bb . cccc

;
run;

proc freq data=h1;
tables name / out = dups_table (keep = name Count where = (Count > 1)); run;

data no_of_dups_table;
set dups_table;
no_of_dups = count -1;
run;

proc means data = no_of_dups_table sum;
var no_of_dups;
run;

My current code gives me the following info in the no_of_dups_table :

Name COUNT no_of_dups
Brian 3 2
John 2 1

First of all, I would like the get the sum of the variable no_of_dups (i.e. 3)

What is the code to get that result and how do I store it in a variable?

Thank you for your help and time.

nicnad · Posted 08-23-2012 02:20 PM

Given the fact that I would only want to count the number of duplicate observations (not variable), I found the following code :

proc sort data=h1 noduprecs;

by _all_;

run;

Is there a way to store the number of observation that where deleted in a variable?

Thank you for your help and time.

MikeZdeb · Posted 08-23-2012 03:23 PM

hi ... using your original data ...

proc sql;

create table dups as

select name, count(*)-1 as dups

from h1

group by name

having dups gt 0;

quit;

name dups

Brian 2

John 1

nicnad · Posted 08-23-2012 05:16 PM

Thank you MikeZdeb for the reply.

In fact, I am halfway there. What I am really trying to do is to have the sum of the dups (how many duplicate observation do I have in a given dataset)

I then want to incorporate this information in a macro that you have written for me before ( see post here : https://communities.sas.com/thread/37110).

Your original macro gave this output

table name missing not_applicable missing total

table1 name 2 1 1 4

table1 age 4 . . 4

z.table2 color 1 . . 1

z.table2 height 1 . . 1

sashelp.class Name 19 . . 19

sashelp.class Sex 19 . . 19

sashelp.class Age 19 . . 19

sashelp.class Height 19 . . 19

sashelp.class Weight 19 . . 19

sashelp.heart Status 5209 . . 5209

sashelp.heart DeathCause 1991 . 3218 5209

sashelp.heart AgeCHDdiag 1449 . 3760 5209

I would like to have the same output with an additional column that would provide the number of duplicate observations.

Could you please help me tweak the code in the corresponding post in order to have the output I require?

Your help and time are really appreciated.

Regards,

nicnad

M_Elteibi · Posted 11-07-2013 07:46 PM

Hi nicnad,

Seems what you need to do is deduct 1 from each frequency then add up the new variable as follows.

DATA NUM_OF_DUPS;

SET Form1_DUPIDS;

DUP_VALUES = COUNT - 1;

run;

Proc Means Data = Num_of_dups SUM;

var DUP_VALUES;

run;

that should give you the number of duplicate values.

Cheers,

Mohammad

Mit · Posted 11-07-2013 09:28 PM

Hi nicnad

do you like this?

data h1;
input (Name Info1 Info2 Info3) (:$10.);
cards;

John    a    . .
John    .    bb    cc
George    aaa     .    ccc
Adam    aaaa     bbbb    .
Brian    a .        cccc
Brian    a .        cccc
Brian    bb .        cccc
;
run;
;

proc sort data=h1;
by name;
run;

data h1_count;
set h1;
by name;
retain count;
if first.name then count=1;
else count+1;
run;

/* to keep only 1st record*/

data nodup;
set h1_count;
where count=1;
run;

Registration is open

SAS Training: Just a Click Away