BookmarkSubscribeRSS Feed
nicnad
Fluorite | Level 6

Hi,

I have the following code wich enable me to see the records that have duplicates and how many duplicates there is :


Hi,

I have the following code :

data h1;
input (Name    Info1    Info2    Info3) (:$10.);
cards;

John    a    . .  

John    .    bb    cc

George    aaa     .    ccc

Adam    aaaa     bbbb    .

Brian    a .        cccc

Brian    a .        cccc

Brian    bb .        cccc

;
run;


proc freq data=h1;
tables name / out = Form1_DUPIDS (keep = name Count where = (Count > 1)); run;

The output of the code is :

John 2

Brian 3

How do I write the code so the output is the total number of duplicates records. In this example the result would be 3; 1 duplicate for John and 2 duplicates for Brian.

Also, what is the way the count the number of observations (meaning the whole line and not having to specify we want to analyze the names only) that are duplicates?

Thank you for your help and time.

7 REPLIES 7
PGStats
Opal | Level 21

There are many ways to identify and count duplicates. One very convenient tool is the DUPOUT, NODUPKEY, UNIQUEOUT, NODUPRECS options in PROC SORT.

PG

PG
nicnad
Fluorite | Level 6

Thank you for the quick reply.

My current code is the following :

data h1;
input (Name    Info1    Info2    Info3) (:$10.);
cards;

John    a    . . 

John    .    bb    cc

George    aaa     .    ccc

Adam    aaaa     bbbb    .

Brian    a .        cccc

Brian    a .        cccc

Brian    bb .        cccc

;
run;


proc freq data=h1;
tables name / out = dups_table (keep = name Count where = (Count > 1)); run;

data no_of_dups_table;
set dups_table;
no_of_dups = count -1;
run;

proc means data = no_of_dups_table sum;
var no_of_dups;
run;

My current code gives me the following info in the no_of_dups_table :

Name COUNT no_of_dups
Brian      3           2
John       2           1

First of all, I would like the get the sum of the variable no_of_dups (i.e. 3)

What is the code to get that result and how do I store it in a variable?

Thank you for your help and time.

nicnad
Fluorite | Level 6

Given the fact that I would only want to count the number of duplicate observations (not variable), I found the following code :

proc sort data=h1 noduprecs;

by _all_;

run;

Is there a way to store the number of observation that where deleted in a variable?

Thank you for your help and time.

MikeZdeb
Rhodochrosite | Level 12

hi ... using your original data ...

proc sql;

create table dups as

select name, count(*)-1 as dups

from h1

group by name

having dups gt 0;

quit;

name            dups

Brian              2

John               1

nicnad
Fluorite | Level 6

Thank you MikeZdeb for the reply.

In fact, I am halfway there. What I am really trying to do is to have the sum of the dups (how many duplicate observation do I have in a given dataset)

I then want to incorporate this information in a macro that you have written for me before ( see post here : https://communities.sas.com/thread/37110).

Your original macro gave this output

table            name              missing    not_applicable    missing    total

table1           name                   2            1               1         4

table1           age                    4            .               .         4

z.table2         color                  1            .               .         1

z.table2         height                 1            .               .         1

sashelp.class    Name                  19            .               .        19

sashelp.class    Sex                   19            .               .        19

sashelp.class    Age                   19            .               .        19

sashelp.class    Height                19            .               .        19

sashelp.class    Weight                19            .               .        19

sashelp.heart    Status              5209            .               .      5209

sashelp.heart    DeathCause          1991            .            3218      5209

sashelp.heart    AgeCHDdiag          1449            .            3760      5209

I would like to have the same output with an additional column that would provide the number of duplicate observations.

Could you please help me tweak the code in the corresponding post in order to have the output I require?

Your help and time are really appreciated.

Regards,

nicnad

M_Elteibi
Calcite | Level 5

Hi nicnad,

Seems what you need to do is deduct 1 from each frequency then add up the new variable as follows.

DATA NUM_OF_DUPS;

SET  Form1_DUPIDS;

DUP_VALUES = COUNT - 1;

run;

Proc Means Data = Num_of_dups SUM;

var DUP_VALUES;

run;

that should give you the number of duplicate values.

Cheers,

Mohammad

Mit
Calcite | Level 5 Mit
Calcite | Level 5

Hi nicnad

do you like this?

data h1;
input (Name    Info1    Info2    Info3) (:$10.);
cards;

John    a    . . 
John    .    bb    cc
George    aaa     .    ccc
Adam    aaaa     bbbb    .
Brian    a .        cccc
Brian    a .        cccc
Brian    bb .        cccc
;
run;
;

proc sort data=h1;
by name;
run;

data h1_count;
set h1;
by name;
retain count;
if first.name then count=1;
else count+1;
run;

/* to keep only 1st record*/

data nodup;
set h1_count;
where count=1;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1225 views
  • 6 likes
  • 5 in conversation