Help using Base SAS procedures

Help with freq and duplicates

Reply
Regular Contributor
Posts: 186

Help with freq and duplicates

Hi,

I have the following code wich enable me to see the records that have duplicates and how many duplicates there is :


Hi,

I have the following code :

data h1;
input (Name    Info1    Info2    Info3) (:$10.);
cards;

John    a    . .  

John    .    bb    cc

George    aaa     .    ccc

Adam    aaaa     bbbb    .

Brian    a .        cccc

Brian    a .        cccc

Brian    bb .        cccc

;
run;


proc freq data=h1;
tables name / out = Form1_DUPIDS (keep = name Count where = (Count > 1)); run;

The output of the code is :

John 2

Brian 3

How do I write the code so the output is the total number of duplicates records. In this example the result would be 3; 1 duplicate for John and 2 duplicates for Brian.

Also, what is the way the count the number of observations (meaning the whole line and not having to specify we want to analyze the names only) that are duplicates?

Thank you for your help and time.

Respected Advisor
Posts: 4,935

Re: Help with freq and duplicates

There are many ways to identify and count duplicates. One very convenient tool is the DUPOUT, NODUPKEY, UNIQUEOUT, NODUPRECS options in PROC SORT.

PG

PG
Regular Contributor
Posts: 186

Re: Help with freq and duplicates

Thank you for the quick reply.

My current code is the following :

data h1;
input (Name    Info1    Info2    Info3) (:$10.);
cards;

John    a    . . 

John    .    bb    cc

George    aaa     .    ccc

Adam    aaaa     bbbb    .

Brian    a .        cccc

Brian    a .        cccc

Brian    bb .        cccc

;
run;


proc freq data=h1;
tables name / out = dups_table (keep = name Count where = (Count > 1)); run;

data no_of_dups_table;
set dups_table;
no_of_dups = count -1;
run;

proc means data = no_of_dups_table sum;
var no_of_dups;
run;

My current code gives me the following info in the no_of_dups_table :

Name COUNT no_of_dups
Brian      3           2
John       2           1

First of all, I would like the get the sum of the variable no_of_dups (i.e. 3)

What is the code to get that result and how do I store it in a variable?

Thank you for your help and time.

Regular Contributor
Posts: 186

Re: Help with freq and duplicates

Given the fact that I would only want to count the number of duplicate observations (not variable), I found the following code :

proc sort data=h1 noduprecs;

by _all_;

run;

Is there a way to store the number of observation that where deleted in a variable?

Thank you for your help and time.

Valued Guide
Posts: 765

Re: Help with freq and duplicates

hi ... using your original data ...

proc sql;

create table dups as

select name, count(*)-1 as dups

from h1

group by name

having dups gt 0;

quit;

name            dups

Brian              2

John               1

Regular Contributor
Posts: 186

Re: Help with freq and duplicates

Thank you MikeZdeb for the reply.

In fact, I am halfway there. What I am really trying to do is to have the sum of the dups (how many duplicate observation do I have in a given dataset)

I then want to incorporate this information in a macro that you have written for me before ( see post here : https://communities.sas.com/thread/37110).

Your original macro gave this output

table            name              missing    not_applicable    missing    total

table1           name                   2            1               1         4

table1           age                    4            .               .         4

z.table2         color                  1            .               .         1

z.table2         height                 1            .               .         1

sashelp.class    Name                  19            .               .        19

sashelp.class    Sex                   19            .               .        19

sashelp.class    Age                   19            .               .        19

sashelp.class    Height                19            .               .        19

sashelp.class    Weight                19            .               .        19

sashelp.heart    Status              5209            .               .      5209

sashelp.heart    DeathCause          1991            .            3218      5209

sashelp.heart    AgeCHDdiag          1449            .            3760      5209

I would like to have the same output with an additional column that would provide the number of duplicate observations.

Could you please help me tweak the code in the corresponding post in order to have the output I require?

Your help and time are really appreciated.

Regards,

nicnad

New Contributor
Posts: 2

Re: Help with freq and duplicates

Hi nicnad,

Seems what you need to do is deduct 1 from each frequency then add up the new variable as follows.

DATA NUM_OF_DUPS;

SET  Form1_DUPIDS;

DUP_VALUES = COUNT - 1;

run;

Proc Means Data = Num_of_dups SUM;

var DUP_VALUES;

run;

that should give you the number of duplicate values.

Cheers,

Mohammad

Frequent Contributor
Frequent Contributor
Posts: 83

Re: Help with freq and duplicates

Hi nicnad

do you like this?

data h1;
input (Name    Info1    Info2    Info3) (:$10.);
cards;

John    a    . . 
John    .    bb    cc
George    aaa     .    ccc
Adam    aaaa     bbbb    .
Brian    a .        cccc
Brian    a .        cccc
Brian    bb .        cccc
;
run;
;

proc sort data=h1;
by name;
run;

data h1_count;
set h1;
by name;
retain count;
if first.name then count=1;
else count+1;
run;

/* to keep only 1st record*/

data nodup;
set h1_count;
where count=1;
run;

Ask a Question
Discussion stats
  • 7 replies
  • 339 views
  • 6 likes
  • 5 in conversation