effective way to count the number of uniqueness

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

effective way to count the number of uniqueness

Hi All,

I have about 10 million records SAS data set with ids. I want to check if the id is unique or not? What is the effective way to do this?

I tried use proc sql to count the number but it takes long time. Should I create an index for the id? Please advise.

proc sql;

  select count(distinct id) from master;

quit;


Accepted Solutions
Solution
‎08-04-2015 08:06 AM
Super User
Posts: 9,691

Re: effective way to count the number of uniqueness

If you just want know if ID is unique , don't care which ID is duplicated.

The Best way is using Hash Table . If there are some duplicated ID , "ha.add()"  will eject ERROR info in LOG .

data _null_;

if _n_=1 then do;

if 0 then set have;

declare hash ha();

ha.definekey('id');

ha.definedone();

end;

set have;

ha.add();

run;

Message was edited by: xia keshan

View solution in original post


All Replies
Respected Advisor
Posts: 4,663

Re: effective way to count the number of uniqueness

Your query will count the number of distinct IDs in the dataset. Is that what you want?

If you want a list of records with non-unique IDs, you should use

proc sql;

create table dups as

select * from master

group by ID

having count(*) > 1;

quit;

Adding an index to your dataset would help if you are going to be repeating this check often. It is not worth the effort if this is a one time verification.

PG

PG
Occasional Contributor
Posts: 7

Re: effective way to count the number of uniqueness

Yes. I just want to check if the IDs are unique or not. However my simple query runs a long time. Is there any other efficient way to check?Thanks.

Super Contributor
Posts: 257

Re: effective way to count the number of uniqueness

Is your ID, a number or character? If number, do have any idea about it Range.

If it is number, array will be fast.

Occasional Contributor
Posts: 7

Re: effective way to count the number of uniqueness

I have character IDs. Let say my data set has 10,000,000 records. I just want to know if the IDs are unique or not. That's why I use:

proc sql;

  select count(distinct id) from master;

quit;

If I can get the exact same number of count then I am able to know they are all unique. But this method takes a lot of time.

Solution
‎08-04-2015 08:06 AM
Super User
Posts: 9,691

Re: effective way to count the number of uniqueness

If you just want know if ID is unique , don't care which ID is duplicated.

The Best way is using Hash Table . If there are some duplicated ID , "ha.add()"  will eject ERROR info in LOG .

data _null_;

if _n_=1 then do;

if 0 then set have;

declare hash ha();

ha.definekey('id');

ha.definedone();

end;

set have;

ha.add();

run;

Message was edited by: xia keshan

Super Contributor
Posts: 257

Re: effective way to count the number of uniqueness

If you consider that a unique ID is one that occurs once only, it is possible to find by using Hash objects. But how much time it will take compared to SQL can be found only by you, by running both ways.

Xia has already given a hash solution. It can not tell which ID occurs once only. Here is another hash way with an example, the length of ID need not be 6 in your case. It It collects all records having one ID into one data set(Unique) and the rest to another data set(Duplicates).

data have;

input ID $6.;

datalines;

aaaaaa

aaaaaa

aaaaaa

bbbbbb

dddddd

;

run;

data _null_;

   if _n_ = 1 then do;

      declare hash h();

      h.definekey('ID');

      h.definedata('ID','freq');

      h.definedone();

   end;

   set have end = eof;

   if h.find() ne 0 then freq = 0;

   freq + 1;

   h.replace();

   if eof;

     h.output(dataset:'Unique (where=(freq=1)');

     h.output(dataset:'Duplicates (where=(freq > 1)');

run;

proc print data = Unique;

run;

proc print data = Duplicates;

run;

Frequent Contributor
Posts: 85

Re: effective way to count the number of uniqueness

Proc FREQ will do this much quicker.  Try this:

proc freq data=master   noprint ;

   table ID / out=UniqueIDs(where=(count=1));

run;

Frequent Contributor
Posts: 85

Re: effective way to count the number of uniqueness

Depending on what else you are doing, you could use Proc SORT, which performs about the same as Proc FREQ.

proc sort data=master(keep=ID)  nounikey

          out=duplicateIDs

          uniout=uniqueIDs ;

  by ID;

run;

And if you're not interested in the non-unique IDs  being written to a data set, you can use...

            out = _null_

Trusted Advisor
Posts: 1,131

Re: effective way to count the number of uniqueness

May be first. and last. might be handy.

The below query will output only those id which are unique. For example in the below code only id 2 will be outputted. But may be i did not understand the query, please correct if i did not.

data have;

input id;

cards;

1

1

1

1

2

3

3

;

data want;

set have;

by id;

if first.id and last.id;

run;

proc print;

run;

Thanks,

Jag

Thanks,
Jag
🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 413 views
  • 8 likes
  • 6 in conversation