topic Re: effective way to count the number of uniqueness in SAS Programming

effective way to count the number of uniqueness

chenyiwen1717 — Mon, 03 Aug 2015 21:37:21 GMT

Hi All,

I have about 10 million records SAS data set with ids. I want to check if the id is unique or not? What is the effective way to do this?

I tried use proc sql to count the number but it takes long time. Should I create an index for the id? Please advise.

proc sql;

select count(distinct id) from master;

quit;

Re: effective way to count the number of uniqueness

PGStats — Mon, 03 Aug 2015 22:05:26 GMT

Your query will count the number of distinct IDs in the dataset. Is that what you want?

If you want a list of records with non-unique IDs, you should use

proc sql;

create table dups as

select * from master

group by ID

having count(*) > 1;

quit;

Adding an index to your dataset would help if you are going to be repeating this check often. It is not worth the effort if this is a one time verification.

Re: effective way to count the number of uniqueness

chenyiwen1717 — Mon, 03 Aug 2015 23:04:06 GMT

Yes. I just want to check if the IDs are unique or not. However my simple query runs a long time. Is there any other efficient way to check?Thanks.

Re: effective way to count the number of uniqueness

KachiM — Tue, 04 Aug 2015 00:12:31 GMT

Is your ID, a number or character? If number, do have any idea about it Range.

If it is number, array will be fast.

Re: effective way to count the number of uniqueness

JerryLeBreton — Tue, 04 Aug 2015 00:26:49 GMT

Proc FREQ will do this much quicker. Try this:

proc freq data=master noprint ;

table ID / out=UniqueIDs(where=(count=1));

run;

Re: effective way to count the number of uniqueness

JerryLeBreton — Tue, 04 Aug 2015 01:08:12 GMT

Depending on what else you are doing, you could use Proc SORT, which performs about the same as Proc FREQ.

proc sort data=master(keep=ID) nounikey

out=duplicateIDs

uniout=uniqueIDs ;

by ID;

run;

And if you're not interested in the non-unique IDs being written to a data set, you can use...

out = _null_

Re: effective way to count the number of uniqueness

Jagadishkatam — Tue, 04 Aug 2015 02:03:18 GMT

May be first. and last. might be handy.

The below query will output only those id which are unique. For example in the below code only id 2 will be outputted. But may be i did not understand the query, please correct if i did not.

data have;

input id;

cards;

;

data want;

set have;

by id;

if first.id and last.id;

run;

proc print;

run;

Thanks,

Jag

Re: effective way to count the number of uniqueness

chenyiwen1717 — Tue, 04 Aug 2015 03:27:12 GMT

I have character IDs. Let say my data set has 10,000,000 records. I just want to know if the IDs are unique or not. That's why I use:

proc sql;

select count(distinct id) from master;

quit;

If I can get the exact same number of count then I am able to know they are all unique. But this method takes a lot of time.

Re: effective way to count the number of uniqueness

Ksharp — Tue, 04 Aug 2015 12:06:05 GMT

If you just want know if ID is unique , don't care which ID is duplicated.

The Best way is using Hash Table . If there are some duplicated ID , "ha.add()" will eject ERROR info in LOG .

data _null_;

if _n_=1 then do;

if 0 then set have;

declare hash ha();

ha.definekey('id');

ha.definedone();

end;

set have;

ha.add();

run;

Message was edited by: xia keshan

Re: effective way to count the number of uniqueness

KachiM — Tue, 04 Aug 2015 14:23:19 GMT

If you consider that a unique ID is one that occurs once only, it is possible to find by using Hash objects. But how much time it will take compared to SQL can be found only by you, by running both ways.

Xia has already given a hash solution. It can not tell which ID occurs once only. Here is another hash way with an example, the length of ID need not be 6 in your case. It It collects all records having one ID into one data set(Unique) and the rest to another data set(Duplicates).

data have;

input ID $6.;

datalines;

aaaaaa

bbbbbb

dddddd

;

run;

data _null_;

if _n_ = 1 then do;

declare hash h();

h.definekey('ID');

h.definedata('ID','freq');

h.definedone();

end;

set have end = eof;

if h.find() ne 0 then freq = 0;

freq + 1;

h.replace();

if eof;

h.output(dataset:'Unique (where=(freq=1)');

h.output(dataset:'Duplicates (where=(freq > 1)');

run;

proc print data = Unique;

run;

proc print data = Duplicates;

run;