BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
chenyiwen1717
Fluorite | Level 6

Hi All,

I have about 10 million records SAS data set with ids. I want to check if the id is unique or not? What is the effective way to do this?

I tried use proc sql to count the number but it takes long time. Should I create an index for the id? Please advise.

proc sql;

  select count(distinct id) from master;

quit;

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

If you just want know if ID is unique , don't care which ID is duplicated.

The Best way is using Hash Table . If there are some duplicated ID , "ha.add()"  will eject ERROR info in LOG .

data _null_;

if _n_=1 then do;

if 0 then set have;

declare hash ha();

ha.definekey('id');

ha.definedone();

end;

set have;

ha.add();

run;

Message was edited by: xia keshan

View solution in original post

9 REPLIES 9
PGStats
Opal | Level 21

Your query will count the number of distinct IDs in the dataset. Is that what you want?

If you want a list of records with non-unique IDs, you should use

proc sql;

create table dups as

select * from master

group by ID

having count(*) > 1;

quit;

Adding an index to your dataset would help if you are going to be repeating this check often. It is not worth the effort if this is a one time verification.

PG

PG
chenyiwen1717
Fluorite | Level 6

Yes. I just want to check if the IDs are unique or not. However my simple query runs a long time. Is there any other efficient way to check?Thanks.

KachiM
Rhodochrosite | Level 12

Is your ID, a number or character? If number, do have any idea about it Range.

If it is number, array will be fast.

chenyiwen1717
Fluorite | Level 6

I have character IDs. Let say my data set has 10,000,000 records. I just want to know if the IDs are unique or not. That's why I use:

proc sql;

  select count(distinct id) from master;

quit;

If I can get the exact same number of count then I am able to know they are all unique. But this method takes a lot of time.

Ksharp
Super User

If you just want know if ID is unique , don't care which ID is duplicated.

The Best way is using Hash Table . If there are some duplicated ID , "ha.add()"  will eject ERROR info in LOG .

data _null_;

if _n_=1 then do;

if 0 then set have;

declare hash ha();

ha.definekey('id');

ha.definedone();

end;

set have;

ha.add();

run;

Message was edited by: xia keshan

KachiM
Rhodochrosite | Level 12

If you consider that a unique ID is one that occurs once only, it is possible to find by using Hash objects. But how much time it will take compared to SQL can be found only by you, by running both ways.

Xia has already given a hash solution. It can not tell which ID occurs once only. Here is another hash way with an example, the length of ID need not be 6 in your case. It It collects all records having one ID into one data set(Unique) and the rest to another data set(Duplicates).

data have;

input ID $6.;

datalines;

aaaaaa

aaaaaa

aaaaaa

bbbbbb

dddddd

;

run;

data _null_;

   if _n_ = 1 then do;

      declare hash h();

      h.definekey('ID');

      h.definedata('ID','freq');

      h.definedone();

   end;

   set have end = eof;

   if h.find() ne 0 then freq = 0;

   freq + 1;

   h.replace();

   if eof;

     h.output(dataset:'Unique (where=(freq=1)');

     h.output(dataset:'Duplicates (where=(freq > 1)');

run;

proc print data = Unique;

run;

proc print data = Duplicates;

run;

JerryLeBreton
Pyrite | Level 9

Proc FREQ will do this much quicker.  Try this:

proc freq data=master   noprint ;

   table ID / out=UniqueIDs(where=(count=1));

run;

JerryLeBreton
Pyrite | Level 9

Depending on what else you are doing, you could use Proc SORT, which performs about the same as Proc FREQ.

proc sort data=master(keep=ID)  nounikey

          out=duplicateIDs

          uniout=uniqueIDs ;

  by ID;

run;

And if you're not interested in the non-unique IDs  being written to a data set, you can use...

            out = _null_

Jagadishkatam
Amethyst | Level 16

May be first. and last. might be handy.

The below query will output only those id which are unique. For example in the below code only id 2 will be outputted. But may be i did not understand the query, please correct if i did not.

data have;

input id;

cards;

1

1

1

1

2

3

3

;

data want;

set have;

by id;

if first.id and last.id;

run;

proc print;

run;

Thanks,

Jag

Thanks,
Jag

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 2047 views
  • 8 likes
  • 6 in conversation