Hi All,
I have about 10 million records SAS data set with ids. I want to check if the id is unique or not? What is the effective way to do this?
I tried use proc sql to count the number but it takes long time. Should I create an index for the id? Please advise.
proc sql;
select count(distinct id) from master;
quit;
If you just want know if ID is unique , don't care which ID is duplicated.
The Best way is using Hash Table . If there are some duplicated ID , "ha.add()" will eject ERROR info in LOG .
data _null_;
if _n_=1 then do;
if 0 then set have;
declare hash ha();
ha.definekey('id');
ha.definedone();
end;
set have;
ha.add();
run;
Message was edited by: xia keshan
Your query will count the number of distinct IDs in the dataset. Is that what you want?
If you want a list of records with non-unique IDs, you should use
proc sql;
create table dups as
select * from master
group by ID
having count(*) > 1;
quit;
Adding an index to your dataset would help if you are going to be repeating this check often. It is not worth the effort if this is a one time verification.
PG
Yes. I just want to check if the IDs are unique or not. However my simple query runs a long time. Is there any other efficient way to check?Thanks.
Is your ID, a number or character? If number, do have any idea about it Range.
If it is number, array will be fast.
I have character IDs. Let say my data set has 10,000,000 records. I just want to know if the IDs are unique or not. That's why I use:
proc sql;
select count(distinct id) from master;
quit;
If I can get the exact same number of count then I am able to know they are all unique. But this method takes a lot of time.
If you just want know if ID is unique , don't care which ID is duplicated.
The Best way is using Hash Table . If there are some duplicated ID , "ha.add()" will eject ERROR info in LOG .
data _null_;
if _n_=1 then do;
if 0 then set have;
declare hash ha();
ha.definekey('id');
ha.definedone();
end;
set have;
ha.add();
run;
Message was edited by: xia keshan
If you consider that a unique ID is one that occurs once only, it is possible to find by using Hash objects. But how much time it will take compared to SQL can be found only by you, by running both ways.
Xia has already given a hash solution. It can not tell which ID occurs once only. Here is another hash way with an example, the length of ID need not be 6 in your case. It It collects all records having one ID into one data set(Unique) and the rest to another data set(Duplicates).
data have;
input ID $6.;
datalines;
aaaaaa
aaaaaa
aaaaaa
bbbbbb
dddddd
;
run;
data _null_;
if _n_ = 1 then do;
declare hash h();
h.definekey('ID');
h.definedata('ID','freq');
h.definedone();
end;
set have end = eof;
if h.find() ne 0 then freq = 0;
freq + 1;
h.replace();
if eof;
h.output(dataset:'Unique (where=(freq=1)');
h.output(dataset:'Duplicates (where=(freq > 1)');
run;
proc print data = Unique;
run;
proc print data = Duplicates;
run;
Proc FREQ will do this much quicker. Try this:
proc freq data=master noprint ;
table ID / out=UniqueIDs(where=(count=1));
run;
Depending on what else you are doing, you could use Proc SORT, which performs about the same as Proc FREQ.
proc sort data=master(keep=ID) nounikey
out=duplicateIDs
uniout=uniqueIDs ;
by ID;
run;
And if you're not interested in the non-unique IDs being written to a data set, you can use...
out = _null_
May be first. and last. might be handy.
The below query will output only those id which are unique. For example in the below code only id 2 will be outputted. But may be i did not understand the query, please correct if i did not.
data have;
input id;
cards;
1
1
1
1
2
3
3
;
data want;
set have;
by id;
if first.id and last.id;
run;
proc print;
run;
Thanks,
Jag
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
