Callam1
Obsidian | Level 7
Can people advise on the most efficient/quickest way to check whether a dataset with millions of records (millions of individuals, each with multiple records) contains the unique IDs listed in a small dataset (30,000 unique IDs)? I have a dataset called 'small' with one variable, ID, holding 30,000 unique values, and a dataset called 'big' with ID plus other variables and millions of records (ID not unique). I want to find how many individuals from 'small' I can find in 'big'. Sorting by ID and merging is taking me days; there must be a more efficient way. Thanks in advance.
ballardw
Super User

Something like this should work:

 

proc sql;
   create table work.found as
   select distinct a.id
   from work.small as a
        left join
        (select id from work.big) as b
        on a.id = b.id
   where not missing(b.id)
   ;
quit;

However, if the dataset is very large, the join will still take some time to process.

The output set work.found will only have the ID values that matched, one time each (the DISTINCT does that; normally a left join would have an ID appear once for each match found).

A left join would normally include all the values from the A data set, but the WHERE clause instructs SAS to include only the ones with a match.

The (select id from work.big) means only that variable is read from the big table, so less data is moved around and the query may run faster.

Callam1
Obsidian | Level 7
Thank you, I was considering that as the next thing to try, but I am still concerned about run time.

Kurt_Bremser
Super User

Use a hash:

data check;
set big (keep=id);
/* load the list of unique IDs from SMALL into a hash object on the first iteration */
if _n_ = 1
then do;
  declare hash s (dataset:"small");
  s.definekey("id");
  s.definedone();
end;
/* on a hit, remove the ID from the hash so it can only be output once */
if s.check() = 0
then do;
  rc = s.remove();
  output;
end;
keep id;
run;

Because of the REMOVE, only one observation per unique ID makes it into the output, so the final dataset's observation count is the exact number of unique IDs found.

 

Edit: note that no explicit sorting is needed; dataset "big" can be read as is.

While loading "small" into the hash, a search tree is built, effectively sorting it on the fly.

Tom
Super User

You could also add a test to see if the hash is empty, so that you can stop reading the BIG dataset once every ID from the small dataset has been found.

if s.check() = 0 then do;
  output;
  rc = s.remove();
  if 0=s.num_items then stop;
end;
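Put together with the step above, the whole thing might look like this (just a sketch, using the same dataset and variable names as before; the STOP ends the step as soon as the hash runs empty):

data check;
set big (keep=id);
if _n_ = 1
then do;
  declare hash s (dataset:"small");
  s.definekey("id");
  s.definedone();
end;
if s.check() = 0
then do;
  output;
  rc = s.remove();
  /* all 30,000 IDs have been found - no need to read the rest of BIG */
  if 0 = s.num_items then stop;
end;
keep id;
run;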
Callam1
Obsidian | Level 7
Thank you. It looks efficient and elegant. Are you familiar with the approach where you read the small dataset into a format and then apply that format to the big dataset?
Kurt_Bremser
Super User

@Callam1 wrote:
Thank you. It looks efficient and elegant. Are you familiar with the approach where you read the small dataset into a format and then apply that format to the big dataset?

You can do that, but a format is slower than the hash, which becomes important when working with large datasets.

We used to do it a lot before the introduction of the hash.

Callam1
Obsidian | Level 7
Brilliant! Thank you so much!
Callam1
Obsidian | Level 7
Once I establish how many IDs are found in the big dataset, I might decide to analyse the data, in which case I will need to retain from the big dataset all of the records for the IDs (from 'small') that were found. Is the hash method still valid, or should I use the indexing approach suggested by Tom below?
Tom
Super User

HASH should be faster. (And the code is not that hard once you get used to working with hash objects.)
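For the keep-all-records case, a hash lookup without the REMOVE would do it; a rough sketch along those lines (the output name MATCHED is just illustrative; this keeps every observation from BIG whose ID appears in SMALL):

data matched;
set big;
if _n_ = 1
then do;
  declare hash s (dataset:"small");
  s.definekey("id");
  s.definedone();
end;
/* subsetting IF: keep the observation only when its ID is found in the hash */
if s.check() = 0;
run;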

 

The idea of using a format would be to generate a CNTLIN dataset to define the FORMAT (or perhaps an INFORMAT if ID is character).  Then use PROC FORMAT to create the format.  Then, in a data step that reads the BIG dataset, check each ID to see if the formatted value is the result you defined the format to return.  This would require reading all of the BIG dataset, and it would return all of the matching observations from BIG, not just one per ID.
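For reference, a rough sketch of that format-based approach, assuming ID is numeric (the format name SMALLIDS and the dataset names are just illustrative):

/* build a CNTLIN dataset: every ID in SMALL maps to 'Y', everything else to 'N' */
data cntlin;
  set small end=last;
  retain fmtname 'smallids' type 'N';
  start = id;
  label = 'Y';
  output;
  if last then do;
    hlo = 'O';      /* the OTHER range catches all IDs not in SMALL */
    label = 'N';
    output;
  end;
run;

proc format cntlin=cntlin;
run;

/* keep the observations from BIG whose ID formats to 'Y' */
data found_fmt;
  set big;
  if put(id, smallids.) = 'Y';
run;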

 

Another way to handle it is to make an index on the BIG dataset.  That might take some time, but it should be faster than sorting the dataset, and the index might be useful for other things.  Then you could run a data step that reads in the SMALL dataset and checks whether each ID is found in the BIG dataset.  This should be fastest (once you have paid the up-front cost of creating the index).

https://www.lexjansen.com/wuss/2003/SASSolutions/c-an_animated_guide_speed_merges__iorc_.pdf

So something like: 

data want;
  set small;
  set big key=id / unique;   /* use the index on BIG to look up each ID from SMALL */
  if _iorc_ then do;         /* nonzero _IORC_ means the ID was not found in BIG */
     _error_ = 0;            /* clear the error flag so the log stays clean */
     delete;
  end;
run;
Ksharp
Super User
proc sql;
  create table want as
  select distinct id
  from big
  where id in (select distinct id from small);
quit;
