Callam1
Obsidian | Level 7
Can people advise on the most efficient/quickest way to check whether a dataset with millions of records (millions of individuals, each with multiple records) contains the unique IDs listed in a small dataset (30,000 unique IDs)? I have a dataset called 'small' with one variable, ID, holding 30,000 unique values, and a dataset called 'big' with ID plus other variables and millions of records (ID not unique). I want to find how many individuals from 'small' I can find in 'big'. Sorting by ID and merging is taking me days; there must be a more efficient way. Thanks in advance.
ballardw
Super User

Something like this should work:

 

proc sql;
   create table work.found as
   select distinct a.id
   from work.small as a
        left join
        (select id from work.big) as b
        on a.id = b.id
   where not missing(b.id)
   ;
quit;

However, if the dataset is very large, the join will still take some time to process.

The output set work.found will only have the ID values that matched, one time each (the DISTINCT does that; normally a left join would have an ID appear once for each match found).

A left join would normally include all the values from the A data set, but the WHERE clause instructs SAS to include only the ones with a match.

The (select id from work.big) means only that variable is read from the big table, so less data is moved around and the query may run faster.

Callam1
Obsidian | Level 7
Thank you, I was considering that as the next thing to try, but I am still concerned about run time.

Kurt_Bremser
Super User

Use a hash:

data check;
set big (keep=id);
/* load the list of unique IDs from SMALL into a hash object on the first iteration */
if _n_ = 1
then do;
  declare hash s (dataset:"small");
  s.definekey("id");
  s.definedone();
end;
/* on a hit, remove the ID from the hash so it can only be output once */
if s.check() = 0
then do;
  rc = s.remove();
  output;
end;
keep id;
run;

Because of the REMOVE, only one observation per unique ID makes it into the output, so the final dataset's observation count is the exact number of unique IDs found.

 

Edit: note that no explicit sorting is needed; dataset "big" can be read as is.

While loading "small" into the hash, a search tree is built, effectively sorting it on the fly.

Tom
Super User

You could also add a test to see if the hash is empty, so that you can stop reading the BIG dataset once every ID from the small dataset has been found.

if s.check() = 0 then do;
  output;
  rc = s.remove();
  if 0=s.num_items then stop;
end;
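Put together with the step above, the whole thing might look like this (just a sketch, using the same dataset and variable names as before; the STOP ends the step as soon as the hash runs empty):

data check;
set big (keep=id);
if _n_ = 1
then do;
  declare hash s (dataset:"small");
  s.definekey("id");
  s.definedone();
end;
if s.check() = 0
then do;
  output;
  rc = s.remove();
  /* all 30,000 IDs have been found - no need to read the rest of BIG */
  if 0 = s.num_items then stop;
end;
keep id;
run;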
Callam1
Obsidian | Level 7
Thank you. It looks efficient and elegant. Are you familiar with the approach where you read the small dataset into a format and then apply that format to the big dataset?
Kurt_Bremser
Super User

@Callam1 wrote:
Thank you. It looks efficient and elegant. Are you familiar with the approach where you read the small dataset into a format and then apply that format to the big dataset?

You can do that, but a format is slower than the hash, which becomes important when working with large datasets.

We used to do it a lot before the introduction of the hash.

Callam1
Obsidian | Level 7
Brilliant! Thank you so much!
Callam1
Obsidian | Level 7
Once I establish how many IDs are found in the big dataset, I might decide to analyse the data, in which case I will need to retain from the big dataset all of the records for the IDs (from 'small') that were found. Is the hash method still valid, or should I use the indexing approach suggested by Tom below?
Tom
Super User

HASH should be faster. (And the code is not that hard once you get used to working with hash objects.)
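For the keep-all-records case, a hash lookup without the REMOVE would do it; a rough sketch along those lines (the output name MATCHED is just illustrative; this keeps every observation from BIG whose ID appears in SMALL):

data matched;
set big;
if _n_ = 1
then do;
  declare hash s (dataset:"small");
  s.definekey("id");
  s.definedone();
end;
/* subsetting IF: keep the observation only when its ID is found in the hash */
if s.check() = 0;
run;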

 

The idea of using a format would be to generate a CNTLIN dataset to define the FORMAT (or perhaps an INFORMAT if ID is character).  Then use PROC FORMAT to create the format.  Then, in a data step that reads the BIG dataset, check each ID to see if the formatted value is the result you defined the format to return.  This would require reading all of the BIG dataset, and it would return all of the matching observations from BIG, not just one per ID.
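For reference, a rough sketch of that format-based approach, assuming ID is numeric (the format name SMALLIDS and the dataset names are just illustrative):

/* build a CNTLIN dataset: every ID in SMALL maps to 'Y', everything else to 'N' */
data cntlin;
  set small end=last;
  retain fmtname 'smallids' type 'N';
  start = id;
  label = 'Y';
  output;
  if last then do;
    hlo = 'O';      /* the OTHER range catches all IDs not in SMALL */
    label = 'N';
    output;
  end;
run;

proc format cntlin=cntlin;
run;

/* keep the observations from BIG whose ID formats to 'Y' */
data found_fmt;
  set big;
  if put(id, smallids.) = 'Y';
run;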

 

Another way to handle it is to make an index on the BIG dataset.  That might take some time, but it should be faster than sorting the dataset, and the index might be useful for other things.  Then you could run a data step that reads in the SMALL dataset and checks whether each ID is found in the BIG dataset.  This should be fastest (once you have paid the up-front cost of creating the index).

https://www.lexjansen.com/wuss/2003/SASSolutions/c-an_animated_guide_speed_merges__iorc_.pdf

So something like: 

data want;
  set small;
  set big key=id / unique;   /* use the index on BIG to look up each ID from SMALL */
  if _iorc_ then do;         /* nonzero _IORC_ means the ID was not found in BIG */
     _error_ = 0;            /* clear the error flag so the log stays clean */
     delete;
  end;
run;
Ksharp
Super User
proc sql;
  create table want as
  select distinct id
  from big
  where id in (select distinct id from small);
quit;
