<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic: Work with large datasets in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949041#M47444</link>
    <description>Can people advise on the most efficient/quickest way to check whether a dataset with millions of records (millions of individuals, where each individual has multiple records) contains a list of unique individuals as specified in a small dataset (30,000 unique IDs)? So I have a dataset called ‘small’ with one variable, ID, containing 30,000 unique records. Then a dataset called ‘big’ with the variable ID plus other variables, and millions of records (ID not unique). I want to find how many individuals from ‘small’ I can find in ‘big’. It is taking me days to sort by ID and merge. There must be a more efficient way. Thanks in advance</description>
    <pubDate>Fri, 25 Oct 2024 14:52:35 GMT</pubDate>
    <dc:creator>Callam1</dc:creator>
    <dc:date>2024-10-25T14:52:35Z</dc:date>
    <item>
      <title>Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949041#M47444</link>
      <description>Can people advise on the most efficient/quickest way to check whether a dataset with millions of records (millions of individuals, where each individual has multiple records) contains a list of unique individuals as specified in a small dataset (30,000 unique IDs)? So I have a dataset called ‘small’ with one variable, ID, containing 30,000 unique records. Then a dataset called ‘big’ with the variable ID plus other variables, and millions of records (ID not unique). I want to find how many individuals from ‘small’ I can find in ‘big’. It is taking me days to sort by ID and merge. There must be a more efficient way. Thanks in advance</description>
      <pubDate>Fri, 25 Oct 2024 14:52:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949041#M47444</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-25T14:52:35Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949044#M47445</link>
      <description>&lt;P&gt;Something like this should work:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;proc sql;
   create table work.found as
   select distinct a.id
   from work.small as a
        left join
        (select id from work.big) as b
        on a.id = b.id
   where not missing(b.id)
   ;
quit;&lt;/PRE&gt;
&lt;P&gt;However, if the dataset is very large, the processing will still take some time.&lt;/P&gt;
&lt;P&gt;The output dataset work.found will only have the ID values that matched, each appearing once (the DISTINCT does that; normally a left join would have the ID appear once for each match found).&lt;/P&gt;
&lt;P&gt;A left join would normally include all the values from the A dataset, but the WHERE clause instructs SAS to include only the ones with a match.&lt;/P&gt;
&lt;P&gt;The (select id from work.big) means only that variable is read, so less data is moved around and the query may run faster.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 15:14:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949044#M47445</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2024-10-25T15:14:12Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949045#M47446</link>
      <description>Thank you, I was considering that as the next thing to try, but I am still concerned about run time.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Fri, 25 Oct 2024 15:22:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949045#M47446</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-25T15:22:11Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949050#M47447</link>
      <description>&lt;P&gt;Use a hash:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data check;
set big (keep=id);
if _n_ = 1
then do;
  declare hash s (dataset:"small");
  s.definekey("id");
  s.definedone();
end;
if s.check() = 0
then do;
  rc = s.remove();
 output;
end;
keep id;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Because of the REMOVE, only one observation per unique ID makes it into the output, so the final dataset has the exact count of unique found IDs.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Edit: note that no explicit sorting is needed, dataset "big" can be read as is.&lt;/P&gt;
&lt;P&gt;While loading "small" into the hash, a search tree is built, effectively sorting it on the fly.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Oct 2024 17:38:38 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949050#M47447</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-10-25T17:38:38Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949081#M47451</link>
      <description>&lt;P&gt;You could also add a test to see if the hash is empty so you could stop reading the BIG dataset if every ID from the small dataset was already found.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;if s.check() = 0 then do;
  output;
  rc = s.remove();
  if 0=s.num_items then stop;
end;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 26 Oct 2024 06:21:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949081#M47451</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2024-10-26T06:21:34Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949091#M47452</link>
      <description>&lt;P&gt;Good idea. Given 30,000 vs. millions, the early stop will almost surely outweigh the additional IF, performance-wise.&lt;/P&gt;</description>
      <pubDate>Sat, 26 Oct 2024 12:14:21 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949091#M47452</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-10-26T12:14:21Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949099#M47453</link>
      <description>Thank you. It looks efficient and elegant. Are you familiar with the approach where you read the small dataset as a format and then apply that format to the big dataset?</description>
      <pubDate>Sat, 26 Oct 2024 17:32:31 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949099#M47453</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-26T17:32:31Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949100#M47454</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/460764"&gt;@Callam1&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Thank you. It looks efficient and elegant. Are you familiar with the approach where you read the small dataset as a format then you apply that format to the big dataset.&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;You can do that, but a format is slower than the hash, which becomes important when working with large datasets.&lt;/P&gt;
&lt;P&gt;We used to do it a lot before the introduction of the hash.&lt;/P&gt;</description>
      <pubDate>Sat, 26 Oct 2024 18:13:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949100#M47454</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-10-26T18:13:02Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949101#M47455</link>
      <description>&lt;P&gt;HASH should be faster. (And the code is not that hard once you get used to working with HASH.)&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The idea of using a format would be to generate a CNTLIN dataset to define the FORMAT (or perhaps an INFORMAT if ID is character).&amp;nbsp; Then use PROC FORMAT to create the format.&amp;nbsp; Then, in a data step that reads the BIG dataset, check each ID to see if the formatted value is the result you defined the format to return.&amp;nbsp; This would require reading all of the BIG dataset, and it would return all of the matching observations from BIG, not just one per ID.&lt;/P&gt;
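&lt;P&gt;A minimal sketch of that format approach might look like this (assuming a numeric ID; the format name FLAGFMT and the variable names in the CNTLIN dataset are illustrative, not from the original post):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* build a CNTLIN dataset: each ID from small maps to 'Y', everything else to 'N' */
data cntlin;
  set small (rename=(id=start)) end=last;
  retain fmtname 'flagfmt' type 'N' label 'Y';
  output;
  if last then do;
    hlo = 'O';       /* OTHER: any ID not listed in small */
    label = 'N';
    output;
  end;
run;

proc format cntlin=cntlin;
run;

/* keep only the records of big whose ID is flagged by the format */
data matched;
  set big;
  if put(id, flagfmt.) = 'Y';
run;&lt;/CODE&gt;&lt;/PRE&gt;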
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another way to handle it is to create an index on the BIG dataset.&amp;nbsp; That might take some time, but it should be faster than sorting the dataset, and the index might be useful for other things.&amp;nbsp; Then you could run a data step that reads the SMALL dataset and checks whether each ID is found in the BIG dataset.&amp;nbsp; This should be fastest (once you have paid the up-front cost of creating the index).&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.lexjansen.com/wuss/2003/SASSolutions/c-an_animated_guide_speed_merges__iorc_.pdf" target="_blank"&gt;https://www.lexjansen.com/wuss/2003/SASSolutions/c-an_animated_guide_speed_merges__iorc_.pdf&lt;/A&gt;&lt;/P&gt;
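&lt;P&gt;Creating such an index could be done along these lines (a sketch; it assumes BIG lives in the WORK library):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* build a simple index on ID once; subsequent key= lookups reuse it */
proc datasets library=work nolist;
  modify big;
  index create id;
quit;&lt;/CODE&gt;&lt;/PRE&gt;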
&lt;P&gt;So something like:&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
  set small;
  set big key=id/unique;
  if _iorc_ then do;
     _error_=0;
     delete;
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 26 Oct 2024 18:37:00 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949101#M47455</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2024-10-26T18:37:00Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949103#M47456</link>
      <description>Brilliant! Thank you so much!&lt;BR /&gt;</description>
      <pubDate>Sat, 26 Oct 2024 19:36:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949103#M47456</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-26T19:36:11Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949107#M47457</link>
      <description>&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc sql;
create table want as
select distinct id
 from big
  where id in (select distinct id from small);
quit;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 27 Oct 2024 00:56:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949107#M47457</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2024-10-27T00:56:54Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949116#M47458</link>
      <description>Once I establish how many IDs are found in the big dataset, I might decide to analyse the data, in which case I will need to retain from the big dataset all of the records for the IDs (from small) that were found. Is the hash method still valid, or should I use the indexing approach suggested by Tom below?</description>
      <pubDate>Sun, 27 Oct 2024 08:55:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949116#M47458</guid>
      <dc:creator>Callam1</dc:creator>
      <dc:date>2024-10-27T08:55:12Z</dc:date>
    </item>
    <item>
      <title>Re: Work with large datasets</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949124#M47459</link>
      <description>Just omit the REMOVE method and modify the KEEP statement to include all variables needed.</description>
      <pubDate>Sun, 27 Oct 2024 09:49:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/Work-with-large-datasets/m-p/949124#M47459</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2024-10-27T09:49:05Z</dc:date>
    </item>
  </channel>
</rss>

