Reset observation pointer

Hugochum · Posted 02-12-2021 08:54 AM

What I need to do, boils down (more or less) to joining two huge datasets A and B.

The first approach is of course:

proc sql;
  create table AB as select * from A join B on 1;
quit;

But this is infeasible due to memory restrictions.

The obvious alternative is to loop, for each observation of A, through each observation of B. My current naive approach (ab)uses the POINT option of the SET statement:

data AB;
  set A;
  do point_B = 1 to nobs_B;
    set B point = point_B nobs = nobs_B;
    output;
  end;
run;

This works - but is incredibly slow. Certainly, this can be remedied by replacing the random access of POINT by sequential access. Unfortunately, I was not able to figure out how this can be achieved. So any suggestions are greatly appreciated.

For instance, what would be nice to have is some option RESET (similar to KEYRESET) making the following code work after uncommenting reset = end_A:

data AB;
  do while(not end_A);
    set A end = end_A;
    do while(not end_B);
      set B end = end_B /* reset = end_A */;
      output;
    end;
  end;
run;

(If there is no other solution, I might have to split A into multiple smaller datasets first...)

PeterClemmensen · Posted 02-12-2021 09:24 AM

My first two questions when I see a problem like this are:

1) Why do you want to create a cartesian product? Usually this is part of a bigger problem.

2) What does your data look like?

The DATA to DATA Step Macro
Blog: SASnrd

Hugochum · Posted 02-12-2021 10:02 AM

Roughly, the data has the following form:

data A;
  /* ... */
  client_id = 1234567890; value = 42; output;
  /* ... */
run;

data B;
  /* ... */
  date = '01JAN1997'd; factor = 0.2; output;
  /* ... */
run;

The dataset to store should list a score for each client_id and date, like so:

data AB;
  set A;
  do point_B = 1 to nobs_B;
    set B point = point_B nobs = nobs_B;
    score = factor * value;
    output;
  end;
run;

Probably there is an adhoc solution for this concrete problem (like splitting A into chunks). But it would be quite instructive also for future applications to know how to efficiently loop through multiple datasets.

ChrisNZ · Posted 02-14-2021 06:07 PM

1. What sorts of volumes are we talking about here? Which is is the smaller table (in bytes: nb rows * row length)?

2. What will the resulting Cartesian product data set be used for?

3. When dealing with very large tables, it is usually a good idea to store the data set in SPDE format with binary compression, in order to save space and to lower I/Os

4. To achieve your goal you can replace the POINT= iteration with a hash table iteration. This is much faster.

5. If you don't have enough memory to load any of the data sets in a hash table (hash table have a sizeable overhead), try using POINT= after loading a table in memory using SASFILE.

6. If you can't load a full table in memory, you'll have to split the join into smaller chunks.

7. You could also not join anything:

- create a format with the value of FACTOR for every date (use option CNTLOUT=).
- add a DO loop in the data step that loops through all the known dates and retrieves the value for FACTOR for each iteration.

High-Performance SAS Coding - Third Edition

PhilC · Posted 02-16-2021 11:59 AM

I have never encountered SASFILE! I am glad I subscribed to this post.

ChrisNZ · Posted 02-16-2021 03:42 PM

Yes, one hidden gem that is underused for sure. Another is SPDE. More good stuff in my book! 😉

High-Performance SAS Coding - Third Edition

Tom · Posted 02-14-2021 07:37 PM

What is it that you are actually trying to do? How many DATEs are there? Once you've split VALUE by all of those "factors" what are you going to do with it? Do you really need to re-write that large cartesian product back out into a disk file? Why not just finish your calculations and just save the statistic that summarizes the result?

Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Re: Reset observation pointer

Registration is open

SAS Training: Just a Click Away