topic Re: Reset observation pointer in SAS Programming

Reset observation pointer

Hugochum — Fri, 12 Feb 2021 13:54:53 GMT

What I need to do, boils down (more or less) to joining two huge datasets A and B.

The first approach is of course:

proc sql;
  create table AB as select * from A join B on 1;
quit;

But this is infeasible due to memory restrictions.

The obvious alternative is to loop, for each observation of A, through each observation of B. My current naive approach (ab)uses the POINT option of the SET statement:

data AB;
  set A;
  do point_B = 1 to nobs_B;
    set B point = point_B nobs = nobs_B;
    output;
  end;
run;

This works - but is incredibly slow. Certainly, this can be remedied by replacing the random access of POINT by sequential access. Unfortunately, I was not able to figure out how this can be achieved. So any suggestions are greatly appreciated.

For instance, what would be nice to have is some option RESET (similar to KEYRESET) making the following code work after uncommenting reset = end_A:

data AB;
  do while(not end_A);
    set A end = end_A;
    do while(not end_B);
      set B end = end_B /* reset = end_A */;
      output;
    end;
  end;
run;

(If there is no other solution, I might have to split A into multiple smaller datasets first...)

Re: Reset observation pointer

PeterClemmensen — Fri, 12 Feb 2021 14:24:23 GMT

My first two questions when I see a problem like this are:

1) Why do you want to create a cartesian product? Usually this is part of a bigger problem.

2) What does your data look like?

Re: Reset observation pointer

Hugochum — Fri, 12 Feb 2021 15:04:04 GMT

Roughly, the data has the following form:

data A;
  /* ... */
  client_id = 1234567890; value = 42; output;
  /* ... */
run;

data B;
  /* ... */
  date = '01JAN1997'd; factor = 0.2; output;
  /* ... */
run;

The dataset to store should list a score for each client_id and date, like so:

data AB;
  set A;
  do point_B = 1 to nobs_B;
    set B point = point_B nobs = nobs_B;
    score = factor * value;
    output;
  end;
run;

Probably there is an adhoc solution for this concrete problem (like splitting A into chunks). But it would be quite instructive also for future applications to know how to efficiently loop through multiple datasets.

Re: Reset observation pointer

ChrisNZ — Sun, 14 Feb 2021 23:07:14 GMT

1. What sorts of volumes are we talking about here? Which is is the smaller table (in bytes: nb rows * row length)?

2. What will the resulting Cartesian product data set be used for?

3. When dealing with very large tables, it is usually a good idea to store the data set in SPDE format with binary compression, in order to save space and to lower I/Os

4. To achieve your goal you can replace the POINT= iteration with a hash table iteration. This is much faster.

5. If you don't have enough memory to load any of the data sets in a hash table (hash table have a sizeable overhead), try using POINT= after loading a table in memory using SASFILE.

6. If you can't load a full table in memory, you'll have to split the join into smaller chunks.

7. You could also not join anything:

- create a format with the value of FACTOR for every date (use option CNTLOUT=).
- add a DO loop in the data step that loops through all the known dates and retrieves the value for FACTOR for each iteration.

Re: Reset observation pointer

Tom — Mon, 15 Feb 2021 00:37:13 GMT

What is it that you are actually trying to do? How many DATEs are there? Once you've split VALUE by all of those "factors" what are you going to do with it? Do you really need to re-write that large cartesian product back out into a disk file? Why not just finish your calculations and just save the statistic that summarizes the result?

Re: Reset observation pointer

PhilC — Tue, 16 Feb 2021 16:59:05 GMT

I have never encountered SASFILE! I am glad I subscribed to this post.

Re: Reset observation pointer

ChrisNZ — Tue, 16 Feb 2021 20:42:23 GMT

Yes, one hidden gem that is underused for sure. Another is SPDE. More good stuff in my book! 😉