What I need to do, boils down (more or less) to joining two huge datasets A and B.
The first approach is of course:
proc sql; create table AB as select * from A join B on 1; quit;
But this is infeasible due to memory restrictions.
The obvious alternative is to loop, for each observation of A, through each observation of B. My current naive approach (ab)uses the POINT option of the SET statement:
data AB; set A; do point_B = 1 to nobs_B; set B point = point_B nobs = nobs_B; output; end; run;
This works - but is incredibly slow. Certainly, this can be remedied by replacing the random access of POINT by sequential access. Unfortunately, I was not able to figure out how this can be achieved. So any suggestions are greatly appreciated.
For instance, what would be nice to have is some option RESET (similar to KEYRESET) making the following code work after uncommenting reset = end_A
:
data AB;
do while(not end_A);
set A end = end_A;
do while(not end_B);
set B end = end_B /* reset = end_A */;
output;
end;
end;
run;
(If there is no other solution, I might have to split A into multiple smaller datasets first...)
My first two questions when I see a problem like this are:
1) Why do you want to create a cartesian product? Usually this is part of a bigger problem.
2) What does your data look like?
Roughly, the data has the following form:
data A;
/* ... */
client_id = 1234567890; value = 42; output;
/* ... */
run;
data B;
/* ... */
date = '01JAN1997'd; factor = 0.2; output;
/* ... */
run;
The dataset to store should list a score
for each client_id
and date
, like so:
data AB;
set A;
do point_B = 1 to nobs_B;
set B point = point_B nobs = nobs_B;
score = factor * value;
output;
end;
run;
Probably there is an adhoc solution for this concrete problem (like splitting A into chunks). But it would be quite instructive also for future applications to know how to efficiently loop through multiple datasets.
1. What sorts of volumes are we talking about here? Which is is the smaller table (in bytes: nb rows * row length)?
2. What will the resulting Cartesian product data set be used for?
3. When dealing with very large tables, it is usually a good idea to store the data set in SPDE format with binary compression, in order to save space and to lower I/Os
4. To achieve your goal you can replace the POINT= iteration with a hash table iteration. This is much faster.
5. If you don't have enough memory to load any of the data sets in a hash table (hash table have a sizeable overhead), try using POINT= after loading a table in memory using SASFILE.
6. If you can't load a full table in memory, you'll have to split the join into smaller chunks.
7. You could also not join anything:
- create a format with the value of FACTOR for every date (use option CNTLOUT=).
- add a DO loop in the data step that loops through all the known dates and retrieves the value for FACTOR for each iteration.
I have never encountered SASFILE! I am glad I subscribed to this post.
Yes, one hidden gem that is underused for sure. Another is SPDE. More good stuff in my book! 😉
What is it that you are actually trying to do? How many DATEs are there? Once you've split VALUE by all of those "factors" what are you going to do with it? Do you really need to re-write that large cartesian product back out into a disk file? Why not just finish your calculations and just save the statistic that summarizes the result?
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.