BookmarkSubscribeRSS Feed
Hugochum
Calcite | Level 5

What I need to do, boils down (more or less) to joining two huge datasets A and B.

The first approach is of course:

 

proc sql;
  create table AB as select * from A join B on 1;
quit;

But this is infeasible due to memory restrictions.

 

The obvious alternative is to loop, for each observation of A, through each observation of B. My current naive approach (ab)uses the POINT option of the SET statement:

 

data AB;
  set A;
  do point_B = 1 to nobs_B;
    set B point = point_B nobs = nobs_B;
    output;
  end;
run;

This works - but is incredibly slow. Certainly, this can be remedied by replacing the random access of POINT by sequential access. Unfortunately, I was not able to figure out how this can be achieved. So any suggestions are greatly appreciated.

 

For instance, what would be nice to have is some option RESET (similar to KEYRESET) making the following code work after uncommenting reset = end_A:

data AB;
  do while(not end_A);
    set A end = end_A;
    do while(not end_B);
      set B end = end_B /* reset = end_A */;
      output;
    end;
  end;
run;

(If there is no other solution, I might have to split A into multiple smaller datasets first...)

6 REPLIES 6
PeterClemmensen
Tourmaline | Level 20

My first two questions when I see a problem like this are:

 

1) Why do you want to create a cartesian product? Usually this is part of a bigger problem.

 

2) What does your data look like?

Hugochum
Calcite | Level 5

Roughly, the data has the following form:

data A;
  /* ... */
  client_id = 1234567890; value = 42; output;
  /* ... */
run;

data B;
  /* ... */
  date = '01JAN1997'd; factor = 0.2; output;
  /* ... */
run;

The dataset to store should list a score for each client_id and date, like so:

data AB;
  set A;
  do point_B = 1 to nobs_B;
    set B point = point_B nobs = nobs_B;
    score = factor * value;
    output;
  end;
run;

 

Probably there is an adhoc solution for this concrete problem (like splitting A into chunks). But it would be quite instructive also for future applications to know how to efficiently loop through multiple datasets.

 

ChrisNZ
Tourmaline | Level 20

1. What sorts of volumes are we talking about here? Which is is the smaller table (in bytes: nb rows * row length)?

2. What will the resulting Cartesian product data set be used for?

3. When dealing with very large tables, it is usually a good idea to store the data set in SPDE format with binary compression, in order to save space and to lower I/Os

4. To achieve your goal you can replace the POINT= iteration with a hash table iteration. This is much faster.

5. If you don't have enough memory to load any of the data sets in a hash table (hash table have a sizeable overhead), try using POINT= after loading a table in memory using SASFILE.

6. If you can't load a full table in memory, you'll have to split the join into smaller chunks.

7. You could also not join anything:

- create a format with the value of FACTOR for every date (use option CNTLOUT=). 
- add a DO loop in the data step that loops through all the known dates and retrieves the value for FACTOR for each iteration.

 

 

 

 

PhilC
Rhodochrosite | Level 12

I have never encountered SASFILE!  I am glad I subscribed to this post.  

ChrisNZ
Tourmaline | Level 20

Yes, one hidden gem that is underused for sure. Another is SPDE. More good stuff in my book! 😉

Tom
Super User Tom
Super User

What is it that you are actually trying to do? How many DATEs are there?  Once you've split VALUE by all of those "factors" what are you going to do with it? Do you really need to re-write that large cartesian product back out into a disk file?  Why not just finish your calculations and just save the statistic that summarizes the result?

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1812 views
  • 4 likes
  • 5 in conversation