DATA Step, Macro, Functions and more

Cartesian Product Efficiencies

Accepted Solution Solved
Reply
Super Contributor
Posts: 318
Accepted Solution

Cartesian Product Efficiencies

I have a 200k customer base  and would like to do compare each record to other records in the data set.

 

After merging the current row of comparison to all other records, I'm planning to do some matching or comparison of some fields. 

 

Appreciate for any recommendation on how to speed this up as let's say the customer base grow bigger, would it be advisable to split the base table and the do the merging multiple times or merging with the whole table is better?

 

Thanks in advance!

 

data current;
n=1; 
set base point=n;
output;
stop;
run;

do test;
set current(rename=(id=curr_id var1=curr_var1 var2=curr_var2));
do i=1 to 200000;
set base point=i;
output;
end;
run;

 

 


Accepted Solutions
Solution
‎04-03-2016 10:12 PM
Super User
Posts: 17,829

Re: Cartesian Product Efficiencies

Ah. Did you take a look at links from your previous question? You never responded back to it or marked it as answered. 

 

https://communities.sas.com/t5/Base-SAS-Programming/Record-Matching-Across-Single-Table/m-p/259302/h...

 

The discussion I linked to has a solution from FriedEgg that's quite useful. 

Unfortunately with the type of analysis there's not a lot of efficiencies to be gained. The more fields you have to restrict the comparison is best, ie maybe gender or state? 

 

View solution in original post


All Replies
Super User
Posts: 3,106

Re: Cartesian Product Efficiencies

One approach would be using a hash table where it is simply a copy of the original SAS table. 200K rows are no problem for hash. This post contains a example you could use:

 

https://communities.sas.com/t5/SAS-Data-Management/Extracting-subset-of-observations-without-merging...

 

 

Super User
Posts: 5,256

Re: Cartesian Product Efficiencies

Well, I can't see that there should be any significant difference in total run time between the two options, so it depends on how you wish to plan the execution, but from a quick calculation, the total Cartesian product will require quite a few TB.

"...do some matching...".

To speed a Cartesian join up, would be not to do a clean Cartesian product. Don't you have any pre requisites for a match? DO you really need to to join all customers with each other?

A way to speed it up (but not lessen the need for disk) is to load the table in memory suing a hash table. In the data step it would require little more of object programming. You could accomplish this in SQL as well by setting the SQL buffersize option (you can't really control this, just make it more likely for the SQL optimizer to chose this method).

Data never sleeps
Super Contributor
Posts: 318

Re: Cartesian Product Efficiencies

The matching I'm referring to is more on some fuzzy matches. 

 

An example with be using complev on var1.

 

Or would you suggest that on the joining condition I already use complev?

 

Greatly appreciate your inputs. Thank you!

Super User
Posts: 17,829

Re: Cartesian Product Efficiencies

What type of analysis are you trying to achieve that requires Cartesian product? 

 

Super Contributor
Posts: 318

Re: Cartesian Product Efficiencies

I want to match each record to the rest of the dataset and identify possibility that record A and B are the same based on some variables(var1 and var2 in my example).

 

One fuzzy match I want to explore is using complev and then just keep the records with a low complev value.

Solution
‎04-03-2016 10:12 PM
Super User
Posts: 17,829

Re: Cartesian Product Efficiencies

Ah. Did you take a look at links from your previous question? You never responded back to it or marked it as answered. 

 

https://communities.sas.com/t5/Base-SAS-Programming/Record-Matching-Across-Single-Table/m-p/259302/h...

 

The discussion I linked to has a solution from FriedEgg that's quite useful. 

Unfortunately with the type of analysis there's not a lot of efficiencies to be gained. The more fields you have to restrict the comparison is best, ie maybe gender or state? 

 

Super User
Posts: 6,938

Re: Cartesian Product Efficiencies

Also keep in mind that 200.000 * 200.000 = 40.000.000.000

Assuming that you could do a million compares per second, this still comes down to 40.000 seconds, or slightly more than 11 hours.

What kind of hardware does your SAS run on?

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super Contributor
Posts: 318

Re: Cartesian Product Efficiencies

Thanks for your inputs. I have just decided to have another field to be included to at least trim down the observations to be comapred and not have a single row compared to the whole base

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 445 views
  • 3 likes
  • 5 in conversation