BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
milts
Pyrite | Level 9

I have a 200k customer base  and would like to do compare each record to other records in the data set.

 

After merging the current row of comparison to all other records, I'm planning to do some matching or comparison of some fields. 

 

Appreciate for any recommendation on how to speed this up as let's say the customer base grow bigger, would it be advisable to split the base table and the do the merging multiple times or merging with the whole table is better?

 

Thanks in advance!

 

data current;
n=1; 
set base point=n;
output;
stop;
run;

do test;
set current(rename=(id=curr_id var1=curr_var1 var2=curr_var2));
do i=1 to 200000;
set base point=i;
output;
end;
run;

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Ah. Did you take a look at links from your previous question? You never responded back to it or marked it as answered. 

 

https://communities.sas.com/t5/Base-SAS-Programming/Record-Matching-Across-Single-Table/m-p/259302/h...

 

The discussion I linked to has a solution from FriedEgg that's quite useful. 

Unfortunately with the type of analysis there's not a lot of efficiencies to be gained. The more fields you have to restrict the comparison is best, ie maybe gender or state? 

 

View solution in original post

8 REPLIES 8
SASKiwi
PROC Star

One approach would be using a hash table where it is simply a copy of the original SAS table. 200K rows are no problem for hash. This post contains a example you could use:

 

https://communities.sas.com/t5/SAS-Data-Management/Extracting-subset-of-observations-without-merging...

 

 

LinusH
Tourmaline | Level 20

Well, I can't see that there should be any significant difference in total run time between the two options, so it depends on how you wish to plan the execution, but from a quick calculation, the total Cartesian product will require quite a few TB.

"...do some matching...".

To speed a Cartesian join up, would be not to do a clean Cartesian product. Don't you have any pre requisites for a match? DO you really need to to join all customers with each other?

A way to speed it up (but not lessen the need for disk) is to load the table in memory suing a hash table. In the data step it would require little more of object programming. You could accomplish this in SQL as well by setting the SQL buffersize option (you can't really control this, just make it more likely for the SQL optimizer to chose this method).

Data never sleeps
milts
Pyrite | Level 9

The matching I'm referring to is more on some fuzzy matches. 

 

An example with be using complev on var1.

 

Or would you suggest that on the joining condition I already use complev?

 

Greatly appreciate your inputs. Thank you!

Reeza
Super User

What type of analysis are you trying to achieve that requires Cartesian product? 

 

milts
Pyrite | Level 9

I want to match each record to the rest of the dataset and identify possibility that record A and B are the same based on some variables(var1 and var2 in my example).

 

One fuzzy match I want to explore is using complev and then just keep the records with a low complev value.

Reeza
Super User

Ah. Did you take a look at links from your previous question? You never responded back to it or marked it as answered. 

 

https://communities.sas.com/t5/Base-SAS-Programming/Record-Matching-Across-Single-Table/m-p/259302/h...

 

The discussion I linked to has a solution from FriedEgg that's quite useful. 

Unfortunately with the type of analysis there's not a lot of efficiencies to be gained. The more fields you have to restrict the comparison is best, ie maybe gender or state? 

 

Kurt_Bremser
Super User

Also keep in mind that 200.000 * 200.000 = 40.000.000.000

Assuming that you could do a million compares per second, this still comes down to 40.000 seconds, or slightly more than 11 hours.

What kind of hardware does your SAS run on?

milts
Pyrite | Level 9

Thanks for your inputs. I have just decided to have another field to be included to at least trim down the observations to be comapred and not have a single row compared to the whole base

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1919 views
  • 3 likes
  • 5 in conversation