Re: Another efficient method instead of cartesian product (cross join)

wenzli25 · Posted 03-23-2018 05:19 AM

I have a SAS code question, and I'd like to ask if there is a more efficient way to solve my problem.

Target:
Through the customer's product holding and the product combination datasets, generated the recommended product based on the product holding of each customer. It's like association model.

Method :
I have two datasets:
1st, product combination records

data AAA;

infile datalines missover; 
input SET_SIZE COUNT ITEM1 $ ITEM2 $ ITEM3 $ ITEM4 $ ITEM5 $ ;
datalines;
2 50 DP1 DP2 
2 40 DP2 DP3
2 39 AC1 AC2
2 30 AB1 AB2
3 30 DP1 DP2 DP3
3 20 AB1 AB2 DP1
3 20 AC1 AC2 AB1
4 10 DP1 DP2 DP3 AC1
5 10 AB1 AB2 AC1 DP1
;

2nd, the products held by the customer

data BBB;

infile datalines missover; 
input ID $ count_of_product HOLD1 $ HOLD2 $ HOLD3 $ HOLD4 $ HOLD5 $ HOLD6 $ HOLD7 $ HOLD8 $ HOLD9 $ 

datalines;
ID01 3 DP1 DP2 AB1
ID02 4 DP1 DP2 DP3 DP4 AC1
ID03 4 DP1 AB1 AB2 AC1 AC2
ID04 3 AB1 AB2 AC1
ID05 2 AB1 DP1
ID06 8 AB1 AB2 DP1 DP2 DP3 AC1 AC2 GB1
;

I want to generate a dataset which record
1st, For each product combination, products held by customers and products not held by customers.
2nd, Calculate the number of products held by the customer in the product combination and the number of differentiation.

After the calculation, I sort by the number of differences between the customer and the product combination, and the count of the combination.Then, I delete duplicate records by ID and the recommended product.

Process of Implementation:

1. cartesian product (cross join)

PROC SQL ;		
   CREATE TABLE CCC AS 
SELECT t1.*, t2.*	
      FROM AAA t1 , BBB t2;
QUIT;

2. Comparison and calculation

DATA DDD;			
	SET CCC;		
ARRAY HOLD[9] $ HOLD1-HOLD9;			
ARRAY ITEM[5] $ ITEM1-ITEM5;			
ARRAY FLAG[5] FLAG1-FLAG5(0,0,0,0,0);			
ARRAY PRODUCT[5] $34 PRODUCT1-PRODUCT5;			
ARRAY INCLUD[5] $34 INCLUD1-INCLUD5;			
INCLUDED = 0;			
A=1;			
B=1;			
DO I=1 TO DIM(ITEM);			
	DO J=1 TO DIM(HOLD);		
		IF (ITEM[I]^="" AND ITEM[I]=HOLD[J]) THEN DO;	
			INCLUDED=INCLUDED+1;
			FLAG[I]=1;
			LEAVE;LEAVE;
			END;
		ELSE FLAG[I]=0;	
	END;		
END;			
			
DO K=1 TO DIM(FLAG);			
IF FLAG[K]=0 THEN DO;			
	PRODUCT[A]=ITEM[K];		
	A=A+1;		
END;			
ELSE IF FLAG[K]=1 THEN DO;			
	INCLUD[B]=ITEM[K];		
	B=B+1;		
END;			
END;			
			
OVERLAP = COUNT_OF_PRODUCT - INCLUDED;			
GAP = SET_SIZE - INCLUDED;			
			
RUN;

3. Sort and delete duplicates

PROC SQL ;	

   CREATE TABLE EEE AS 	
   SELECT DISTINCT t1.ID, 	
          t1.COUNT, 	
          t1.PRODUCT1, 	
          t1.INCLUD1, 	
          t1.INCLUD2, 	
          t1.INCLUD3, 	
          t1.INCLUD4, 	
          t1.INCLUD5	
      FROM DDD t1	
      WHERE t1.GAP = 1	
      ORDER BY t1.ID,	
               t1.OVERLAP,	
               t1.COUNT DESC;	
QUIT;

proc sort data=EEE out=FFF nodupkey; by ID product1; run;

Question:

Cuz I need to do "cross join" and then "combination match". It takes lots of time. I'd like to ask if there is anyway (like hash table?but I don't know how to do) to shorten my process time. If you have any idea, please advise me. Many thanks. ^_^

Astounding · Posted 03-23-2018 08:30 AM

Certainly you could create your own Cartesian product without SQL Given that you need a DATA step for the Cartesian product (the logic is probably too complex to implement with CASE statements instead), that ought to be faster. That let's you combine the first two steps. The idea:

data DDD;

set BBB;

do k=1 to n_combinations;

set AAA point=k nobs=n_combinations;

*** lots of processing, carefully done to mimic current DATA step, while overcoming that variables are retained;

output;

end;

run;

data_null__ · Posted 03-23-2018 09:52 AM

@Astounding wrote:

Certainly you could create your own Cartesian product without SQL Given that you need a DATA step for the Cartesian product (the logic is probably too complex to implement with CASE statements instead), that ought to be faster. That let's you combine the first two steps. The idea:

data DDD;

set BBB;

do k=1 to n_combinations;

set AAA point=k nobs=n_combinations;

*** lots of processing, carefully done to mimic current DATA step, while overcoming that variables are retained;

output;

end;

run;

Using SASFILE to load AAA would likely speed up this step.

sasfile aaa open;

wenzli25 · Posted 05-23-2018 08:59 AM

Sorry for the late reply. I tried @Astounding's method, but it failed. I'm not sure why.... still working on it. Could you tell me where should I modify "sasfile aaa open;" ?

Another efficient method instead of cartesian product (cross join)