Proc SQL self join efficiency

Accepted Solution Solved
Reply
Contributor
Posts: 36
Accepted Solution

Proc SQL self join efficiency

Dear fellow SAS users,

 

I have an proc sql programming effiency question.

 

Suppose I have a huge table, in which contains Customer at a given point in time, and two scores, Score1 and Score2.

 

Now I want to produce a table, in which contains that customer at that given point in time, Score1 and Score2, but also a forward-looking

3 quarter Score 1 and Score in the same row.

 

My clumsy solution is to self join 3 times, with incremental date, to achieve this result table.

 

But my question is, if the table is huge, so many self joins must be very inefficient and slow. Can there be a smarter way of doing this?

Thanks for your help. Kenneth

 

My test codes are below:

 

 

data test;

format Date yymmdd10.;

input Date yymmdd10. Customer $ Score1 $ Score2 $;

cards;

2010-03-30 John A B

2010-06-30 John A A

2010-09-30 John B B

2010-12-31 John C D

2010-03-30 Mary D E

2010-06-30 Mary C B

2010-09-30 Mary A B

2010-12-31 Mary D A

2010-03-30 Ted F A

2010-06-30 Ted C B

2010-09-30 Ted D B

2010-12-31 Ted A E

;

run;

proc sql;

create table test2 as

select a.Customer as Customer,

a.Date as Date,

a.Score1 as Score1_Q1,

a.Score2 as Score2_Q1,

case when missing(b.Score1) then '.' else b.Score1 end as Score1_Q2,

case when missing(b.Score2) then '.' else b.Score2 end as Score2_Q2,

case when missing(c.Score1) then '.' else c.Score1 end as Score1_Q3,

case when missing(c.Score2) then '.' else c.Score2 end as Score2_Q3,

case when missing(d.Score1) then '.' else d.Score1 end as Score1_Q4,

case when missing(d.Score2) then '.' else d.Score2 end as Score2_Q4

from test as a

 

left join test as b

on a.Customer = b.Customer

and intnx('Quarter', a.Date, 1, "end") = b.Date

 

left join test as c

on a.Customer = c.Customer

and intnx('Quarter', a.Date, 2, "end") = c.Date

 

left join test as d

on a.Customer = d.Customer

and intnx('Quarter', a.Date, 3, "end") = d.Date

;

quit;

 


Accepted Solutions
Solution
‎06-03-2016 08:30 AM
Contributor
Posts: 36

Re: Proc SQL self join efficiency

Dear PGStats,

Thank you so much for your wonderful code. With some modifications, your code worked elegantly. I am now posting my modification and the result, which is exactly what I want:

 

data test;

infile cards delimiter=',';

format Date yymmdd10.;

input Date yymmdd10. Customer $ Score1 $ Score2 $;

cards;

2009-09-30,John,C,A

2009-12-31,John,B,A

2010-03-31,John,A,B

2010-06-30,John,A,A

2010-09-30,John,B,B

2010-12-31,John, ,D

2011-03-31,John,E,A

2009-12-31,Mary,A,A

2010-03-31,Mary,D,E

2010-06-30,Mary,C,B

2010-09-30,Mary,A,B

2010-12-31,Mary,D,A

2011-03-31,Mary,A,A

2011-06-30,Mary,D,

2010-03-31,Ted,F,A

2010-06-30,Ted,C,B

2010-09-30,Ted,D,B

2010-12-31,Ted,A,E

2011-03-31,Ted,B,B

2011-06-30,Ted, ,C

2011-09-30,Ted,D,B

;

run;

 

 

data testAug;

set test;

array s score1 score2;

do qtr = 0 to 4;

date0 = intnx("QTR", date, -qtr, "END");

do sc = 1, 2;

score = s{sc};

output;

end;

end;

keep customer date0 qtr sc score;

format date0 yymmdd10.;

run;

proc sort data=testAug; by customer date0 qtr sc; run;

data want;

do until(last.date0);

set testAug;

by customer date0;

array s{0:4, 2} $8

score1q0 score2q0

score1q1 score2q1

score1q2 score2q2

score1q3 score2q3

score1q4 score2q4;

s{qtr,sc} = score;

end;

if not missing(score1q0) and not missing(score2q0) then output;

drop qtr sc score;

run;

 

And the result looks like this:

 

CustomerDate0Score1Q0Score2Q2Score1Q1Score2Q1Score1Q2Score2Q2Score1Q3Score2Q3Score1Q4Score2Q4
John2009-09-30CABAABAABB
John2009-12-31BAABAABB D
John2010-03-31ABAABB DEA
John2010-06-30AABB DEA  
John2010-09-30BB DEA    
John2011-03-31EA        
Mary2009-12-31AADECBABDA
Mary2010-03-31DECBABDAAA
Mary2010-06-30CBABDAAAD 
Mary2010-09-30ABDAAAD   
Mary2010-12-31DAAAD     
Mary2011-03-31AAD       
Ted2010-03-31FACBDBAEBB
Ted2010-06-30CBDBAEBB C
Ted2010-09-30DBAEBB CDB
Ted2010-12-31AEBB CDB  
Ted2011-03-31BB CDB    
Ted2011-09-30DB        

 

 

Thank you so very much!!

Ken

View solution in original post


All Replies
Super User
Posts: 5,516

Re: Proc SQL self join efficiency

This seems like an easy problem for a DATA step, without the time-consuming joins.  But it's a little difficult to visualize.  Perhaps you could give an example of what the final result should look like.  If possible, address what should happen if a customer has fewer than 4 records.

Contributor
Posts: 36

Re: Proc SQL self join efficiency

Posted in reply to Astounding

 The final table should look like this:

 

Customer Date Score1_Q1 Score2_Q1 Score1_Q2 Score2_Q2 Score1_Q3 Score2_Q3 Score1_Q4 Score2_Q4
John2010-03-30ABAABBCD
John2010-06-30AABBCD..
John2010-09-30BBCD....
John2010-12-31CD......
Mary2010-03-30DECBABDA
Mary2010-06-30CBABDA..
Mary2010-09-30ABDA....
Mary2010-12-31DA......
Ted2010-03-30FACBDBAE
Ted2010-06-30CBDBAE..
Ted2010-09-30DBAE....
Ted2010-12-31AE......
Contributor
Posts: 36

Re: Proc SQL self join efficiency

For each customer, there are many years of data, per quarter each year. The point is to make a table, for every given point in time, make a forward-looking or backward-looking 4 quarters Score1 and Score2 table.

Contributor
Posts: 36

Re: Proc SQL self join efficiency

CustomerDateScore1_Q1Score2_Q1Score1_Q2Score2_Q2Score1_Q3Score2_Q3Score1_Q4Score2_Q4
John2010-03-31ABAABBCD
John2010-06-30AABBCDEA
John2010-09-30BBCDEABD
John2010-12-31CDEABD..
John2011-03-31EABD....
John2011-06-30BD......
Mary2010-03-31DECBABDA
Mary2010-06-30CBABDAAA
Mary2010-09-30ABDAAADB
Mary2010-12-31DAAADB..
Mary2011-03-31AADB....
Mary2011-06-30DB......
Ted2010-03-31FACBDBAE
Ted2010-06-30CBDBAEBB
Ted2010-09-30DBAEBB..
Ted2010-12-31AEBB....
Ted2011-03-31BB......
Contributor
Posts: 36

Re: Proc SQL self join efficiency

I added some more quarters just to show what it looks like...

Super User
Posts: 5,437

Re: Proc SQL self join efficiency

Perhaps a hash table would be the most efficient given you hate amount RAM required?
May I why?
Transposing data like this is usually beat done in the report/analys phase. You could easily do cross tabulations if you your data in a star schema with a time dimension.
Data never sleeps
Contributor
Posts: 36

Re: Proc SQL self join efficiency

Thanks for your reply. It is unfortunately not a final reporting table, but in the middle of data crunching...

Contributor
Posts: 53

Re: Proc SQL self join efficiency

[ Edited ]

First of all, this may help - http://www.lexjansen.com/phuse/2008/tu/tu07.pdf

 

Attached is my program - score.sas

Note: This returns the date of the quarter end for which scores were ever recorded. i.e. if the data for a customer starts only in quarter 3, then the date returned will be that of quarter 3.

I know it has some inline views, I am hoping it executes subqueries first thus reducing the volume of the initial cartesian product Smiley Happy

 

But by most accounts,self-joins are better than subqueries and inline views. Temporary tables are even better.

 

I tested both your program and mine. The CPU times don't show a significant difference, which is expected without a significant amount of data. Smiley Happy 

 

Attached is the log for your code - orig_score.log

and mine - score.log

 

You could also try breaking up the query into parts and using datastep. You could split the data and do a sort-merge (I do not favour this method for large volumes of data).

 

If you find a better solution, please post it here. Smiley Happy Good luck!

 

Oh the result from my program is score.txt - is this how you see the result?

 

Contributor
Posts: 36

Re: Proc SQL self join efficiency

Posted in reply to PoornimaRavishankar

Thanks, I will test the code. Thanks a lot!!

Contributor
Posts: 36

Re: Proc SQL self join efficiency

Thanks for your code.

 

In reality, my data is huge, including 75 million rows and more than that, 88 columns, so it takes a long time to run those self joins.

 

Customers can start and finish in any quarter, that means, they don't all start at the same quarter, same year. Some last longer, some last shorter, so it is completely random.

 

What I was trying to do was to find out a 1-year horizon behavior pattern, at any given point of time in the customer's life cycle.

 

So that's why I made it in self join, but in a rather clumsy and inefficient way.

Respected Advisor
Posts: 4,932

Re: Proc SQL self join efficiency

This should be fairly efficient:

 

data testAug;
set test;
array s score1 score2;
do qtr = 0 to 3;
    dateQ0 = intnx("QTR", date, -qtr, "END");
    do sc = 1, 2;
        score = s{sc};
        output;
        end;
    end;
keep customer dateQ0 qtr sc score;
format dateQ0 yymmdd10.;
run;

proc sort data=testAug; by customer dateQ0 qtr sc; run;

data want;
do until(last.dateQ0);
    set testAug;
    by customer dateQ0;
    array s{0:3, 2} $8
        score1q0 score2q0 
        score1q1 score2q1 
        score1q2 score2q2
        score1q3 score2q3;
    s{qtr,sc} = score;
    end;
if not missing(score1q0) and not missing(score1q3) then output;
drop qtr sc score;
run;

proc print data=want noobs; 
run;
PG
Solution
‎06-03-2016 08:30 AM
Contributor
Posts: 36

Re: Proc SQL self join efficiency

Dear PGStats,

Thank you so much for your wonderful code. With some modifications, your code worked elegantly. I am now posting my modification and the result, which is exactly what I want:

 

data test;

infile cards delimiter=',';

format Date yymmdd10.;

input Date yymmdd10. Customer $ Score1 $ Score2 $;

cards;

2009-09-30,John,C,A

2009-12-31,John,B,A

2010-03-31,John,A,B

2010-06-30,John,A,A

2010-09-30,John,B,B

2010-12-31,John, ,D

2011-03-31,John,E,A

2009-12-31,Mary,A,A

2010-03-31,Mary,D,E

2010-06-30,Mary,C,B

2010-09-30,Mary,A,B

2010-12-31,Mary,D,A

2011-03-31,Mary,A,A

2011-06-30,Mary,D,

2010-03-31,Ted,F,A

2010-06-30,Ted,C,B

2010-09-30,Ted,D,B

2010-12-31,Ted,A,E

2011-03-31,Ted,B,B

2011-06-30,Ted, ,C

2011-09-30,Ted,D,B

;

run;

 

 

data testAug;

set test;

array s score1 score2;

do qtr = 0 to 4;

date0 = intnx("QTR", date, -qtr, "END");

do sc = 1, 2;

score = s{sc};

output;

end;

end;

keep customer date0 qtr sc score;

format date0 yymmdd10.;

run;

proc sort data=testAug; by customer date0 qtr sc; run;

data want;

do until(last.date0);

set testAug;

by customer date0;

array s{0:4, 2} $8

score1q0 score2q0

score1q1 score2q1

score1q2 score2q2

score1q3 score2q3

score1q4 score2q4;

s{qtr,sc} = score;

end;

if not missing(score1q0) and not missing(score2q0) then output;

drop qtr sc score;

run;

 

And the result looks like this:

 

CustomerDate0Score1Q0Score2Q2Score1Q1Score2Q1Score1Q2Score2Q2Score1Q3Score2Q3Score1Q4Score2Q4
John2009-09-30CABAABAABB
John2009-12-31BAABAABB D
John2010-03-31ABAABB DEA
John2010-06-30AABB DEA  
John2010-09-30BB DEA    
John2011-03-31EA        
Mary2009-12-31AADECBABDA
Mary2010-03-31DECBABDAAA
Mary2010-06-30CBABDAAAD 
Mary2010-09-30ABDAAAD   
Mary2010-12-31DAAAD     
Mary2011-03-31AAD       
Ted2010-03-31FACBDBAEBB
Ted2010-06-30CBDBAEBB C
Ted2010-09-30DBAEBB CDB
Ted2010-12-31AEBB CDB  
Ted2011-03-31BB CDB    
Ted2011-09-30DB        

 

 

Thank you so very much!!

Ken

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 1316 views
  • 1 like
  • 5 in conversation