I have 8Billion records and I have to do a self-join to the same data to get an overlap. This code needs to run approximately 41 times. Each iteration is taking approximately 2 hours. Is there a way to run it parallely.. like 5 channels at a time rather than wait for each one to complete?
Any suggestions are appreciated.
%macro mcchnl(channel);
proc sql;
create table &channel._overlap as
select
var1,
var2,
var3,
var4,
..
..
..
count(1) as overlap_cnt
from gmog.loaded_detail_&livedate. a , gmog.loaded_detail_&livedate. b
where a.pty_id=b.pty_id
and a.sub_channel=&channel. and b.sub_channel=&channel.
group by 1,2,3,…………;
quit;
%mend;
%mcchnl(chn1);
%mcchnl(chn2);
..
..
%mcchnl(chnl41);
SAS has many ways to count things. Perhaps SQL is the wrong tool for the job this time.
Describe what you mean by "overlap" and what you are trying to count, and there will probably be a much faster way to get the job done.
I suspect you mean run code in parallel (several streams at the same time) not sequential (one after the other).
agreed.. I meant parallel.. haha
You are doing a cartesian product of a data set with itself within each pty_id/sub_channel combination.
So let's say, out of your 8 billion records, you have 10 records of a specific subchannel within a specific pty_id, for which the Cartesian product would be 100 records.
So is it really your intention to produce and count 100 records, grouped by columns 1,2,3 ...? I do not understand the purpose of your code.
Also you have "a.subchannel=&subchannel and b.subchannel=&subchannel". Why not apply the subchannel filter as a "where=" parameter to the data set, as in
from gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) a ,
gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) b
where a.pty_id=b.pty_id
group by 1,2,3 ...
This forces the sas data engine to do the subchannel filtering rather than reading in a record only to have the filter applied later on by sql.
Or better yet, do all your subchannels in one pass by
from gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) a ,
gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) b
where a.pty_id=b.pty_id and a.subchannel=b.subchannel
group by subchannel,1,2,3,..
Perhaps you are running into problems with the resources needed for the Cartesian crossing covering all 41 subchannels. If so, then change the "where subchannel in (....)" parameters to include a shorter list of subchannels.
But there may be more efficient ways to do this. Here are some questions whose answers might enable a more efficient program:
I ask this because - if you really are trying to do Cartesian product of a dataset with itself, it might be easier to pass through the data once to get frequencies of each combination of subchannel, col1, col2, ... for every pty_id, generating a possibly much smaller aggregate dataset. Then you could to a Cartesian self-crossing of that aggregate data, where you apply a weight based on the original frequencies. Possibly much faster.
But this conjecture just makes me wonder what measure you are trying to develop by this program.
In other words, what do you mean by "overlap"?
Thanks Mkeintz for your reply.
the purpose is to do a Cartesian join,. We have offers for customers and we would like to see what other offers the same customer has , which we call overlap offers.
Your suggestion 1 with including where clause along with the table name is definitely efficient. I will implement that.
from gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) a ,
gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) b
where a.pty_id=b.pty_id
group by 1,2,3 ...
Coming to
from gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) a ,
gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) b
where a.pty_id=b.pty_id and a.subchannel=b.subchannel group by subchannel,1,2,3,..
the dataset has all the 41 channels and the doing a join by pty_id and subchannel wasn't working because the data is huge. I tried that and had no luck. so had to do it by each individual channel.
thanks for the inputs.
If 41 subchannels is too much, you might be able to do, say, 5 subchannels at a time
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.