BookmarkSubscribeRSS Feed
Tanvi99
Calcite | Level 5

I have 8Billion records and I have to do a self-join to the same data to get an overlap. This code needs to run approximately 41 times. Each iteration is taking approximately 2 hours. Is there a way to run it parallely.. like 5 channels at a time rather than wait for each one to complete?

Any suggestions are appreciated.

 

%macro mcchnl(channel);

proc sql;

create table  &channel._overlap as

select

var1,

var2,

var3,

var4,

..

..

..

 

count(1) as overlap_cnt

from gmog.loaded_detail_&livedate. a , gmog.loaded_detail_&livedate.  b 

where a.pty_id=b.pty_id

     and a.sub_channel=&channel. and b.sub_channel=&channel.

group by 1,2,3,…………;

quit;

%mend;

%mcchnl(chn1);

%mcchnl(chn2);

..

..

%mcchnl(chnl41);

 

7 REPLIES 7
Astounding
PROC Star

SAS has many ways to count things.  Perhaps SQL is the wrong tool for the job this time.

 

Describe what you mean by "overlap" and what you are trying to count, and there will probably be a much faster way to get the job done.

SASKiwi
PROC Star

I suspect you mean run code in parallel (several streams at the same time) not sequential (one after the other).

Tanvi99
Calcite | Level 5

agreed.. I meant parallel.. haha

mkeintz
PROC Star

You are doing a cartesian product of a data set with itself within each pty_id/sub_channel combination.

 

So let's say, out of your 8 billion records, you have 10 records of a specific subchannel within a specific pty_id, for which the Cartesian product would be 100 records.

 

So is it really your intention to produce and count 100 records, grouped by columns 1,2,3 ...?   I do not understand the purpose of your code.

 

Also you have "a.subchannel=&subchannel and b.subchannel=&subchannel".  Why not apply the subchannel filter as a "where=" parameter to the data set, as in

 

 

   from gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) a ,
            gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) b 
    where a.pty_id=b.pty_id
    group by 1,2,3 ...

This forces the sas data engine to do the subchannel filtering rather than reading in a record only to have the filter applied later on by sql.

 

 

Or better yet, do all your subchannels in one pass by

  1. making the "where=" parameters to include all the subchannels of interest
  2. modify the where clause to force equality of a.subchannel with b.subchannel
  3. add subchannel as a group dimension:
    from gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) a ,
         gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) b 
    where a.pty_id=b.pty_id and a.subchannel=b.subchannel
    group by subchannel,1,2,3,..

Perhaps you are running into problems with the resources needed for the Cartesian crossing covering all 41 subchannels.  If so, then change the "where subchannel in (....)" parameters to include a shorter list of subchannels.

 

But there may be more efficient ways to do this.  Here are some questions whose answers might enable a more efficient program:

  1. Are the data sorted by pty_id?  by subchannel?  or both?
  2. Does every pty_id have only one subchannel?

I ask this because - if you really are trying to do Cartesian product of a dataset with itself, it might be easier to pass through the data once to get frequencies of each combination of subchannel, col1, col2, ...  for every pty_id, generating a possibly much smaller aggregate dataset.  Then you could to a Cartesian self-crossing of that aggregate data, where you apply a weight based on the original frequencies.  Possibly much faster.

 

But this conjecture just makes me wonder what measure you are trying to develop by this program.

 

In other words, what do you mean by "overlap"?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Tanvi99
Calcite | Level 5

Thanks Mkeintz for your reply.

the purpose is to do a Cartesian join,. We have offers for customers and we would like to see what other offers the same customer has , which we call overlap offers.

Your suggestion 1 with including where clause along with the table name is definitely efficient. I will implement that.

from gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) a ,
            gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) b
    where a.pty_id=b.pty_id
    group by 1,2,3 ...

 

Coming to

from gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) a ,
         gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) b
    where a.pty_id=b.pty_id and a.subchannel=b.subchannel group by subchannel,1,2,3,..

 

the dataset has all the 41 channels and the doing a join by pty_id and subchannel wasn't working because the data is huge. I tried that and had no luck. so had to do it by each individual channel.

 

thanks for the inputs.

mkeintz
PROC Star

If 41 subchannels is too much, you might be able to do, say, 5 subchannels at a time

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Satish_Parida
Lapis Lazuli | Level 10
1. Is all these runs independent of each other?
2. Can you check your IO, CPU and memory consumption during one of the run.
Running parallel is a tricky business, if your IO capacity is overloaded then you will have no benefit by creating parallel sessions.
However it is very easy to create a child session and mange it in Base SAS.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 7 replies
  • 1236 views
  • 1 like
  • 5 in conversation