topic Re: Run code sequential in New SAS User

Run code in parallel

Tanvi99 — Tue, 19 Feb 2019 19:36:52 GMT

I have 8Billion records and I have to do a self-join to the same data to get an overlap. This code needs to run approximately 41 times. Each iteration is taking approximately 2 hours. Is there a way to run it parallely.. like 5 channels at a time rather than wait for each one to complete?

Any suggestions are appreciated.

%macro mcchnl(channel);

proc sql;

create table &channel._overlap as

select

var1,

var2,

var3,

var4,

count(1) as overlap_cnt

from gmog.loaded_detail_&livedate. a , gmog.loaded_detail_&livedate. b

where a.pty_id=b.pty_id

and a.sub_channel=&channel. and b.sub_channel=&channel.

group by 1,2,3,…………;

quit;

%mend;

%mcchnl(chn1);

%mcchnl(chn2);

%mcchnl(chnl41);

Re: Run code sequential

Astounding — Tue, 19 Feb 2019 18:54:43 GMT

SAS has many ways to count things. Perhaps SQL is the wrong tool for the job this time.

Describe what you mean by "overlap" and what you are trying to count, and there will probably be a much faster way to get the job done.

Re: Run code sequential

SASKiwi — Tue, 19 Feb 2019 19:02:54 GMT

I suspect you mean run code in parallel (several streams at the same time) not sequential (one after the other).

Re: Run code sequential

Tanvi99 — Tue, 19 Feb 2019 19:34:12 GMT

agreed.. I meant parallel.. haha

Re: Run code sequential

mkeintz — Tue, 19 Feb 2019 19:42:46 GMT

You are doing a cartesian product of a data set with itself within each pty_id/sub_channel combination.

So let's say, out of your 8 billion records, you have 10 records of a specific subchannel within a specific pty_id, for which the Cartesian product would be 100 records.

So is it really your intention to produce and count 100 records, grouped by columns 1,2,3 ...? I do not understand the purpose of your code.

Also you have "a.subchannel=&subchannel and b.subchannel=&subchannel". Why not apply the subchannel filter as a "where=" parameter to the data set, as in

   from gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) a ,
            gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) b 
    where a.pty_id=b.pty_id
    group by 1,2,3 ...

This forces the sas data engine to do the subchannel filtering rather than reading in a record only to have the filter applied later on by sql.

Or better yet, do all your subchannels in one pass by

making the "where=" parameters to include all the subchannels of interest
modify the where clause to force equality of a.subchannel with b.subchannel
add subchannel as a group dimension:

    from gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) a ,
         gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) b 
    where a.pty_id=b.pty_id and a.subchannel=b.subchannel
    group by subchannel,1,2,3,..

Perhaps you are running into problems with the resources needed for the Cartesian crossing covering all 41 subchannels. If so, then change the "where subchannel in (....)" parameters to include a shorter list of subchannels.

But there may be more efficient ways to do this. Here are some questions whose answers might enable a more efficient program:

Are the data sorted by pty_id? by subchannel? or both?
Does every pty_id have only one subchannel?

I ask this because - if you really are trying to do Cartesian product of a dataset with itself, it might be easier to pass through the data once to get frequencies of each combination of subchannel, col1, col2, ... for every pty_id, generating a possibly much smaller aggregate dataset. Then you could to a Cartesian self-crossing of that aggregate data, where you apply a weight based on the original frequencies. Possibly much faster.

But this conjecture just makes me wonder what measure you are trying to develop by this program.

In other words, what do you mean by "overlap"?

Re: Run code sequential

Tanvi99 — Tue, 19 Feb 2019 20:18:41 GMT

Thanks Mkeintz for your reply.

the purpose is to do a Cartesian join,. We have offers for customers and we would like to see what other offers the same customer has , which we call overlap offers.

Your suggestion 1 with including where clause along with the table name is definitely efficient. I will implement that.

from gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) a ,
            gmog.loaded_detail_&livedate (where=(subchannel=&subchannel)) b
    where a.pty_id=b.pty_id
    group by 1,2,3 ...

Coming to

from gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) a ,
gmog.loaded_detail_&livedate (where=(subchannel in (chn1,chn2,...chn141))) b
where a.pty_id=b.pty_id and a.subchannel=b.subchannel group by subchannel,1,2,3,..

the dataset has all the 41 channels and the doing a join by pty_id and subchannel wasn't working because the data is huge. I tried that and had no luck. so had to do it by each individual channel.

thanks for the inputs.

Re: Run code sequential

mkeintz — Tue, 19 Feb 2019 22:06:40 GMT

If 41 subchannels is too much, you might be able to do, say, 5 subchannels at a time

Re: Run code in parallel

Satish_Parida — Fri, 22 Feb 2019 07:25:02 GMT

1. Is all these runs independent of each other?
2. Can you check your IO, CPU and memory consumption during one of the run.
Running parallel is a tricky business, if your IO capacity is overloaded then you will have no benefit by creating parallel sessions.
However it is very easy to create a child session and mange it in Base SAS.