Solved: Re: proc sql or data step

woo · Posted 03-21-2014 01:17 PM

Hi Friends i am confused on what to use and which one would run fast..., proc sql or data step

I am doing inner join between two sas datasets on one variable which is running very long, of course because i have around 10 millions of records in each sas datasets, i am using following proc sql,

please let me know if there is any alternative...

Thanks in advance...

proc sql;

create table test_a as select * from work.a inner join work.b

on a.test_var=b.test_var;

quit;

Tom · Posted 03-21-2014 09:46 PM

Keep your data sorted (or at least indexed) and use a data step. This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).

data test_a ;

merge a(in=ina) b(in=inb);

by test_var ;

if ina and inb ;

run;

View solution in original post

DBailey · Posted 03-21-2014 01:34 PM

what sorts/indexes are available?

woo · Posted 03-21-2014 04:34 PM

nop...

Reeza · Posted 03-21-2014 06:12 PM

Depends. Can you explain more? Specifically:

1. Are the sizes of the data set different?

2. How many variables per dataset?

Tom · Posted 03-21-2014 09:46 PM

Keep your data sorted (or at least indexed) and use a data step. This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).

data test_a ;

merge a(in=ina) b(in=inb);

by test_var ;

if ina and inb ;

run;

LinusH · Posted 03-22-2014 05:37 AM

Sorry Tom, but I have never come across a situation where the combination of index and BY is a good solution, it's really a performance killer...

As usual, we know too little of this situation to give any good advice.

For starters, what is the "hit-rate" in the join? If almost all records is included in the join result, there's little to do in the code. Pre-sorting the tables would improve performance, but that depends on how often the source tables are updated, and how, and how often this join (and like) will occur.

Moving the source table to an engine that supports multi-threading could one idea. That is SPDE or SPD Server if you want to keep the data in the SAS domain. That will give multi-threaded (parallel) disk reads and implicit BY sorting.

Data never sleeps

woo · Posted 03-24-2014 01:02 PM

I am sorry if i have not provided enough info. but here it is;

8 same variables for both datasets

atleast 9-10 millions of obs for each sas datasets

both are sorted with key variable let's say, test_var

For now i am using Tom's alternative - and really working good so far...but what i have in mind that PROC SQL would pulled query faster than DATA step but it seems proven wrong in this case, i am not sure if i missing anything...i have all same variables so i did rename for one dataset except one key variable, "test_var" and then merging as Tom mentioned here, it is giving me output faster than my above proc sql step...but i am preety sure we have some SQL alternative that we can use here that i am not aware of...

Thanks!

Astounding · Posted 03-24-2014 01:29 PM

Quite often, it's the joining process that takes time (whether in MERGE or in SQL). Since you don't need any variables from the second table, other than to see if a match exists, it might be faster to create a format. The limiting factor would be whether you have enough memory to load the format. For example:

data b2;

set b (keep=test_var rename=(test_var=start));

retain label 'Found a match'

fmtname '$match';

run;

proc format cntlin=b2;

run;

data want;

set test_a;

if put(test_var, $match.) = 'Found a match';

run;

If TEST_VAR is numeric rather than character, the name of the format would be MATCH instead of $MATCH. The resources required would not be very different compared to creating a hash table from b and then using the CHECK method to see if a match can be found.

Classroom Training Available!