topic Re: proc sql or data step in SAS Programming

proc sql or data step

woo — Fri, 21 Mar 2014 17:17:40 GMT

Hi Friends i am confused on what to use and which one would run fast..., proc sql or data step

I am doing inner join between two sas datasets on one variable which is running very long, of course because i have around 10 millions of records in each sas datasets, i am using following proc sql,

please let me know if there is any alternative...

Thanks in advance...

proc sql;

create table test_a as select * from work.a inner join work.b

on a.test_var=b.test_var;

quit;

Re: proc sql or data step

DBailey — Fri, 21 Mar 2014 17:34:54 GMT

what sorts/indexes are available?

Re: proc sql or data step

woo — Fri, 21 Mar 2014 20:34:13 GMT

nop...

Re: proc sql or data step

Reeza — Fri, 21 Mar 2014 22:12:15 GMT

Depends. Can you explain more? Specifically:

1. Are the sizes of the data set different?

2. How many variables per dataset?

Re: proc sql or data step

Tom — Sat, 22 Mar 2014 01:46:03 GMT

Keep your data sorted (or at least indexed) and use a data step. This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).

data test_a ;

merge a(in=ina) b(in=inb);

by test_var ;

if ina and inb ;

run;

Re: proc sql or data step

LinusH — Sat, 22 Mar 2014 09:37:42 GMT

Sorry Tom, but I have never come across a situation where the combination of index and BY is a good solution, it's really a performance killer...

As usual, we know too little of this situation to give any good advice.

For starters, what is the "hit-rate" in the join? If almost all records is included in the join result, there's little to do in the code. Pre-sorting the tables would improve performance, but that depends on how often the source tables are updated, and how, and how often this join (and like) will occur.

Moving the source table to an engine that supports multi-threading could one idea. That is SPDE or SPD Server if you want to keep the data in the SAS domain. That will give multi-threaded (parallel) disk reads and implicit BY sorting.

Re: proc sql or data step

woo — Mon, 24 Mar 2014 17:02:21 GMT

I am sorry if i have not provided enough info. but here it is;

8 same variables for both datasets

atleast 9-10 millions of obs for each sas datasets

both are sorted with key variable let's say, test_var

For now i am using Tom's alternative - and really working good so far...but what i have in mind that PROC SQL would pulled query faster than DATA step but it seems proven wrong in this case, i am not sure if i missing anything...i have all same variables so i did rename for one dataset except one key variable, "test_var" and then merging as Tom mentioned here, it is giving me output faster than my above proc sql step...but i am preety sure we have some SQL alternative that we can use here that i am not aware of...

Thanks!

Re: proc sql or data step

Astounding — Mon, 24 Mar 2014 17:29:52 GMT

Quite often, it's the joining process that takes time (whether in MERGE or in SQL). Since you don't need any variables from the second table, other than to see if a match exists, it might be faster to create a format. The limiting factor would be whether you have enough memory to load the format. For example:

data b2;

set b (keep=test_var rename=(test_var=start));

retain label 'Found a match'

fmtname '$match';

run;

proc format cntlin=b2;

run;

data want;

set test_a;

if put(test_var, $match.) = 'Found a match';

run;

If TEST_VAR is numeric rather than character, the name of the format would be MATCH instead of $MATCH. The resources required would not be very different compared to creating a hash table from b and then using the CHECK method to see if a match can be found.