proc sql or data step

Accepted Solution Solved
Reply
Super Contributor
Super Contributor
Posts: 358
Accepted Solution

proc sql or data step

Hi Friends i am confused on what to use and which one would run fast..., proc sql or data step

I am doing inner join between two sas datasets on one variable which is running very long, of course because i have around 10 millions of records in each sas datasets, i am using following proc sql,

please let me know if there is any alternative...

Thanks in advance...

proc sql;

create table test_a as select * from work.a inner join work.b

on a.test_var=b.test_var;

quit;


Accepted Solutions
Solution
‎03-21-2014 09:46 PM
Super User
Super User
Posts: 7,042

Re: proc sql or data step

Keep your data sorted (or at least indexed) and use a data step.  This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).

data test_a ;

  merge a(in=ina) b(in=inb);

  by test_var ;

  if ina and inb ;

run;

View solution in original post


All Replies
Super Contributor
Posts: 578

Re: proc sql or data step

what sorts/indexes are available?

Super Contributor
Super Contributor
Posts: 358

Re: proc sql or data step

nop...

Super User
Posts: 19,791

Re: proc sql or data step

Depends. Can you explain more? Specifically:

1. Are the sizes of the data set different?

2. How many variables per dataset?

Solution
‎03-21-2014 09:46 PM
Super User
Super User
Posts: 7,042

Re: proc sql or data step

Keep your data sorted (or at least indexed) and use a data step.  This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).

data test_a ;

  merge a(in=ina) b(in=inb);

  by test_var ;

  if ina and inb ;

run;

Super User
Posts: 5,429

Re: proc sql or data step

Sorry Tom, but I have never come across a situation where the combination of index and BY is a good solution, it's really a performance killer...

As usual, we know too little of this situation to give any good advice.

For starters, what is the "hit-rate" in the join? If almost all records is included in the join result, there's little to do in the code. Pre-sorting the tables would improve performance, but that depends on how often the source tables are updated, and how, and how often this join (and like) will occur.

Moving the source table to an engine that supports multi-threading could one idea. That is SPDE or SPD Server if you want to keep the data in the SAS domain. That will give multi-threaded (parallel) disk reads and implicit BY sorting.

Data never sleeps
Super Contributor
Super Contributor
Posts: 358

Re: proc sql or data step

I am sorry if i have not provided enough info. but here it is;

8 same variables for both datasets

atleast 9-10 millions of obs for each sas datasets

both are sorted with key variable let's say, test_var

For now i am using Tom's alternative - and really working good so far...but what i have in mind that PROC SQL would pulled query faster than DATA step but it seems proven wrong in this case, i am not sure if i missing anything...i have all same variables so i did rename for one dataset except one key variable, "test_var" and then merging as Tom mentioned here, it is giving me output faster than my above proc sql step...but i am preety sure we have some SQL alternative that we can use here that i am not aware of...

Thanks!

Super User
Posts: 5,504

Re: proc sql or data step

Quite often, it's the joining process that takes time (whether in MERGE or in SQL).  Since you don't need any variables from the second table, other than to see if a match exists, it might be faster to create a format.  The limiting factor would be whether you have enough memory to load the format.  For example:

data b2;

   set b (keep=test_var rename=(test_var=start));

   retain label 'Found a match'

              fmtname '$match';

run;

proc format cntlin=b2;

run;

data want;

   set test_a;

   if put(test_var, $match.) = 'Found a match';

run;

If TEST_VAR is numeric rather than character, the name of the format would be MATCH instead of $MATCH.  The resources required would not be very different compared to creating a hash table from b and then using the CHECK method to see if a match can be found.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 348 views
  • 3 likes
  • 6 in conversation