Hi Friends i am confused on what to use and which one would run fast..., proc sql or data step
I am doing inner join between two sas datasets on one variable which is running very long, of course because i have around 10 millions of records in each sas datasets, i am using following proc sql,
please let me know if there is any alternative...
Thanks in advance...
proc sql;
create table test_a as select * from work.a inner join work.b
on a.test_var=b.test_var;
quit;
Keep your data sorted (or at least indexed) and use a data step. This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).
data test_a ;
merge a(in=ina) b(in=inb);
by test_var ;
if ina and inb ;
run;
what sorts/indexes are available?
nop...
Depends. Can you explain more? Specifically:
1. Are the sizes of the data set different?
2. How many variables per dataset?
Keep your data sorted (or at least indexed) and use a data step. This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).
data test_a ;
merge a(in=ina) b(in=inb);
by test_var ;
if ina and inb ;
run;
Sorry Tom, but I have never come across a situation where the combination of index and BY is a good solution, it's really a performance killer...
As usual, we know too little of this situation to give any good advice.
For starters, what is the "hit-rate" in the join? If almost all records is included in the join result, there's little to do in the code. Pre-sorting the tables would improve performance, but that depends on how often the source tables are updated, and how, and how often this join (and like) will occur.
Moving the source table to an engine that supports multi-threading could one idea. That is SPDE or SPD Server if you want to keep the data in the SAS domain. That will give multi-threaded (parallel) disk reads and implicit BY sorting.
I am sorry if i have not provided enough info. but here it is;
8 same variables for both datasets
atleast 9-10 millions of obs for each sas datasets
both are sorted with key variable let's say, test_var
For now i am using Tom's alternative - and really working good so far...but what i have in mind that PROC SQL would pulled query faster than DATA step but it seems proven wrong in this case, i am not sure if i missing anything...i have all same variables so i did rename for one dataset except one key variable, "test_var" and then merging as Tom mentioned here, it is giving me output faster than my above proc sql step...but i am preety sure we have some SQL alternative that we can use here that i am not aware of...
Thanks!
Quite often, it's the joining process that takes time (whether in MERGE or in SQL). Since you don't need any variables from the second table, other than to see if a match exists, it might be faster to create a format. The limiting factor would be whether you have enough memory to load the format. For example:
data b2;
set b (keep=test_var rename=(test_var=start));
retain label 'Found a match'
fmtname '$match';
run;
proc format cntlin=b2;
run;
data want;
set test_a;
if put(test_var, $match.) = 'Found a match';
run;
If TEST_VAR is numeric rather than character, the name of the format would be MATCH instead of $MATCH. The resources required would not be very different compared to creating a hash table from b and then using the CHECK method to see if a match can be found.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.