BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
woo
Lapis Lazuli | Level 10 woo
Lapis Lazuli | Level 10

Hi Friends i am confused on what to use and which one would run fast..., proc sql or data step

I am doing inner join between two sas datasets on one variable which is running very long, of course because i have around 10 millions of records in each sas datasets, i am using following proc sql,

please let me know if there is any alternative...

Thanks in advance...

proc sql;

create table test_a as select * from work.a inner join work.b

on a.test_var=b.test_var;

quit;

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Keep your data sorted (or at least indexed) and use a data step.  This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).

data test_a ;

  merge a(in=ina) b(in=inb);

  by test_var ;

  if ina and inb ;

run;

View solution in original post

7 REPLIES 7
DBailey
Lapis Lazuli | Level 10

what sorts/indexes are available?

woo
Lapis Lazuli | Level 10 woo
Lapis Lazuli | Level 10

nop...

Reeza
Super User

Depends. Can you explain more? Specifically:

1. Are the sizes of the data set different?

2. How many variables per dataset?

Tom
Super User Tom
Super User

Keep your data sorted (or at least indexed) and use a data step.  This will run in linear time to the size of datasets. (that is 10million obs should take twice as long as 5 million obs).

data test_a ;

  merge a(in=ina) b(in=inb);

  by test_var ;

  if ina and inb ;

run;

LinusH
Tourmaline | Level 20

Sorry Tom, but I have never come across a situation where the combination of index and BY is a good solution, it's really a performance killer...

As usual, we know too little of this situation to give any good advice.

For starters, what is the "hit-rate" in the join? If almost all records is included in the join result, there's little to do in the code. Pre-sorting the tables would improve performance, but that depends on how often the source tables are updated, and how, and how often this join (and like) will occur.

Moving the source table to an engine that supports multi-threading could one idea. That is SPDE or SPD Server if you want to keep the data in the SAS domain. That will give multi-threaded (parallel) disk reads and implicit BY sorting.

Data never sleeps
woo
Lapis Lazuli | Level 10 woo
Lapis Lazuli | Level 10

I am sorry if i have not provided enough info. but here it is;

8 same variables for both datasets

atleast 9-10 millions of obs for each sas datasets

both are sorted with key variable let's say, test_var

For now i am using Tom's alternative - and really working good so far...but what i have in mind that PROC SQL would pulled query faster than DATA step but it seems proven wrong in this case, i am not sure if i missing anything...i have all same variables so i did rename for one dataset except one key variable, "test_var" and then merging as Tom mentioned here, it is giving me output faster than my above proc sql step...but i am preety sure we have some SQL alternative that we can use here that i am not aware of...

Thanks!

Astounding
PROC Star

Quite often, it's the joining process that takes time (whether in MERGE or in SQL).  Since you don't need any variables from the second table, other than to see if a match exists, it might be faster to create a format.  The limiting factor would be whether you have enough memory to load the format.  For example:

data b2;

   set b (keep=test_var rename=(test_var=start));

   retain label 'Found a match'

              fmtname '$match';

run;

proc format cntlin=b2;

run;

data want;

   set test_a;

   if put(test_var, $match.) = 'Found a match';

run;

If TEST_VAR is numeric rather than character, the name of the format would be MATCH instead of $MATCH.  The resources required would not be very different compared to creating a hash table from b and then using the CHECK method to see if a match can be found.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 738 views
  • 3 likes
  • 6 in conversation