BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
rapt1
Obsidian | Level 7

Hi guys,

 

I have been trying to join/match a table (about 10,000 rows) to a table with about 5 million rows. It has been running for hours and I'm afraid I may have done something to prevent this.

 

Any best practice tips?

1 ACCEPTED SOLUTION

Accepted Solutions
JeffMeyers
Barite | Level 11

I'm a bit confused at the final goal of the merge.  Is it to find out how many patients in ClientList have MainPlatform of XOZ?  There's a couple of other strategies I'd use if so:

 

The next example subsets the bigger dataset with an inline view that only grabs the two variables you're using and takes only the unique combinations of them prior to merging.  Again I don't use big data so I don't know if memory wise this is worse or not.

PROC SQL;
CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN 
  (select distinct ID, MainPlatform from MainPlatform 
        WHERE MainPlatform = 'XOZ') as t2
ON t1.ID = t2.ID;
QUIT;

The next example would add the MainPlatform variable to your ClientList dataset and then updates it wherever the ID variable is in MainPlatform instead of merging:

PROC SQL;
ALTER TABLE merged_table 
    ADD MainPlatform char(3);
UPDATE merged_table
    set MainPlatform='XOZ'
    where ID in (select ID from MainPlatform where MainPlatform='XOZ');
QUIT;

Neither of these examples help as much if the number of XOZ rows in MainPlatform matter for a particular ID.

 

View solution in original post

8 REPLIES 8
JeffMeyers
Barite | Level 11
I don't deal with big data, but I would think if you're doing a big join you could take a smaller sample (1000pts?) Of the big dataset to try it on if you're not confident.
What is the join code you used? Did you join on enough variables?
rapt1
Obsidian | Level 7

Here's the code:

PROC SQL;

CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN MainPlatform as t2
ON t1.ID = t2.ID
WHERE t2.MainPlatform = 'XOZ';
QUIT;

so like, T1 had only 10,000 points but T2 had around 5 million

JeffMeyers
Barite | Level 11

I'm a bit confused at the final goal of the merge.  Is it to find out how many patients in ClientList have MainPlatform of XOZ?  There's a couple of other strategies I'd use if so:

 

The next example subsets the bigger dataset with an inline view that only grabs the two variables you're using and takes only the unique combinations of them prior to merging.  Again I don't use big data so I don't know if memory wise this is worse or not.

PROC SQL;
CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN 
  (select distinct ID, MainPlatform from MainPlatform 
        WHERE MainPlatform = 'XOZ') as t2
ON t1.ID = t2.ID;
QUIT;

The next example would add the MainPlatform variable to your ClientList dataset and then updates it wherever the ID variable is in MainPlatform instead of merging:

PROC SQL;
ALTER TABLE merged_table 
    ADD MainPlatform char(3);
UPDATE merged_table
    set MainPlatform='XOZ'
    where ID in (select ID from MainPlatform where MainPlatform='XOZ');
QUIT;

Neither of these examples help as much if the number of XOZ rows in MainPlatform matter for a particular ID.

 

rapt1
Obsidian | Level 7

You're correct in the overall goal of the code block. I need to determine the subset of patients in the ClientList that XOZ MainPlatforms (which can be found in the other table).

 

I think those are good alternatives. Honestly, I forgot what an inline view is and may have to review it again. I've read about inline before I am thinking the first one might do the trick. The second one is entirely new for me.

 

I have just been learning SAS for a couple months and now is the only time I have to use it in practice. Guess I am overwhelmed at how it actually works on a larger data set as opposed to just using sashelp tables 😅

LinusH
Tourmaline | Level 20

5 million records doesn't sound very large, so unless you have a very slow system, it seems that PROC SQL is not able to optimize the query at all.

And it's all about the data. Do you have any duplicates on ID in either table? If so, how many?

Inner joins is easier to optimize, so figure out if you really need a left join.

Indexing on ID <might> help (index join).

Indexing on MainPlatform might help, especially if it has high number of discrete values.

To understand how PROC SQL is doing, try add these options:

optoins msglevel=i fullstimer;
proc sql _method _tree;

You can also paly with buffersize option to try to get a hash join (since your t1 table has so few rows) and magic= (influences which join algorithm PROC SQL will use).

Data never sleeps
rapt1
Obsidian | Level 7
Thank you for this alternative. Will try to read into these additional options further.
Kurt_Bremser
Super User

Whenever you need to subset a large dataset along the data in a smaller dataset, consider using a data step and a hash object:

data merged_table;
set mainplatform;
where mainplatform = " XOZ";
if _n_ = 1
then do;
  length
    name
    location /* set the lengths as needed */
  ;
  declare hash t1 (dataset:"clientlist");
  t1.definekey("id");
  t1.definedata("name","location");
  t1.definedone();
  call missing(name,location);
end;
if t1.find() = 0;
run;

Caution: this will only work correctly if id is unique in clientlist.

The code is not tested in any way.

rapt1
Obsidian | Level 7

Thanks Kurt! and Thanks everyone for their responses. There is a lot to learn!

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 10356 views
  • 4 likes
  • 4 in conversation