Hi guys,
I have been trying to join/match a table (about 10,000 rows) to a table with about 5 million rows. It has been running for hours and I'm afraid I may have done something to prevent this.
Any best practice tips?
I'm a bit confused at the final goal of the merge. Is it to find out how many patients in ClientList have MainPlatform of XOZ? There's a couple of other strategies I'd use if so:
The next example subsets the bigger dataset with an inline view that only grabs the two variables you're using and takes only the unique combinations of them prior to merging. Again I don't use big data so I don't know if memory wise this is worse or not.
PROC SQL;
CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN
(select distinct ID, MainPlatform from MainPlatform
WHERE MainPlatform = 'XOZ') as t2
ON t1.ID = t2.ID;
QUIT;
The next example would add the MainPlatform variable to your ClientList dataset and then updates it wherever the ID variable is in MainPlatform instead of merging:
PROC SQL;
ALTER TABLE merged_table
ADD MainPlatform char(3);
UPDATE merged_table
set MainPlatform='XOZ'
where ID in (select ID from MainPlatform where MainPlatform='XOZ');
QUIT;
Neither of these examples help as much if the number of XOZ rows in MainPlatform matter for a particular ID.
Here's the code:
PROC SQL;
CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN MainPlatform as t2
ON t1.ID = t2.ID
WHERE t2.MainPlatform = 'XOZ';
QUIT;
so like, T1 had only 10,000 points but T2 had around 5 million
I'm a bit confused at the final goal of the merge. Is it to find out how many patients in ClientList have MainPlatform of XOZ? There's a couple of other strategies I'd use if so:
The next example subsets the bigger dataset with an inline view that only grabs the two variables you're using and takes only the unique combinations of them prior to merging. Again I don't use big data so I don't know if memory wise this is worse or not.
PROC SQL;
CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN
(select distinct ID, MainPlatform from MainPlatform
WHERE MainPlatform = 'XOZ') as t2
ON t1.ID = t2.ID;
QUIT;
The next example would add the MainPlatform variable to your ClientList dataset and then updates it wherever the ID variable is in MainPlatform instead of merging:
PROC SQL;
ALTER TABLE merged_table
ADD MainPlatform char(3);
UPDATE merged_table
set MainPlatform='XOZ'
where ID in (select ID from MainPlatform where MainPlatform='XOZ');
QUIT;
Neither of these examples help as much if the number of XOZ rows in MainPlatform matter for a particular ID.
You're correct in the overall goal of the code block. I need to determine the subset of patients in the ClientList that XOZ MainPlatforms (which can be found in the other table).
I think those are good alternatives. Honestly, I forgot what an inline view is and may have to review it again. I've read about inline before I am thinking the first one might do the trick. The second one is entirely new for me.
I have just been learning SAS for a couple months and now is the only time I have to use it in practice. Guess I am overwhelmed at how it actually works on a larger data set as opposed to just using sashelp tables 😅
5 million records doesn't sound very large, so unless you have a very slow system, it seems that PROC SQL is not able to optimize the query at all.
And it's all about the data. Do you have any duplicates on ID in either table? If so, how many?
Inner joins is easier to optimize, so figure out if you really need a left join.
Indexing on ID <might> help (index join).
Indexing on MainPlatform might help, especially if it has high number of discrete values.
To understand how PROC SQL is doing, try add these options:
optoins msglevel=i fullstimer;
proc sql _method _tree;
You can also paly with buffersize option to try to get a hash join (since your t1 table has so few rows) and magic= (influences which join algorithm PROC SQL will use).
Whenever you need to subset a large dataset along the data in a smaller dataset, consider using a data step and a hash object:
data merged_table;
set mainplatform;
where mainplatform = " XOZ";
if _n_ = 1
then do;
length
name
location /* set the lengths as needed */
;
declare hash t1 (dataset:"clientlist");
t1.definekey("id");
t1.definedata("name","location");
t1.definedone();
call missing(name,location);
end;
if t1.find() = 0;
run;
Caution: this will only work correctly if id is unique in clientlist.
The code is not tested in any way.
Thanks Kurt! and Thanks everyone for their responses. There is a lot to learn!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.