Solved: Re: SAS left join via proc sql

rapt1 · Posted 07-01-2021 10:16 AM

Hi guys,

I have been trying to join/match a table (about 10,000 rows) to a table with about 5 million rows. It has been running for hours and I'm afraid I may have done something to prevent this.

Any best practice tips?

JeffMeyers · Posted 07-01-2021 10:58 AM

I'm a bit confused at the final goal of the merge. Is it to find out how many patients in ClientList have MainPlatform of XOZ? There's a couple of other strategies I'd use if so:

The next example subsets the bigger dataset with an inline view that only grabs the two variables you're using and takes only the unique combinations of them prior to merging. Again I don't use big data so I don't know if memory wise this is worse or not.

PROC SQL;
CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN 
  (select distinct ID, MainPlatform from MainPlatform 
        WHERE MainPlatform = 'XOZ') as t2
ON t1.ID = t2.ID;
QUIT;

The next example would add the MainPlatform variable to your ClientList dataset and then updates it wherever the ID variable is in MainPlatform instead of merging:

PROC SQL;
ALTER TABLE merged_table 
    ADD MainPlatform char(3);
UPDATE merged_table
    set MainPlatform='XOZ'
    where ID in (select ID from MainPlatform where MainPlatform='XOZ');
QUIT;

Neither of these examples help as much if the number of XOZ rows in MainPlatform matter for a particular ID.

View solution in original post

JeffMeyers · Posted 07-01-2021 10:24 AM

I don't deal with big data, but I would think if you're doing a big join you could take a smaller sample (1000pts?) Of the big dataset to try it on if you're not confident.
What is the join code you used? Did you join on enough variables?

rapt1 · Posted 07-01-2021 10:30 AM

Here's the code:

PROC SQL;

CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN MainPlatform as t2
ON t1.ID = t2.ID
WHERE t2.MainPlatform = 'XOZ';
QUIT;

so like, T1 had only 10,000 points but T2 had around 5 million

JeffMeyers · Posted 07-01-2021 10:58 AM

I'm a bit confused at the final goal of the merge. Is it to find out how many patients in ClientList have MainPlatform of XOZ? There's a couple of other strategies I'd use if so:

The next example subsets the bigger dataset with an inline view that only grabs the two variables you're using and takes only the unique combinations of them prior to merging. Again I don't use big data so I don't know if memory wise this is worse or not.

PROC SQL;
CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN 
  (select distinct ID, MainPlatform from MainPlatform 
        WHERE MainPlatform = 'XOZ') as t2
ON t1.ID = t2.ID;
QUIT;

The next example would add the MainPlatform variable to your ClientList dataset and then updates it wherever the ID variable is in MainPlatform instead of merging:

PROC SQL;
ALTER TABLE merged_table 
    ADD MainPlatform char(3);
UPDATE merged_table
    set MainPlatform='XOZ'
    where ID in (select ID from MainPlatform where MainPlatform='XOZ');
QUIT;

Neither of these examples help as much if the number of XOZ rows in MainPlatform matter for a particular ID.

rapt1 · Posted 07-01-2021 11:31 AM

You're correct in the overall goal of the code block. I need to determine the subset of patients in the ClientList that XOZ MainPlatforms (which can be found in the other table).

I think those are good alternatives. Honestly, I forgot what an inline view is and may have to review it again. I've read about inline before I am thinking the first one might do the trick. The second one is entirely new for me.

I have just been learning SAS for a couple months and now is the only time I have to use it in practice. Guess I am overwhelmed at how it actually works on a larger data set as opposed to just using sashelp tables 😅

LinusH · Posted 07-02-2021 04:39 AM

5 million records doesn't sound very large, so unless you have a very slow system, it seems that PROC SQL is not able to optimize the query at all.

And it's all about the data. Do you have any duplicates on ID in either table? If so, how many?

Inner joins is easier to optimize, so figure out if you really need a left join.

Indexing on ID <might> help (index join).

Indexing on MainPlatform might help, especially if it has high number of discrete values.

To understand how PROC SQL is doing, try add these options:

optoins msglevel=i fullstimer;
proc sql _method _tree;

You can also paly with buffersize option to try to get a hash join (since your t1 table has so few rows) and magic= (influences which join algorithm PROC SQL will use).

Data never sleeps

rapt1 · Posted 07-05-2021 08:15 AM

Thank you for this alternative. Will try to read into these additional options further.

Kurt_Bremser · Posted 07-05-2021 08:30 AM

Whenever you need to subset a large dataset along the data in a smaller dataset, consider using a data step and a hash object:

data merged_table;
set mainplatform;
where mainplatform = " XOZ";
if _n_ = 1
then do;
  length
    name
    location /* set the lengths as needed */
  ;
  declare hash t1 (dataset:"clientlist");
  t1.definekey("id");
  t1.definedata("name","location");
  t1.definedone();
  call missing(name,location);
end;
if t1.find() = 0;
run;

Caution: this will only work correctly if id is unique in clientlist.

The code is not tested in any way.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

rapt1 · Posted 07-05-2021 08:39 AM

Thanks Kurt! and Thanks everyone for their responses. There is a lot to learn!

SAS Innovate 2025: Call for Content