SAS left join via proc sql

rapt1 — Thu, 01 Jul 2021 14:16:07 GMT

Hi guys,

I have been trying to join/match a table (about 10,000 rows) to a table with about 5 million rows. It has been running for hours, and I'm afraid I may have missed something that would have prevented this.

Any best practice tips?

Re: SAS left join via proc sql

JeffMeyers — Thu, 01 Jul 2021 14:24:21 GMT

I don't deal with big data, but if you're doing a big join and you're not confident in the code, I would think you could try it first on a smaller sample (1,000 rows?) of the big dataset.
What is the join code you used? Did you join on enough variables?
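
A trial run on a sample could look something like the sketch below. It is untested, it borrows the table and column names from the query posted in the next reply, and merged_test is only an illustrative output name. The OBS= dataset option limits the read to the first 1,000 rows of the large table, so match counts will not be representative; the point is only to check that the join logic runs and finishes quickly.

```sas
proc sql;
  /* Trial run: read only the first 1,000 rows of the large table */
  create table merged_test as
  select t1.ID,
         t1.Name,
         t1.Location,
         t2.MainPlatform
  from ClientList as t1
  left join MainPlatform (obs=1000) as t2
    on t1.ID = t2.ID;
quit;
```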

Re: SAS left join via proc sql

rapt1 — Thu, 01 Jul 2021 14:33:34 GMT

Here's the code:

PROC SQL;

CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN MainPlatform as t2
ON t1.ID = t2.ID
WHERE t2.MainPlatform = 'XOZ';
QUIT;

So t1 has only about 10,000 rows, but t2 has around 5 million.

Re: SAS left join via proc sql

JeffMeyers — Thu, 01 Jul 2021 14:58:26 GMT

I'm a bit confused about the final goal of the merge. Is it to find out how many patients in ClientList have a MainPlatform of XOZ? If so, there are a couple of other strategies I'd use:

The first example subsets the bigger dataset with an inline view that grabs only the two variables you're using and keeps only their unique combinations before merging. Again, I don't use big data, so I don't know whether this is better or worse memory-wise.

PROC SQL;
CREATE TABLE merged_table AS
SELECT
t1.ID,
t1.Name,
t1.Location,
t2.MainPlatform
FROM ClientList as t1
LEFT JOIN 
  (select distinct ID, MainPlatform from MainPlatform 
        WHERE MainPlatform = 'XOZ') as t2
ON t1.ID = t2.ID;
QUIT;

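The next example copies ClientList into merged_table, adds the MainPlatform variable to it, and then sets that variable wherever the ID is found in MainPlatform, instead of merging:

PROC SQL;
/* Start from a copy of ClientList so that merged_table exists before it is altered */
CREATE TABLE merged_table AS
    SELECT ID, Name, Location FROM ClientList;
ALTER TABLE merged_table
    ADD MainPlatform char(3);
UPDATE merged_table
    set MainPlatform='XOZ'
    where ID in (select ID from MainPlatform where MainPlatform='XOZ');
QUIT;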

Neither of these examples helps as much if the number of XOZ rows in MainPlatform for a particular ID matters.
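
If the per-ID count does matter, one possible variation on the inline view is to aggregate rather than use DISTINCT. This is a minimal, untested sketch; merged_counts and n_xoz are illustrative names that do not appear in the original posts.

```sas
proc sql;
  create table merged_counts as
  select t1.ID,
         t1.Name,
         t1.Location,
         t2.n_xoz
  from ClientList as t1
  left join
    /* One row per ID, carrying the number of XOZ rows for that ID */
    (select ID, count(*) as n_xoz
       from MainPlatform
       where MainPlatform = 'XOZ'
       group by ID) as t2
    on t1.ID = t2.ID;
quit;
```

ClientList rows with no XOZ rows in MainPlatform would end up with a missing n_xoz.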

Re: SAS left join via proc sql

rapt1 — Thu, 01 Jul 2021 15:31:16 GMT

You're correct about the overall goal of the code block. I need to determine the subset of patients in ClientList that have XOZ as their MainPlatform (which can be found in the other table).

I think those are good alternatives. Honestly, I forgot what an inline view is and may have to review it again. I've read about inline views before, and I think the first one might do the trick. The second one is entirely new to me.

I have only been learning SAS for a couple of months, and this is the first time I've had to use it in practice. I guess I am overwhelmed by how it actually works on a larger dataset, as opposed to just using sashelp tables 😅

Re: SAS left join via proc sql

LinusH — Fri, 02 Jul 2021 08:39:53 GMT

5 million records doesn't sound very large, so unless you have a very slow system, it seems that PROC SQL is not able to optimize the query at all.

And it's all about the data. Do you have any duplicates on ID in either table? If so, how many?

Inner joins are easier to optimize, so figure out if you really need a left join.
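
For the query as posted, the WHERE clause on t2.MainPlatform already discards every ClientList row without a match (t2.MainPlatform is missing for those rows, so it cannot equal 'XOZ'). The same rows should therefore come back from an inner join. An untested sketch using the names from the original post:

```sas
proc sql;
  create table merged_table as
  select t1.ID,
         t1.Name,
         t1.Location,
         t2.MainPlatform
  from ClientList as t1
  inner join MainPlatform as t2
    on t1.ID = t2.ID
  where t2.MainPlatform = 'XOZ';
quit;
```

If the unmatched ClientList rows are actually wanted, the platform filter would have to move out of the WHERE clause, for example into the ON clause or into an inline view.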

Indexing on ID <might> help (index join).

Indexing on MainPlatform might help, especially if it has a high number of distinct values.
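
Creating a simple index on the join key could look like the sketch below. It is untested and assumes you have write access to the MainPlatform table; whether PROC SQL then actually uses the index depends on the data.

```sas
proc sql;
  /* A simple index must have the same name as the column it indexes */
  create index ID on MainPlatform (ID);
quit;
```

With msglevel=i set, the log should report when an index is selected for a query.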

To understand what PROC SQL is doing, try adding these options:

options msglevel=i fullstimer;
proc sql _method _tree;

You can also play with the buffersize= option to try to get a hash join (since your t1 table has so few rows) and with magic= (which influences which join algorithm PROC SQL will use).

Re: SAS left join via proc sql

rapt1 — Mon, 05 Jul 2021 12:15:34 GMT

Thank you for this alternative. Will try to read into these additional options further.

Re: SAS left join via proc sql

Kurt_Bremser — Mon, 05 Jul 2021 12:30:18 GMT

Whenever you need to subset a large dataset along the data in a smaller dataset, consider using a data step and a hash object:

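data merged_table;
set mainplatform;
where mainplatform = "XOZ";
if _n_ = 1
then do;
  /* placeholder lengths, assuming character variables: set them to match clientlist */
  length
    name $ 50
    location $ 50
  ;
  /* load the small table into memory once, keyed by id */
  declare hash t1 (dataset:"clientlist");
  t1.definekey("id");
  t1.definedata("name","location");
  t1.definedone();
  call missing(name,location);
end;
/* keep only rows whose id exists in clientlist; find() also fills name and location */
if t1.find() = 0;
run;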

Caution: this will only work correctly if id is unique in clientlist.

The code is not tested in any way.
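
The uniqueness caveat above can be checked before running the data step. A minimal, untested sketch, where n_rows is just an illustrative column name:

```sas
proc sql;
  /* List every id that occurs more than once in clientlist */
  select id, count(*) as n_rows
  from clientlist
  group by id
  having count(*) > 1;
quit;
```

If this query returns no rows, id is unique in clientlist.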

Re: SAS left join via proc sql

rapt1 — Mon, 05 Jul 2021 12:39:51 GMT

Thanks, Kurt! And thanks to everyone for the responses. There is a lot to learn!