topic Re: How to convert SQL to SAS code efficiently in SAS Enterprise Guide

How to convert SQL to SAS code efficiently

Martin_Bryant — Mon, 15 Feb 2021 17:31:48 GMT

Hi,

I'd like to convert the following SQL to SAS efficiently.

I have two datasets. smallTable (1000 rows, 1 Col), bigTable (10,000,000 rows 100 cols). I want to merge these into one dataset. The SQL would be...

create table usefulStuff as
    select smallTable.PrimaryId, bigTable.Field1, bigTable.Field2, bigTable.FieldN
    from smallTable left join bigTable on smallTable.PrimaryId = bigTable.ForeignId
    order by smallTable.PrimaryId

At the moment the tables are not sorted on the IDs. Therefore my understanding is that I need to first sort each table on the ID and rename the ID fields to the same name. Then I can do a Merge selecting only records in bigTable where there is one in smallTable. The sorting on bigTable seems very inefficient unless I can sort it using a where clause that says something like Where ForeignId in (select PrimaryId from smallTable).

The reason for doing this is that we have multiple records which in standard SQL can only be removed using group by and either max() or min(). This isn't what I want. I want to be able to select first.smallTable or last.smallTable.

Many thanks for all constructive suggestions. - Martin

Re: How to convert SQL to SAS code efficiently

PeterClemmensen — Mon, 15 Feb 2021 17:34:29 GMT

Do you want all 100 columns or just 1, 2 and N?

Re: How to convert SQL to SAS code efficiently

Martin_Bryant — Mon, 15 Feb 2021 17:37:38 GMT

I would like to know a good method to do this in multiple situations. For this specific example, let's say that I want about 10 columns.

Re: How to convert SQL to SAS code efficiently

PeterClemmensen — Mon, 15 Feb 2021 18:20:54 GMT

First, let us create some data that replicates your actual data. smallTable below has 1000 obs and 1 variable (PrimaryID) and bigTable has 101 cols: ForeignID, field1, field2, ..., field100. I assume that the ID variables are character, but it will work with numeric as well.

data smallTable(rename = id = PrimaryId keep = id       )
     bigTable  (rename = id = ForeignId keep = id field:);
   do i = 1 to 1e7;
      id = uuidgen();
      array f field1 - field100;
      do over f;
         f = ceil(rand('uniform') * 10);
      end;
      output bigTable;
      if mod(i, 1e4) = 0 then output smallTable;
   end;
run;

Next, I try to replicate your SQL code. I do not use Sorting since you mention a time heavy sort procedure. Instead I use hashing. This takes about 15 sec on my system.

Let me know if this works for you.

data want(drop = PrimaryId);

   if _N_ = 1 then do;
      dcl hash h(dataset : "smallTable", hashexp : 20);
      h.definekey("PrimaryId");
      h.definedone();
   end;

   set bigTable(keep = ForeignId Field1 Field2 Field100);
   if 0 then set smallTable;

   if h.check(key : ForeignId) = 0;

run;

Re: How to convert SQL to SAS code efficiently

Kurt_Bremser — Tue, 16 Feb 2021 09:15:25 GMT

For lookups like this, I also strongly recommend the hash object approach, as it will outperform any other method and does not need any sort of the large table.

Re: How to convert SQL to SAS code efficiently

Martin_Bryant — Tue, 16 Feb 2021 10:25:51 GMT

Thanks, that looks like something that I really want to understand fully. A question that I can't find in the docs. What does the zero in if 0 then set smallTable; refer to?

Also, having read the docs on hashing, it suggests that I load the large dataset into the hash table rather than the small one. What are your thoughts on this?

Thanks for the help. Greatly appreciated.

Re: How to convert SQL to SAS code efficiently

Kurt_Bremser — Tue, 16 Feb 2021 10:31:27 GMT

if 0 then set smallTable;

Since 0 is false, this statement is never executed, but its declarative element is recognized by the data step compiler, so it reads the dataset's metadata and prepares its columns in the PDV.

This makes it unnecessary to define the variables "manually" by using a LENGTH statement (just naming the variables in the DEFINEKEY and/or DEFINEDATA methods is not sufficient to declare them correctly).

The PDV is set up at data step compile time, but the hash object is created entirely at runtime, so the data step compiler does not have an idea what the hash methods will do.

Re: How to convert SQL to SAS code efficiently

Kurt_Bremser — Tue, 16 Feb 2021 10:39:14 GMT

And you definitely want to load the small table into the hash. Put the lookup into memory, and process the large table sequentially.

Re: How to convert SQL to SAS code efficiently

Martin_Bryant — Tue, 16 Feb 2021 11:16:01 GMT

Thanks. Found out what a PDV is and why you do the if 0 . Looks good. Will try that solution.

Re: How to convert SQL to SAS code efficiently

PeterClemmensen — Tue, 16 Feb 2021 11:44:01 GMT

Regarding the "if 0 then set" stuff: What @Kurt_Bremser said.

Regaring loading the big data into the hash object: In this case, I agree that the smart thing is to read the small data into the hash and read the big data sequentially using the Set Statement. However, there could be cases where it is the other way around.

Btw, the reverse logic looks like this. You can compare the run times yourself 🙂

data want2(drop = ForeignId rc);

   if _N_ = 1 then do;
      dcl hash h(dataset : "bigTable");
      h.definekey("ForeignId");
      h.definedata("Field1", "Field2", "Field100");
      h.definedone();
   end;

   set smallTable;
   if 0 then set bigTable(keep = ForeignId Field1 Field2 Field100);

   rc = h.find(key : PrimaryId);
run;