Re: How to convert SQL to SAS code efficiently

Martin_Bryant · Posted 02-15-2021 12:31 PM

Hi,

I'd like to convert the following SQL to SAS efficiently.

I have two datasets. smallTable (1000 rows, 1 Col), bigTable (10,000,000 rows 100 cols). I want to merge these into one dataset. The SQL would be...

create table usefulStuff as
    select smallTable.PrimaryId, bigTable.Field1, bigTable.Field2, bigTable.FieldN
    from smallTable left join bigTable on smallTable.PrimaryId = bigTable.ForeignId
    order by smallTable.PrimaryId

At the moment the tables are not sorted on the IDs. Therefore my understanding is that I need to first sort each table on the ID and rename the ID fields to the same name. Then I can do a Merge selecting only records in bigTable where there is one in smallTable. The sorting on bigTable seems very inefficient unless I can sort it using a where clause that says something like Where ForeignId in (select PrimaryId from smallTable).

The reason for doing this is that we have multiple records which in standard SQL can only be removed using group by and either max() or min(). This isn't what I want. I want to be able to select first.smallTable or last.smallTable.

Many thanks for all constructive suggestions. - Martin

PeterClemmensen · Posted 02-15-2021 12:34 PM

Do you want all 100 columns or just 1, 2 and N?

The DATA to DATA Step Macro
Blog: SASnrd

Martin_Bryant · Posted 02-15-2021 12:37 PM

I would like to know a good method to do this in multiple situations. For this specific example, let's say that I want about 10 columns.

PeterClemmensen · Posted 02-15-2021 01:19 PM

First, let us create some data that replicates your actual data. smallTable below has 1000 obs and 1 variable (PrimaryID) and bigTable has 101 cols: ForeignID, field1, field2, ..., field100. I assume that the ID variables are character, but it will work with numeric as well.

data smallTable(rename = id = PrimaryId keep = id       )
     bigTable  (rename = id = ForeignId keep = id field:);
   do i = 1 to 1e7;
      id = uuidgen();
      array f field1 - field100;
      do over f;
         f = ceil(rand('uniform') * 10);
      end;
      output bigTable;
      if mod(i, 1e4) = 0 then output smallTable;
   end;
run;

Next, I try to replicate your SQL code. I do not use Sorting since you mention a time heavy sort procedure. Instead I use hashing. This takes about 15 sec on my system.

Let me know if this works for you.

data want(drop = PrimaryId);

   if _N_ = 1 then do;
      dcl hash h(dataset : "smallTable", hashexp : 20);
      h.definekey("PrimaryId");
      h.definedone();
   end;

   set bigTable(keep = ForeignId Field1 Field2 Field100);
   if 0 then set smallTable;

   if h.check(key : ForeignId) = 0;

run;

The DATA to DATA Step Macro
Blog: SASnrd

Martin_Bryant · Posted 02-16-2021 05:22 AM

Thanks, that looks like something that I really want to understand fully. A question that I can't find in the docs. What does the zero in if 0 then set smallTable; refer to?

Also, having read the docs on hashing, it suggests that I load the large dataset into the hash table rather than the small one. What are your thoughts on this?

Thanks for the help. Greatly appreciated.

Kurt_Bremser · Posted 02-16-2021 05:31 AM

if 0 then set smallTable;

Since 0 is false, this statement is never executed, but its declarative element is recognized by the data step compiler, so it reads the dataset's metadata and prepares its columns in the PDV.

This makes it unnecessary to define the variables "manually" by using a LENGTH statement (just naming the variables in the DEFINEKEY and/or DEFINEDATA methods is not sufficient to declare them correctly).

The PDV is set up at data step compile time, but the hash object is created entirely at runtime, so the data step compiler does not have an idea what the hash methods will do.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 02-16-2021 05:39 AM

And you definitely want to load the small table into the hash. Put the lookup into memory, and process the large table sequentially.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

PeterClemmensen · Posted 02-16-2021 06:44 AM

Regarding the "if 0 then set" stuff: What @Kurt_Bremser said.

Regaring loading the big data into the hash object: In this case, I agree that the smart thing is to read the small data into the hash and read the big data sequentially using the Set Statement. However, there could be cases where it is the other way around.

Btw, the reverse logic looks like this. You can compare the run times yourself 🙂

data want2(drop = ForeignId rc);

   if _N_ = 1 then do;
      dcl hash h(dataset : "bigTable");
      h.definekey("ForeignId");
      h.definedata("Field1", "Field2", "Field100");
      h.definedone();
   end;

   set smallTable;
   if 0 then set bigTable(keep = ForeignId Field1 Field2 Field100);

   rc = h.find(key : PrimaryId);
run;

The DATA to DATA Step Macro
Blog: SASnrd

Kurt_Bremser · Posted 02-16-2021 04:15 AM

For lookups like this, I also strongly recommend the hash object approach, as it will outperform any other method and does not need any sort of the large table.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Martin_Bryant · Posted 02-16-2021 06:16 AM

Thanks. Found out what a PDV is and why you do the if 0 . Looks good. Will try that solution.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away