Martin_Bryant
Quartz | Level 8

Hi,

 

I'd like to convert the following SQL to SAS efficiently.

 

I have two datasets: smallTable (1,000 rows, 1 column) and bigTable (10,000,000 rows, 100 columns). I want to merge these into one dataset. The SQL would be...

 

create table usefulStuff as
    select smallTable.PrimaryId, bigTable.Field1, bigTable.Field2, bigTable.FieldN
    from smallTable left join bigTable on smallTable.PrimaryId = bigTable.ForeignId
    order by smallTable.PrimaryId;

At the moment neither table is sorted on the IDs. My understanding is that I would first need to sort each table on its ID and rename the ID fields to a common name. Then I could do a MERGE, keeping only the records in bigTable that have a match in smallTable. Sorting bigTable seems very inefficient, unless I can restrict the sort with a WHERE clause along the lines of Where ForeignId in (select PrimaryId from smallTable).
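For reference, the sort-and-merge approach described above might be sketched like this (dataset and variable names are assumed from the question; note that a DATA step or PROC SORT WHERE clause cannot contain a subquery, so the subquery-style restriction would need a separate step):

```sas
/* Sketch only: sort both tables, align the ID names, then merge. */
proc sort data=smallTable out=small_s;
   by PrimaryId;
run;

proc sort data=bigTable out=big_s(rename=(ForeignId=PrimaryId));
   by ForeignId;
run;

data usefulStuff;
   merge small_s(in=inSmall) big_s(in=inBig);
   by PrimaryId;
   if inSmall and inBig;   /* keep only matched records */
run;
```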

 

The reason for doing this is that we have duplicate records, which in standard SQL can only be removed using GROUP BY with max() or min(). That isn't what I want. I want to be able to select first.smallTable or last.smallTable.
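As an aside, FIRST. and LAST. refer to a BY variable rather than to a dataset, so deduplicating on the ID would look something like this (a sketch, assuming input sorted by the ID from the question):

```sas
/* Keep only the first record per PrimaryId. The FIRST./LAST. */
/* flags belong to the BY variable; input must be sorted by it. */
data firstPerId;
   set usefulStuff;       /* assumed already sorted by PrimaryId */
   by PrimaryId;
   if first.PrimaryId;    /* or last.PrimaryId for the last record */
run;
```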

 

Many thanks for all constructive suggestions. - Martin

9 REPLIES
PeterClemmensen
Tourmaline | Level 20

Do you want all 100 columns or just 1, 2 and N?

Martin_Bryant
Quartz | Level 8
I would like to know a good method to do this in multiple situations. For this specific example, let's say that I want about 10 columns.
PeterClemmensen
Tourmaline | Level 20

First, let us create some data that replicates your actual data. smallTable below has 1,000 obs and 1 variable (PrimaryId), and bigTable has 10,000,000 obs and 101 variables: ForeignId, field1, field2, ..., field100. I assume that the ID variables are character, but it will work with numeric IDs as well.

data smallTable(keep = id        rename = (id = PrimaryId))
     bigTable  (keep = id field: rename = (id = ForeignId));
   array f field1 - field100;          /* implicit array, enables DO OVER */
   do i = 1 to 1e7;
      id = uuidgen();                  /* character UUID key */
      do over f;
         f = ceil(rand('uniform') * 10);
      end;
      output bigTable;
      /* every 10,000th id also goes to smallTable -> 1,000 obs */
      if mod(i, 1e4) = 0 then output smallTable;
   end;
run;

Next, I try to replicate your SQL code. I do not sort, since you mention that the sort is too time-consuming. Instead, I use a hash object lookup. This takes about 15 seconds on my system.

 

Let me know if this works for you.

 

data want(drop = PrimaryId);

   if _N_ = 1 then do;
      /* load the 1,000 small-table IDs into a hash table once */
      dcl hash h(dataset : "smallTable", hashexp : 20);
      h.definekey("PrimaryId");
      h.definedone();
   end;

   set bigTable(keep = ForeignId Field1 Field2 Field100);

   /* never executes, but lets the compiler add PrimaryId to the PDV */
   if 0 then set smallTable;

   /* subsetting IF: keep only rows whose ForeignId is in the hash */
   if h.check(key : ForeignId) = 0;

run;
Martin_Bryant
Quartz | Level 8

Thanks, that looks like something I really want to understand fully. One question I can't find in the docs: what does the zero in if 0 then set smallTable; refer to?

 

Also, having read the docs on hashing, it suggests that I load the large dataset into the hash table rather than the small one. What are your thoughts on this?

 

Thanks for the help. Greatly appreciated.

 

Kurt_Bremser
Super User
if 0 then set smallTable;

Since 0 is false, this statement is never executed at run time, but its declarative part is still recognized by the data step compiler, which reads the dataset's metadata and adds its columns to the PDV (program data vector).

This makes it unnecessary to define the variables "manually" by using a LENGTH statement (just naming the variables in the DEFINEKEY and/or DEFINEDATA methods is not sufficient to declare them correctly).

 

The PDV is set up at data step compile time, but the hash object is created entirely at runtime, so the data step compiler has no idea what the hash methods will do.
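For comparison, the manual LENGTH-statement alternative mentioned above could look like this (a sketch; the $36 length assumes uuidgen-style character IDs as in the example data, and CALL MISSING just suppresses the "uninitialized variable" note):

```sas
/* Same lookup as before, declaring PrimaryId manually instead of */
/* using "if 0 then set smallTable;".                             */
data want(drop = PrimaryId);
   length PrimaryId $ 36;     /* must match the hash dataset exactly */
   if _N_ = 1 then do;
      call missing(PrimaryId);
      dcl hash h(dataset : "smallTable", hashexp : 20);
      h.definekey("PrimaryId");
      h.definedone();
   end;
   set bigTable(keep = ForeignId Field1 Field2 Field100);
   if h.check(key : ForeignId) = 0;
run;
```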

PeterClemmensen
Tourmaline | Level 20

Regarding the "if 0 then set" stuff: What @Kurt_Bremser said.

 

Regarding loading the big data into the hash object: in this case, I agree that the smart thing is to read the small data into the hash and read the big data sequentially with the SET statement. However, there could be cases where it is the other way around.

 

Btw, the reverse logic looks like this. You can compare the run times yourself 🙂

 

data want2(drop = ForeignId rc);

   if _N_ = 1 then do;
      /* load key and lookup fields for all 10M rows into the hash */
      dcl hash h(dataset : "bigTable(keep = ForeignId Field1 Field2 Field100)");
      h.definekey("ForeignId");
      h.definedata("Field1", "Field2", "Field100");
      h.definedone();
   end;

   set smallTable;
   if 0 then set bigTable(keep = ForeignId Field1 Field2 Field100);

   rc = h.find(key : PrimaryId);

   /* variables read via SET are retained, so clear stale values */
   /* from the previous match when the lookup fails              */
   if rc ne 0 then call missing(Field1, Field2, Field100);
run;
Kurt_Bremser
Super User

For lookups like this, I also strongly recommend the hash object approach, as it will outperform any other method and does not require sorting the large table.

Martin_Bryant
Quartz | Level 8

Thanks. I found out what a PDV is and why you do the if 0. Looks good. I will try that solution.

