Hi,
I'd like to convert the following SQL to SAS efficiently.
I have two datasets. smallTable (1000 rows, 1 Col), bigTable (10,000,000 rows 100 cols). I want to merge these into one dataset. The SQL would be...
create table usefulStuff as select smallTable.PrimaryId, bigTable.Field1, bigTable.Field2, bigTable.FieldN from smallTable left join bigTable on smallTable.PrimaryId = bigTable.ForeignId
order by smallTable.PrimaryId
At the moment the tables are not sorted on the IDs. Therefore my understanding is that I need to first sort each table on the ID and rename the ID fields to the same name. Then I can do a Merge selecting only records in bigTable where there is one in smallTable. The sorting on bigTable seems very inefficient unless I can sort it using a where clause that says something like Where ForeignId in (select PrimaryId from smallTable).
The reason for doing this is that we have multiple records which in standard SQL can only be removed using group by and either max() or min(). This isn't what I want. I want to be able to select first.smallTable or last.smallTable.
Many thanks for all constructive suggestions. - Martin
Do you want all 100 columns or just 1, 2 and N?
First, let us create some data that replicates your actual data. smallTable below has 1000 obs and 1 variable (PrimaryID) and bigTable has 101 cols: ForeignID, field1, field2, ..., field100. I assume that the ID variables are character, but it will work with numeric as well.
data smallTable(rename = id = PrimaryId keep = id )
bigTable (rename = id = ForeignId keep = id field:);
do i = 1 to 1e7;
id = uuidgen();
array f field1 - field100;
do over f;
f = ceil(rand('uniform') * 10);
end;
output bigTable;
if mod(i, 1e4) = 0 then output smallTable;
end;
run;
Next, I try to replicate your SQL code. I do not use Sorting since you mention a time heavy sort procedure. Instead I use hashing. This takes about 15 sec on my system.
Let me know if this works for you.
data want(drop = PrimaryId);
if _N_ = 1 then do;
dcl hash h(dataset : "smallTable", hashexp : 20);
h.definekey("PrimaryId");
h.definedone();
end;
set bigTable(keep = ForeignId Field1 Field2 Field100);
if 0 then set smallTable;
if h.check(key : ForeignId) = 0;
run;
Thanks, that looks like something that I really want to understand fully. A question that I can't find in the docs. What does the zero in if 0 then set smallTable; refer to?
Also, having read the docs on hashing, it suggests that I load the large dataset into the hash table rather than the small one. What are your thoughts on this?
Thanks for the help. Greatly appreciated.
if 0 then set smallTable;
Since 0 is false, this statement is never executed, but its declarative element is recognized by the data step compiler, so it reads the dataset's metadata and prepares its columns in the PDV.
This makes it unnecessary to define the variables "manually" by using a LENGTH statement (just naming the variables in the DEFINEKEY and/or DEFINEDATA methods is not sufficient to declare them correctly).
The PDV is set up at data step compile time, but the hash object is created entirely at runtime, so the data step compiler does not have an idea what the hash methods will do.
And you definitely want to load the small table into the hash. Put the lookup into memory, and process the large table sequentially.
Regarding the "if 0 then set" stuff: What @Kurt_Bremser said.
Regaring loading the big data into the hash object: In this case, I agree that the smart thing is to read the small data into the hash and read the big data sequentially using the Set Statement. However, there could be cases where it is the other way around.
Btw, the reverse logic looks like this. You can compare the run times yourself 🙂
data want2(drop = ForeignId rc);
if _N_ = 1 then do;
dcl hash h(dataset : "bigTable");
h.definekey("ForeignId");
h.definedata("Field1", "Field2", "Field100");
h.definedone();
end;
set smallTable;
if 0 then set bigTable(keep = ForeignId Field1 Field2 Field100);
rc = h.find(key : PrimaryId);
run;
For lookups like this, I also strongly recommend the hash object approach, as it will outperform any other method and does not need any sort of the large table.
Thanks. Found out what a PDV is and why you do the if 0 . Looks good. Will try that solution.
Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.
Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.
Check out this tutorial series to learn how to build your own steps in SAS Studio.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.