Say I have 2 datasets.
Dataset 1 (all IDs and their gender)
ID Gender
1 M
2 F
3 M
4 F
Dataset 2 (IDs I'm not interested in)
ID
2
3
I want to get the info from dataset 1 with IDs not found in dataset 2, basically get
ID Gender
1 M
4 F
This can be done with a data step and MERGE (if in dataset 1 but not in dataset 2). How can it be done with a PROC SQL join? When I use an inequality condition in PROC SQL, the result simply blows up into a Cartesian product. Thanks.
A reasonable SQL approach:
select * from dataset1 where ID not in (select distinct(ID) from dataset2);
A simple left join with a WHERE condition can return only the records that are in dataset1 and not in dataset2.
proc sql;
select a.* from dataset1 a
left join dataset2 b on a.id=b.id
where b.id is null
;
quit;
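For illustration only (this is a SAS thread, but the logic is language-independent), here is a hypothetical Python sketch of both strategies above on the sample data: the NOT IN subquery becomes a set-membership test, and the left join with a null filter becomes a dictionary lookup that keeps rows with no match.

```python
# Sample data from the question.
dataset1 = [(1, "M"), (2, "F"), (3, "M"), (4, "F")]
dataset2 = [2, 3]

# Strategy 1: NOT IN subquery -> set-membership test.
exclude = set(dataset2)
not_in = [(i, g) for i, g in dataset1 if i not in exclude]

# Strategy 2: LEFT JOIN ... WHERE b.id IS NULL -> keep rows
# whose key finds no match on the right-hand side.
right = {i: True for i in dataset2}
left_join = [(i, g) for i, g in dataset1 if right.get(i) is None]

print(not_in)     # [(1, 'M'), (4, 'F')]
print(left_join)  # [(1, 'M'), (4, 'F')]
```

Both produce the rows for IDs 1 and 4, matching the wanted output in the question.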
@apolitical In my opinion, joins are not ideal for this task. You are better off with a subquery, as @Astounding demonstrated. A lookup is far more convenient than "joining and then filtering".
@novinosrin in() clauses are much slower than joins. That doesn't matter for small tables, but it does as the volume increases.
Data steps are much faster if the table is sorted.
data BASE LOOKUP;
do I=1 to 1e7;
output BASE;
if mod(I,10)=0 then output LOOKUP;
end;
run;
proc sql; * 8.4 s;
create table FILTERED as
select *
from BASE
where I not in (select I from LOOKUP);
quit;
proc sql; * 3.3 s;
create table FILTERED as
select *
from BASE
left join
LOOKUP
on BASE.I=LOOKUP.I
where LOOKUP.I is null;
quit;
data FILTERED; *1.7 s;
merge BASE LOOKUP(in=B);
by I;
if not B;
run;
@ChrisNZ Thank you, I learned something very interesting. Your MERGE even beats a hash lookup in SAS on my college lab PC.
1138 data BASE LOOKUP;
1139 do I=1 to 1e7;
1140 output BASE;
1141 if mod(I,10)=0 then output LOOKUP;
1142 end;
1143 run;
NOTE: The data set WORK.BASE has 10000000 observations and 1 variables.
NOTE: The data set WORK.LOOKUP has 1000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.51 seconds
cpu time 0.51 seconds
1144
1145 data want;
1146 if _n_=1 then do;
1147 dcl hash h(dataset:'lookup');
1148 h.definekey ('I') ;
1149 h.definedone () ;
1150 end;
1151 set base;
1152 if h.check() ne 0;
1153 run;
NOTE: There were 1000000 observations read from the data set WORK.LOOKUP.
NOTE: There were 10000000 observations read from the data set WORK.BASE.
NOTE: The data set WORK.WANT has 9000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 2.45 seconds
cpu time 2.45 seconds
@novinosrin Yes indeed, MERGE BY is extremely efficient.
It is only (largely) beaten by MERGE without BY, when both tables are known to have exactly the same record keys.
That's one fast machine you have, where the disks have to wait for the CPU!
Yes, and that's great. But wouldn't the presort requirement kill the MERGE charm a bit?
@novinosrin Yes and no. 🙂
1. Sometimes you don't have to sort. Data steps usually create a sorted table.
2. If you want a sorted output, then there is no loss.
3. If SQL chooses to join by sorting, then there is no loss and you gain a sorted output.
MERGE is less advantageous when the input is not sorted, you don't want a sorted output for further processing, and SQL is faster (perhaps because it joins with a hash).
That's a downside of SQL: it creates non-sorted data from sorted inputs.
Sorted tables are usually much more efficient to process, and I like to keep them that way if I can. That's one of the advantages of a DWH/BI database like SAS when compared to transactional databases. It can keep the data sorted for smarter or more efficient processing.
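To illustrate why sorted inputs pay off (a hypothetical Python sketch, not SAS): an anti-join over two sorted streams needs only one forward pass and constant memory, which is essentially what MERGE ... BY does.

```python
def sorted_anti_join(base, lookup):
    """One-pass anti-join over two sorted key lists, mimicking
    MERGE ... BY: keep base keys that are absent from lookup."""
    out = []
    j = 0
    for key in base:
        # Advance the lookup pointer; it never moves backwards,
        # so the whole join is a single O(n + m) pass.
        while j < len(lookup) and lookup[j] < key:
            j += 1
        if j == len(lookup) or lookup[j] != key:
            out.append(key)
    return out

base = list(range(1, 21))
lookup = [k for k in base if k % 10 == 0]  # same shape as the benchmark
print(sorted_anti_join(base, lookup))  # 1..20 without 10 and 20
```

No hash table is built and nothing is re-sorted, which is why the sorted MERGE wins the benchmark above once the tables are large.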
Depending on the size of the lookup table and the shape of the base table, modifying in place can be even faster than MERGE.
* MODIFY KEY= needs an index on BASE, created once beforehand;
proc datasets lib=work nolist;
modify BASE;
index create I;
quit;

data BASE;
set LOOKUP;
modify BASE key=I;
if _IORC_=%sysrc(_sok) then remove;
else _ERROR_=0;
run;
For the very narrow table here, MERGE is faster, but if you make the BASE table wider or if you decrease the size of the LOOKUP table, MODIFY KEY= becomes much faster than MERGE.
An index must exist on table BASE.
If we're going to imagine that the data sets are sorted, SET will usually far outperform MERGE:
data want;
set lookup base (in=keepme);
by id;
if keepme and first.id;
run;
This particular version requires no duplicate IDs within BASE.
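A hypothetical Python sketch of the interleave logic above: merge the two sorted streams with LOOKUP records sorting first within each key, then keep BASE records that are first for their ID. An ID present in LOOKUP can never be first, so it is dropped.

```python
from heapq import merge

lookup = [2, 3]
base = [(1, "M"), (2, "F"), (3, "M"), (4, "F")]

# Tag each record with a source flag (0 = lookup, 1 = base) so that
# lookup records sort first within a key, mimicking SET lookup base; BY id;.
stream = merge(((i, 0, None) for i in lookup),
               ((i, 1, g) for i, g in base))

want, prev_id = [], None
for i, src, g in stream:
    first_id = (i != prev_id)   # like FIRST.id in the data step
    if src == 1 and first_id:   # "if keepme and first.id"
        want.append((i, g))
    prev_id = i

print(want)  # [(1, 'M'), (4, 'F')]
```

As in the data step, the trick relies on both streams being sorted and on BASE having no duplicate IDs.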
That's not my experience. SET BY is usually much slower than MERGE BY.
With this test, about the same, within a margin of error.
With a fuller test, there is a definitive difference.
data _null_; * 23s;
set BASE BASE BASE BASE BASE BASE BASE BASE;
by I;
run;
data _null_; * 8s;
merge BASE BASE BASE BASE BASE BASE BASE BASE;
by I;
run;
Those results are so different from my past experience that I had to run my own test. I used only two data sets, but increased the number of records to 1B and 100M. Also using _NULL_ as the output, I found less than 1% difference between SET and MERGE. Color me Astounded.