Say I have 2 datasets.
Dataset 1 (all IDs and their gender)
ID Gender
1 M
2 F
3 M
4 F
Dataset 2 (IDs I'm not interested in)
ID
2
3
I want to get the info from dataset 1 with IDs not found in dataset 2, basically get
ID Gender
1 M
4 F
This can be done with a data step and MERGE (if in dataset 1 but not in dataset 2). How can it be done with a PROC SQL join? When I use an inequality condition in PROC SQL, the result simply blows up into a Cartesian product. Thanks.
A reasonable SQL approach:
select * from dataset1 where ID not in (select distinct(ID) from dataset2);
A simple left join with a WHERE condition can return only the records that are in dataset1 and not in dataset2.
proc sql;
select a.* from dataset1 a
left join dataset2 b on a.id=b.id
where b.id is null
;
quit;
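For illustration only (this is a SAS thread, but the logic is language-independent), here is a hypothetical Python sketch of both strategies above on the sample data: the NOT IN subquery becomes a set-membership test, and the left join with a null filter becomes a dictionary lookup that keeps rows with no match.

```python
# Sample data from the question.
dataset1 = [(1, "M"), (2, "F"), (3, "M"), (4, "F")]
dataset2 = [2, 3]

# Strategy 1: NOT IN subquery -> set-membership test.
exclude = set(dataset2)
not_in = [(i, g) for i, g in dataset1 if i not in exclude]

# Strategy 2: LEFT JOIN ... WHERE b.id IS NULL -> keep rows
# whose key finds no match on the right-hand side.
right = {i: True for i in dataset2}
left_join = [(i, g) for i, g in dataset1 if right.get(i) is None]

print(not_in)     # [(1, 'M'), (4, 'F')]
print(left_join)  # [(1, 'M'), (4, 'F')]
```

Both produce the rows for IDs 1 and 4, matching the wanted output in the question.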
@apolitical In my opinion, joins are not ideal for this task. You are better off with a subquery, as @Astounding demonstrated. A lookup is far more convenient than "joining and then filtering".
@novinosrin in() clauses are much slower than joins. That doesn't matter for small tables, but it does as the volume increases.
Data steps are much faster if the table is sorted.
data BASE LOOKUP;
do I=1 to 1e7;
output BASE;
if mod(I,10)=0 then output LOOKUP;
end;
run;
proc sql; * 8.4 s;
create table FILTERED as
select *
from BASE
where I not in (select I from LOOKUP);
quit;
proc sql; * 3.3 s;
create table FILTERED as
select *
from BASE
left join
LOOKUP
on BASE.I=LOOKUP.I
where LOOKUP.I is null;
quit;
data FILTERED; *1.7 s;
merge BASE LOOKUP(in=B);
by I;
if not B;
run;
@ChrisNZ Thank you, I learned something very interesting. Your MERGE even beats a hash lookup in SAS on my college lab PC.
1138 data BASE LOOKUP;
1139 do I=1 to 1e7;
1140 output BASE;
1141 if mod(I,10)=0 then output LOOKUP;
1142 end;
1143 run;
NOTE: The data set WORK.BASE has 10000000 observations and 1 variables.
NOTE: The data set WORK.LOOKUP has 1000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.51 seconds
cpu time 0.51 seconds
1144
1145 data want;
1146 if _n_=1 then do;
1147 dcl hash h(dataset:'lookup');
1148 h.definekey ('I') ;
1149 h.definedone () ;
1150 end;
1151 set base;
1152 if h.check() ne 0;
1153 run;
NOTE: There were 1000000 observations read from the data set WORK.LOOKUP.
NOTE: There were 10000000 observations read from the data set WORK.BASE.
NOTE: The data set WORK.WANT has 9000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 2.45 seconds
cpu time 2.45 seconds
@novinosrin Yes indeed, MERGE BY is extremely efficient.
It is only (largely) beaten by MERGE without BY, when both tables are known to have exactly the same record keys.
That's one fast machine you have, where the disks have to wait for the CPU!
Yes, and that's great. But wouldn't the presort requirement kill the MERGE charm a bit?
@novinosrin Yes and no. 🙂
1. Sometimes you don't have to sort. Data steps usually create a sorted table.
2. If you want a sorted output, then there is no loss.
3. If SQL chooses to join by sorting, then there is no loss and you gain a sorted output.
MERGE is less advantageous when the input is not sorted, you don't want a sorted output for further processing, and SQL is faster (perhaps because it joins with a hash).
That's a downside of SQL: it creates non-sorted data from sorted inputs.
Sorted tables are usually much more efficient to process, and I like to keep them that way if I can. That's one of the advantages of a DWH/BI database like SAS when compared to transactional databases. It can keep the data sorted for smarter or more efficient processing.
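To illustrate why sorted inputs pay off (a hypothetical Python sketch, not SAS): an anti-join over two sorted streams needs only one forward pass and constant memory, which is essentially what MERGE ... BY does.

```python
def sorted_anti_join(base, lookup):
    """One-pass anti-join over two sorted key lists, mimicking
    MERGE ... BY: keep base keys that are absent from lookup."""
    out = []
    j = 0
    for key in base:
        # Advance the lookup pointer; it never moves backwards,
        # so the whole join is a single O(n + m) pass.
        while j < len(lookup) and lookup[j] < key:
            j += 1
        if j == len(lookup) or lookup[j] != key:
            out.append(key)
    return out

base = list(range(1, 21))
lookup = [k for k in base if k % 10 == 0]  # same shape as the benchmark
print(sorted_anti_join(base, lookup))  # 1..20 without 10 and 20
```

No hash table is built and nothing is re-sorted, which is why the sorted MERGE wins the benchmark above once the tables are large.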
Depending on the size of the lookup table and the shape of the base table, modifying in place can be even faster than MERGE.
* MODIFY KEY= needs an index on BASE, created once beforehand;
proc datasets lib=work nolist;
modify BASE;
index create I;
quit;

data BASE;
set LOOKUP;
modify BASE key=I;
if _IORC_=%sysrc(_sok) then remove;
else _ERROR_=0;
run;
For the very narrow table here, MERGE is faster, but if you make the BASE table wider or if you decrease the size of the LOOKUP table, MODIFY KEY= becomes much faster than MERGE.
An index must exist on table BASE.
If we're going to imagine that the data sets are sorted, SET will usually far outperform MERGE:
data want;
set lookup base (in=keepme);
by id;
if keepme and first.id;
run;
This particular version requires no duplicate IDs within BASE.
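A hypothetical Python sketch of the interleave logic above: merge the two sorted streams with LOOKUP records sorting first within each key, then keep BASE records that are first for their ID. An ID present in LOOKUP can never be first, so it is dropped.

```python
from heapq import merge

lookup = [2, 3]
base = [(1, "M"), (2, "F"), (3, "M"), (4, "F")]

# Tag each record with a source flag (0 = lookup, 1 = base) so that
# lookup records sort first within a key, mimicking SET lookup base; BY id;.
stream = merge(((i, 0, None) for i in lookup),
               ((i, 1, g) for i, g in base))

want, prev_id = [], None
for i, src, g in stream:
    first_id = (i != prev_id)   # like FIRST.id in the data step
    if src == 1 and first_id:   # "if keepme and first.id"
        want.append((i, g))
    prev_id = i

print(want)  # [(1, 'M'), (4, 'F')]
```

As in the data step, the trick relies on both streams being sorted and on BASE having no duplicate IDs.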
That's not my experience. SET BY is usually much slower than MERGE BY.
With this test, about the same, within a margin of error.
With a fuller test, there is a definitive difference.
data _null_; * 23s;
set BASE BASE BASE BASE BASE BASE BASE BASE;
by I;
run;
data _null_; * 8s;
merge BASE BASE BASE BASE BASE BASE BASE BASE;
by I;
run;
Those results are so different from my past experience that I had to run my own test. I used only two data sets, but increased the number of records to 1B and 100M. Also using _NULL_ as the output, I found less than 1% difference between SET and MERGE. Color me Astounded.