DATA Step, Macro, Functions and more

Proc Sql Join with Inequality Condition

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 92
Accepted Solution

Proc Sql Join with Inequality Condition

Say I have 2 datasets.

Dataset 1 (all IDs and their gender)

ID     Gender

1       M

2       F

3       M

4       F

 

Dataset 2 (IDs I'm not interested in)

ID

2

3

 

I want to get the info from dataset 1 with IDs not found in dataset 2, basically get

ID     Gender

1       M

4       F

 

This can be done with data step and merge (if in dataset 1 but not in dataset 2). How can this be done with proc sql join? When I set an inequality statement using proc sql, the resulted data simply blows up into a cartesian product. Thanks.


Accepted Solutions
Solution
4 weeks ago
Super User
Posts: 6,644

Re: Proc Sql Join with Inequality Condition

Posted in reply to apolitical

A reasonable SQL approach:

 

select * from dataset1 where ID not in (select distinct(ID) from dataset2);

View solution in original post


All Replies
Solution
4 weeks ago
Super User
Posts: 6,644

Re: Proc Sql Join with Inequality Condition

Posted in reply to apolitical

A reasonable SQL approach:

 

select * from dataset1 where ID not in (select distinct(ID) from dataset2);

Frequent Contributor
Posts: 92

Re: Proc Sql Join with Inequality Condition

Posted in reply to Astounding
Works great, thanks!
Valued Guide
Posts: 560

Re: Proc Sql Join with Inequality Condition

Posted in reply to apolitical

Simply a left join with where condition can result only the records that are in dataset1.

 

proc sql;

select a.* from dataset1 a 

left join dataset2 b on a.id=b.id

where b.id is null

;

quit;

Thanks,
Suryakiran
Frequent Contributor
Posts: 92

Re: Proc Sql Join with Inequality Condition

Posted in reply to SuryaKiran
This one also works fine. Thanks!
PROC Star
Posts: 1,604

Re: Proc Sql Join with Inequality Condition

Posted in reply to apolitical

@apolitical In my opinion Joins are not ideal for this  task. Rather you are better off dealing with subquery as @Astounding demonstrated. A look up is far convenient than "joining and then filtering"

PROC Star
Posts: 2,319

Re: Proc Sql Join with Inequality Condition

Posted in reply to novinosrin

@novinosrin in() clauses are much slower than joins. It doesn't matter for small tables for does when the volume increases.

 

Data steps are much faster if the table is sorted.

data BASE LOOKUP;
  do I=1 to 1e7;
    output BASE;
    if mod(I,10)=0 then output LOOKUP;
  end;
run;
   
proc sql;       * 8.4 s;
  create table FILTERED as 
  select * 
  from BASE 
  where I not in (select I from LOOKUP);
quit;

proc sql;        * 3.3 s;
  create table FILTERED as 
  select * 
  from BASE
         left join 
       LOOKUP
         on BASE.I=LOOKUP.I
  where LOOKUP.I is null;
quit;

data FILTERED;   *1.7 s;
  merge BASE LOOKUP(in=B);
  by I;
  if not B;
run;

 

 

 

PROC Star
Posts: 1,604

Re: Proc Sql Join with Inequality Condition

@ChrisNZ  Thank you. I learned something very interesting. Your merge even beats hash in my college lab pc sas.

 

1138 data BASE LOOKUP;
1139 do I=1 to 1e7;
1140 output BASE;
1141 if mod(I,10)=0 then output LOOKUP;
1142 end;
1143 run;

NOTE: The data set WORK.BASE has 10000000 observations and 1 variables.
NOTE: The data set WORK.LOOKUP has 1000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.51 seconds
cpu time 0.51 seconds


1144
1145 data want;
1146 if _n_=1 then do;
1147 dcl hash h(dataset:'lookup');
1148 h.definekey ('I') ;
1149 h.definedone () ;
1150 end;
1151 set base;
1152 if h.check() ne 0;
1153 run;

NOTE: There were 1000000 observations read from the data set WORK.LOOKUP.
NOTE: There were 10000000 observations read from the data set WORK.BASE.
NOTE: The data set WORK.WANT has 9000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 2.45 seconds
cpu time 2.45 seconds

PROC Star
Posts: 2,319

Re: Proc Sql Join with Inequality Condition

Posted in reply to novinosrin

@novinosrin Yes indeed, MERGE BY is extremely efficient.

 

It is only (largely) beaten by MERGE without BY, when both tables are known to have exactly the same record keys.

 

That's one fast machine you have, where the disks have to wait for the CPU!

 

 

PROC Star
Posts: 1,604

Re: Proc Sql Join with Inequality Condition

Yes and great. But wouldn't the  requirement of presort kill the "merge" charm a bit?

 

PROC Star
Posts: 2,319

Re: Proc Sql Join with Inequality Condition

Posted in reply to novinosrin

@novinosrin Yes and no. Smiley Happy

1. Sometimes you don't have to sort. Data steps usually create a sorted table.

2. If you want a sorted output, then there is no loss. 

3. If SQL chooses to join by sorting, then there is no loss and you gain a sorted output.

 

MERGE is less advantageous if the input is not sorted and if you don't want a sorted output for further processing and if SQL is faster (maybe because it joins with a hash).

 

That's a downside of SQL: it creates non-sorted data from sorted inputs.

Sorted tables are usually much more efficient to process, and I like to keep them that way if I can. That's one of the advantages of a DWH/BI database like SAS when compared to transactional databases. It can keep the data sorted for smarter or more efficient processing.

PROC Star
Posts: 2,319

Re: Proc Sql Join with Inequality Condition

Depending on the size of the lookup table and the shape of the base table, modifying in place can be even faster than MERGE.

data BASE; 
  set LOOKUP; 
  modify BASE key=I;  
  if _IORC_= %sysrc(_sok) then remove; 
  else _ERROR_=0;
run;

For the very narrow table here, MERGE is faster, but if you make the BASE table wider or if you decrease the size of the LOOKUP table, MODIFY KEY= becomes much faster than MERGE. 

An index must exist on table BASE.

 

 

 

 

Super User
Posts: 6,644

Re: Proc Sql Join with Inequality Condition

If we're going to imagine that the data sets are sorted, SET will usually far outperform MERGE:

 

data want;

set lookup base (in=keepme);

by id;

if keepme and first.id;

run;

 

This particular version requires no duplicate IDs within BASE.

PROC Star
Posts: 2,319

Re: Proc Sql Join with Inequality Condition

Posted in reply to Astounding

@Astounding

That's not my experience. SET BY is usually much slower than MERGE BY.

With this test, about the same, within a margin of error.

With a fuller test, there is a definitive difference.

data _null_; * 23s;
  set BASE BASE BASE BASE BASE BASE BASE BASE;
  by I;
run;

data _null_; * 8s;
  merge BASE BASE BASE BASE BASE BASE BASE BASE;
  by I;
run;

 

 

 

 

 

Super User
Posts: 6,644

Re: Proc Sql Join with Inequality Condition

@ChrisNZ

Those results are so different than my past experience, I had to run my own test.  I used only two data sets, but increased the number of records to 1B and 100M.  Also using _NULL_ as the output, I found less than 1% difference between SET and MERGE.  Color me Astounded.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 191 views
  • 3 likes
  • 5 in conversation