BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
apolitical
Obsidian | Level 7

Say I have 2 datasets.

Dataset 1 (all IDs and their gender)

ID     Gender

1       M

2       F

3       M

4       F

 

Dataset 2 (IDs I'm not interested in)

ID

2

3

 

I want to get the info from dataset 1 with IDs not found in dataset 2, basically get

ID     Gender

1       M

4       F

 

This can be done with data step and merge (if in dataset 1 but not in dataset 2). How can this be done with proc sql join? When I set an inequality statement using proc sql, the resulted data simply blows up into a cartesian product. Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

A reasonable SQL approach:

 

select * from dataset1 where ID not in (select distinct(ID) from dataset2);

View solution in original post

15 REPLIES 15
Astounding
PROC Star

A reasonable SQL approach:

 

select * from dataset1 where ID not in (select distinct(ID) from dataset2);

apolitical
Obsidian | Level 7
Works great, thanks!
SuryaKiran
Meteorite | Level 14

Simply a left join with where condition can result only the records that are in dataset1.

 

proc sql;

select a.* from dataset1 a 

left join dataset2 b on a.id=b.id

where b.id is null

;

quit;

Thanks,
Suryakiran
apolitical
Obsidian | Level 7
This one also works fine. Thanks!
novinosrin
Tourmaline | Level 20

@apolitical In my opinion Joins are not ideal for this  task. Rather you are better off dealing with subquery as @Astounding demonstrated. A look up is far convenient than "joining and then filtering"

ChrisNZ
Tourmaline | Level 20

@novinosrin in() clauses are much slower than joins. It doesn't matter for small tables but does when the volume increases.

 

Data steps are much faster if the table is sorted.

data BASE LOOKUP;
  do I=1 to 1e7;
    output BASE;
    if mod(I,10)=0 then output LOOKUP;
  end;
run;
   
proc sql;       * 8.4 s;
  create table FILTERED as 
  select * 
  from BASE 
  where I not in (select I from LOOKUP);
quit;

proc sql;        * 3.3 s;
  create table FILTERED as 
  select * 
  from BASE
         left join 
       LOOKUP
         on BASE.I=LOOKUP.I
  where LOOKUP.I is null;
quit;

data FILTERED;   *1.7 s;
  merge BASE LOOKUP(in=B);
  by I;
  if not B;
run;

 

 

 

novinosrin
Tourmaline | Level 20

@ChrisNZ  Thank you. I learned something very interesting. Your merge even beats hash in my college lab pc sas.

 

1138 data BASE LOOKUP;
1139 do I=1 to 1e7;
1140 output BASE;
1141 if mod(I,10)=0 then output LOOKUP;
1142 end;
1143 run;

NOTE: The data set WORK.BASE has 10000000 observations and 1 variables.
NOTE: The data set WORK.LOOKUP has 1000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.51 seconds
cpu time 0.51 seconds


1144
1145 data want;
1146 if _n_=1 then do;
1147 dcl hash h(dataset:'lookup');
1148 h.definekey ('I') ;
1149 h.definedone () ;
1150 end;
1151 set base;
1152 if h.check() ne 0;
1153 run;

NOTE: There were 1000000 observations read from the data set WORK.LOOKUP.
NOTE: There were 10000000 observations read from the data set WORK.BASE.
NOTE: The data set WORK.WANT has 9000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 2.45 seconds
cpu time 2.45 seconds

ChrisNZ
Tourmaline | Level 20

@novinosrin Yes indeed, MERGE BY is extremely efficient.

 

It is only (largely) beaten by MERGE without BY, when both tables are known to have exactly the same record keys.

 

That's one fast machine you have, where the disks have to wait for the CPU!

 

 

novinosrin
Tourmaline | Level 20

Yes and great. But wouldn't the  requirement of presort kill the "merge" charm a bit?

 

ChrisNZ
Tourmaline | Level 20

@novinosrin Yes and no. 🙂

1. Sometimes you don't have to sort. Data steps usually create a sorted table.

2. If you want a sorted output, then there is no loss. 

3. If SQL chooses to join by sorting, then there is no loss and you gain a sorted output.

 

MERGE is less advantageous if the input is not sorted and if you don't want a sorted output for further processing and if SQL is faster (maybe because it joins with a hash).

 

That's a downside of SQL: it creates non-sorted data from sorted inputs.

Sorted tables are usually much more efficient to process, and I like to keep them that way if I can. That's one of the advantages of a DWH/BI database like SAS when compared to transactional databases. It can keep the data sorted for smarter or more efficient processing.

ChrisNZ
Tourmaline | Level 20

Depending on the size of the lookup table and the shape of the base table, modifying in place can be even faster than MERGE.

data BASE; 
  set LOOKUP; 
  modify BASE key=I;  
  if _IORC_= %sysrc(_sok) then remove; 
  else _ERROR_=0;
run;

For the very narrow table here, MERGE is faster, but if you make the BASE table wider or if you decrease the size of the LOOKUP table, MODIFY KEY= becomes much faster than MERGE. 

An index must exist on table BASE.

 

 

 

 

Astounding
PROC Star

If we're going to imagine that the data sets are sorted, SET will usually far outperform MERGE:

 

data want;

set lookup base (in=keepme);

by id;

if keepme and first.id;

run;

 

This particular version requires no duplicate IDs within BASE.

ChrisNZ
Tourmaline | Level 20

@Astounding

That's not my experience. SET BY is usually much slower than MERGE BY.

With this test, about the same, within a margin of error.

With a fuller test, there is a definitive difference.

data _null_; * 23s;
  set BASE BASE BASE BASE BASE BASE BASE BASE;
  by I;
run;

data _null_; * 8s;
  merge BASE BASE BASE BASE BASE BASE BASE BASE;
  by I;
run;

 

 

 

 

 

Astounding
PROC Star

@ChrisNZ

Those results are so different than my past experience, I had to run my own test.  I used only two data sets, but increased the number of records to 1B and 100M.  Also using _NULL_ as the output, I found less than 1% difference between SET and MERGE.  Color me Astounded.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 2691 views
  • 4 likes
  • 5 in conversation