Solved: Re: Vertical SQL combine

Batman · Posted 08-12-2019 04:06 PM

I'm learning about combining datasets vertically using SQL. I'm trying to understand why the output includes a row where X=2 but no row where X=3. Anyone have any idea?

data one;

input X 1. A $1.;

datalines;

1a

1b

2c

3v

4e

6g

;

data two;

input X 1. B $1.;

datalines;

1x

2y

3z

3v

5w

;

run;

proc sql;

select *

from one

except

select *

from two;

Tom · Posted 08-13-2019 12:18 PM

SQL is a set operation language. In general a "SET" of values does not include duplicate values. So it makes no sense to have two exact duplicates be the result of such a set operation.

View solution in original post

novinosrin · Posted 08-12-2019 04:29 PM

Hello @Batman

You are basically asking the SQL processor to exclude records that match two and only output the nonmatches from one

Also, since you didn't use ALL i.e except all, you are telling SQL to do a double pass to exclude duplicate records in the nonmatches from one.

HTH

In essence, this is what you are doing

Set Difference
Difference between sets is denoted by ‘A – B’, is the set containing elements of set A but not in B. i.e all elements of A except the element of B.

Above is the Venn Diagram of A-B.

ballardw · Posted 08-12-2019 07:47 PM

Since your only common variable between the data sets is X that is the only variable considered for the comparison. The variable A will not be compared to B for determining inclusion/exclusion with the Except operation.

~~Name B to A in the set Two and see what happens.~~

BTW, it is not a good idea to start your datalines with a blank unless you really truly actually want to have a row of missing values.

PGStats · Posted 08-13-2019 12:04 AM

Contrary to what is stated above (sorry @ballardw), SQL set operations without the CORR keyword operate on columns by positions, not by name. So the comparison involves column A against column B. The names are taken from the first table involved. So 2c is kept because it is different from 2y but 3v is removed because it is present in table two.

Try EXCEPT CORR and see what happens.

PG

ballardw · Posted 08-13-2019 10:52 AM

@PGStats wrote:

Contrary to what is stated above (sorry @ballardw), SQL set operations without the CORR keyword operate on columns by positions, not by name. So the comparison involves column A against column B. The names are taken from the first table involved. So 2c is kept because it is different from 2y but 3v is removed because it is present in table two.

Try EXCEPT CORR and see what happens.

I sit corrected. Thank you.

Tom · Posted 08-13-2019 11:18 AM

Because '2c' is in one and not in two, but the only value with X=3 in one is '3v' and that value is in two so it is deleted by EXCEPT.

Batman · Posted 08-13-2019 12:12 PM

Ok, that makes sense, why is the additional value of "1a" also dropped.

Tom · Posted 08-13-2019 12:18 PM

SQL is a set operation language. In general a "SET" of values does not include duplicate values. So it makes no sense to have two exact duplicates be the result of such a set operation.

Registration is open