Solved: Flagging SQL Join result

teelov · Posted 06-04-2020 06:46 AM

Hello,

I'm hoping that someone can help me create a flag after joining two tables together and omitting the records which were not in B (IF A AND NOT B) - Identify and flag the KEY whereby there are additional REF_NOs in dataset 1 that do not exist in dataset 2 (but the KEY does exist in both) – e.g. ref 20 and 30. IF there are KEYS in dataset 1 that aren’t in 2 then I’m not interested in those (KEY 3 in this example)

DATA WORK.DS1;
    LENGTH
        key                8
        ref_no             8
        colour           $ 6 ;
    FORMAT
        key              BEST12.
        ref_no           BEST12.
        colour           $CHAR6. ;
    INFORMAT
        key              BEST12.
        ref_no           BEST12.
        colour           $CHAR6. ;
    INFILE DATALINES4
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        key              : BEST32.
        ref_no           : BEST32.
        colour           : $CHAR6. ;
DATALINES4;
1,10,red
1,20,black
1,30,blue
1,50,red
2,70,pink
2,75,blue
3,100,black
3,110,blue
4,250,red
4,275,green
4,250,orange
4,275,purple
;;;;

DATA WORK.DS2;
    LENGTH
        key                8
        ref_no             8
        make             $ 8
        model            $ 1 ;
    FORMAT
        key              BEST12.
        ref_no           BEST12.
        make             $CHAR8.
        model            $CHAR1. ;
    INFORMAT
        key              BEST12.
        ref_no           BEST12.
        make             $CHAR8.
        model            $CHAR1. ;
    INFILE DATALINES4
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        key              : BEST32.
        ref_no           : BEST32.
        make             : $CHAR8.
        model            : $CHAR1. ;
DATALINES4;
1,10,sony, 
1,50,casio, 
2,70,sony, 
2,75,samsung, 
4,200,phillips, 
4,225,sony, 
4,250,samsung, 
4,275,phillips, 
;;;;

code for the join

proc sql ;
     create table MS_OUTPUT as
     select L.KEY,
              L.REF_NO,
              L.Colour,
              R.Make,
              R.Model
              
              from DS1 L 
           left join
           DS2 R
           on L.KEY=R.KEY
           and L.REF_NO=R.REF_NO
           where L.KEY in (select KEY from DS2)
           order by KEY, REF_NO;
quit;

WANT

PeterClemmensen · Posted 06-04-2020 08:11 AM

data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10  red     
1 20  black   
1 30  blue    
1 50  red     
2 70  pink    
2 75  blue    
3 100 black   
3 110 blue    
4 200 red     
4 225 green   
4 250 orange  
4 275 purple  
;

data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10  sony     
1 50  casio    
2 70  sony     
2 75  samsung  
4 200 phillips
4 225 sony    
4 250 samsung 
4 275 phillips
;

data want(drop=rc);
    if _N_ = 1 then do;
        declare hash h1 (dataset : 'ds2');
        h1.definekey ('key');
        h1.definedone (); 
        declare hash h2 (dataset : 'ds2', ordered : 'A');
        h2.definekey ('key', 'ref_no');
        h2.definedata (all : 'Y');
        h2.definedone ();
        if 0 then set ds2;
    end;
    do _N_ = 1 by 1 until (last.key);
        set ds1;
        by key;
        if h2.check() then flag = 1;
    end;
    do _N_ = 1 to _N_;
        set ds1;
        call missing (Make, Model);
        rc = h2.find();
        if h1.check() = 0 then output;
    end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 06-04-2020 07:10 AM

Is it a requirement to use Proc SQL?

The DATA to DATA Step Macro
Blog: SASnrd

teelov · Posted 06-04-2020 07:20 AM

No, but my thinking was these datasets are in the 10s of millions so sorting them before hand could be an overhead, but open to suggestions.

Kurt_Bremser · Posted 06-04-2020 07:29 AM

@teelov wrote:
No, but my thinking was these datasets are in the 10s of millions so sorting them before hand could be an overhead, but open to suggestions.

A join in SQL will need an implicit sort to be done anyway (unless you have indexes defined, or your libraries point to a DBMS where the sort can be done in-memory); very often a SQL join on big datasets is less performant than the sorts and a DATA step MERGE.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 06-04-2020 07:18 AM

If you provide your datasets in usable form (data steps with datalines), I'll be happy to rub my brain against this.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

teelov · Posted 06-04-2020 07:21 AM

Sure thing, i will change the post

novinosrin · Posted 06-04-2020 07:56 AM

Hi @teelov If I understand your requirement correctly, it's pretty straight forward SQL


data have1;
input KEY	REF_NO	Colour :$10.	Type	Dimensions;
cards;
1	10	red	.	.
1	20	black	.	.
1	30	blue	.	.
1	50	red	.	.
2	70	pink	.	.
2	75	blue	.	.
3	100	black	.	.
3	110	blue	.	.
4	200	red	.	.
4	225	green	.	.
4	250	orange	.	.
4	275	purple	.	.
;

data have2;
input KEY	REF_NO	Make :$10.	Model;
cards;
1	10	sony	.
1	50	casio	.
2	70	sony	.
2	75	samsung	.
4	200	phillips	.
4	225	sony	.
4	250	samsung	.
4	275	phillips	.
;

proc sql;
create table want as
select a.*,make, model,n(b.ref_no) and sum(a.ref_no ne b.ref_no) as flag
from have1 a left join have2 b
on a.key=b.key and a.ref_no=b.ref_no
group by a.key
having n(b.ref_no);
quit;

KEY	REF_NO	Colour	Type	Dimensions	Make	Model	flag
1	10	red	.	.	sony	.	1
1	20	black	.	.		.	1
1	30	blue	.	.		.	1
1	50	red	.	.	casio	.	1
2	70	pink	.	.	sony	.	0
2	75	blue	.	.	samsung	.	0
4	200	red	.	.	phillips	.	0
4	225	green	.	.	sony	.	0
4	250	orange	.	.	samsung	.	0
4	275	purple	.	.	phillips	.	0

teelov · Posted 06-04-2020 11:42 AM

Thank you for engaging with me on this. really good example to help me using SQL more

PeterClemmensen · Posted 06-04-2020 08:11 AM

data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10  red     
1 20  black   
1 30  blue    
1 50  red     
2 70  pink    
2 75  blue    
3 100 black   
3 110 blue    
4 200 red     
4 225 green   
4 250 orange  
4 275 purple  
;

data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10  sony     
1 50  casio    
2 70  sony     
2 75  samsung  
4 200 phillips
4 225 sony    
4 250 samsung 
4 275 phillips
;

data want(drop=rc);
    if _N_ = 1 then do;
        declare hash h1 (dataset : 'ds2');
        h1.definekey ('key');
        h1.definedone (); 
        declare hash h2 (dataset : 'ds2', ordered : 'A');
        h2.definekey ('key', 'ref_no');
        h2.definedata (all : 'Y');
        h2.definedone ();
        if 0 then set ds2;
    end;
    do _N_ = 1 by 1 until (last.key);
        set ds1;
        by key;
        if h2.check() then flag = 1;
    end;
    do _N_ = 1 to _N_;
        set ds1;
        call missing (Make, Model);
        rc = h2.find();
        if h1.check() = 0 then output;
    end;
run;

The DATA to DATA Step Macro
Blog: SASnrd

teelov · Posted 06-04-2020 08:18 AM

Amazing - i didnt think of using Hash tables, but thinking about it me left table will always be 4 to 5 x bigger than migh right table

Kurt_Bremser · Posted 06-04-2020 11:48 AM

@teelov wrote:
Amazing - i didnt think of using Hash tables, but thinking about it me left table will always be 4 to 5 x bigger than migh right table

Anytime you can fit a hash object into memory, it will provide the fastest way to do a join or a lookup (unless the datasets are already sorted and can be read sequentially).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

teelov · Posted 06-04-2020 11:43 AM

over 30M records this ran in under 10 seconds, thank you for your help, very easy to understand and something i can no implement in DI Studio in the Lookup transformation

Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

Re: Flagging SQL Join result

SAS Innovate 2025: Call for Content

Classroom Training Available!