BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
teelov
Quartz | Level 8

Hello,

I'm hoping that someone can help me create a flag after joining two tables together and omitting the records which were not in B (IF A AND NOT B) - Identify and flag the KEY whereby there are additional REF_NOs in dataset 1 that do not exist in dataset 2 (but the KEY does exist in both) – e.g. ref 20 and 30. IF there are KEYS in dataset 1 that aren’t in 2 then I’m not interested in those (KEY 3 in this example)

 

DATA WORK.DS1;
    LENGTH
        key                8
        ref_no             8
        colour           $ 6 ;
    FORMAT
        key              BEST12.
        ref_no           BEST12.
        colour           $CHAR6. ;
    INFORMAT
        key              BEST12.
        ref_no           BEST12.
        colour           $CHAR6. ;
    INFILE DATALINES4
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        key              : BEST32.
        ref_no           : BEST32.
        colour           : $CHAR6. ;
DATALINES4;
1,10,red
1,20,black
1,30,blue
1,50,red
2,70,pink
2,75,blue
3,100,black
3,110,blue
4,250,red
4,275,green
4,250,orange
4,275,purple
;;;;
DATA WORK.DS2;
    LENGTH
        key                8
        ref_no             8
        make             $ 8
        model            $ 1 ;
    FORMAT
        key              BEST12.
        ref_no           BEST12.
        make             $CHAR8.
        model            $CHAR1. ;
    INFORMAT
        key              BEST12.
        ref_no           BEST12.
        make             $CHAR8.
        model            $CHAR1. ;
    INFILE DATALINES4
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        key              : BEST32.
        ref_no           : BEST32.
        make             : $CHAR8.
        model            : $CHAR1. ;
DATALINES4;
1,10,sony, 
1,50,casio, 
2,70,sony, 
2,75,samsung, 
4,200,phillips, 
4,225,sony, 
4,250,samsung, 
4,275,phillips, 
;;;;

code for the join

 

proc sql ;
     create table MS_OUTPUT as
     select L.KEY,
              L.REF_NO,
              L.Colour,
              R.Make,
              R.Model
              
              from DS1 L 
           left join
           DS2 R
           on L.KEY=R.KEY
           and L.REF_NO=R.REF_NO
           where L.KEY in (select KEY from DS2)
           order by KEY, REF_NO;
quit;

 

WANT

teelov_0-1591271211937.png

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20
data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10  red     
1 20  black   
1 30  blue    
1 50  red     
2 70  pink    
2 75  blue    
3 100 black   
3 110 blue    
4 200 red     
4 225 green   
4 250 orange  
4 275 purple  
;

data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10  sony     
1 50  casio    
2 70  sony     
2 75  samsung  
4 200 phillips
4 225 sony    
4 250 samsung 
4 275 phillips
;

data want(drop=rc);
    if _N_ = 1 then do;
        declare hash h1 (dataset : 'ds2');
        h1.definekey ('key');
        h1.definedone (); 
        declare hash h2 (dataset : 'ds2', ordered : 'A');
        h2.definekey ('key', 'ref_no');
        h2.definedata (all : 'Y');
        h2.definedone ();
        if 0 then set ds2;
    end;
    do _N_ = 1 by 1 until (last.key);
        set ds1;
        by key;
        if h2.check() then flag = 1;
    end;
    do _N_ = 1 to _N_;
        set ds1;
        call missing (Make, Model);
        rc = h2.find();
        if h1.check() = 0 then output;
    end;
run;

View solution in original post

11 REPLIES 11
PeterClemmensen
Tourmaline | Level 20

Is it a requirement to use Proc SQL?

teelov
Quartz | Level 8
No, but my thinking was these datasets are in the 10s of millions so sorting them before hand could be an overhead, but open to suggestions.
Kurt_Bremser
Super User

@teelov wrote:
No, but my thinking was these datasets are in the 10s of millions so sorting them before hand could be an overhead, but open to suggestions.

A join in SQL will need an implicit sort to be done anyway (unless you have indexes defined, or your libraries point to a DBMS where the sort can be done in-memory); very often a SQL join on big datasets is less performant than the sorts and a DATA step MERGE.

teelov
Quartz | Level 8
Sure thing, i will change the post
novinosrin
Tourmaline | Level 20

Hi @teelov  If I understand your requirement correctly, it's pretty straight forward SQL

 


data have1;
input KEY	REF_NO	Colour :$10.	Type	Dimensions;
cards;
1	10	red	.	.
1	20	black	.	.
1	30	blue	.	.
1	50	red	.	.
2	70	pink	.	.
2	75	blue	.	.
3	100	black	.	.
3	110	blue	.	.
4	200	red	.	.
4	225	green	.	.
4	250	orange	.	.
4	275	purple	.	.
;

data have2;
input KEY	REF_NO	Make :$10.	Model;
cards;
1	10	sony	.
1	50	casio	.
2	70	sony	.
2	75	samsung	.
4	200	phillips	.
4	225	sony	.
4	250	samsung	.
4	275	phillips	.
;

proc sql;
create table want as
select a.*,make, model,n(b.ref_no) and sum(a.ref_no ne b.ref_no) as flag
from have1 a left join have2 b
on a.key=b.key and a.ref_no=b.ref_no
group by a.key
having n(b.ref_no);
quit;
KEY REF_NO Colour Type Dimensions Make Model flag
1 10 red . . sony . 1
1 20 black . .   . 1
1 30 blue . .   . 1
1 50 red . . casio . 1
2 70 pink . . sony . 0
2 75 blue . . samsung . 0
4 200 red . . phillips . 0
4 225 green . . sony . 0
4 250 orange . . samsung . 0
4 275 purple . . phillips . 0
teelov
Quartz | Level 8

Thank you for engaging with me on this. really good example to help me using SQL more

PeterClemmensen
Tourmaline | Level 20
data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10  red     
1 20  black   
1 30  blue    
1 50  red     
2 70  pink    
2 75  blue    
3 100 black   
3 110 blue    
4 200 red     
4 225 green   
4 250 orange  
4 275 purple  
;

data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10  sony     
1 50  casio    
2 70  sony     
2 75  samsung  
4 200 phillips
4 225 sony    
4 250 samsung 
4 275 phillips
;

data want(drop=rc);
    if _N_ = 1 then do;
        declare hash h1 (dataset : 'ds2');
        h1.definekey ('key');
        h1.definedone (); 
        declare hash h2 (dataset : 'ds2', ordered : 'A');
        h2.definekey ('key', 'ref_no');
        h2.definedata (all : 'Y');
        h2.definedone ();
        if 0 then set ds2;
    end;
    do _N_ = 1 by 1 until (last.key);
        set ds1;
        by key;
        if h2.check() then flag = 1;
    end;
    do _N_ = 1 to _N_;
        set ds1;
        call missing (Make, Model);
        rc = h2.find();
        if h1.check() = 0 then output;
    end;
run;
teelov
Quartz | Level 8
Amazing - i didnt think of using Hash tables, but thinking about it me left table will always be 4 to 5 x bigger than migh right table
Kurt_Bremser
Super User

@teelov wrote:
Amazing - i didnt think of using Hash tables, but thinking about it me left table will always be 4 to 5 x bigger than migh right table

Anytime you can fit a hash object into memory, it will provide the fastest way to do a join or a lookup (unless the datasets are already sorted and can be read sequentially).

teelov
Quartz | Level 8

over 30M records this ran in under 10 seconds, thank you for your help, very easy to understand and something i can no implement in DI Studio in the Lookup transformation

sas-innovate-white.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.

 

Early bird rate extended! Save $200 when you sign up by March 31.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2392 views
  • 1 like
  • 4 in conversation