BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
teelov
Quartz | Level 8

Hello,

I'm hoping that someone can help me create a flag after joining two tables together and omitting the records which were not in B (IF A AND NOT B) - Identify and flag the KEY whereby there are additional REF_NOs in dataset 1 that do not exist in dataset 2 (but the KEY does exist in both) – e.g. ref 20 and 30. IF there are KEYS in dataset 1 that aren’t in 2 then I’m not interested in those (KEY 3 in this example)

 

DATA WORK.DS1;
    LENGTH
        key                8
        ref_no             8
        colour           $ 6 ;
    FORMAT
        key              BEST12.
        ref_no           BEST12.
        colour           $CHAR6. ;
    INFORMAT
        key              BEST12.
        ref_no           BEST12.
        colour           $CHAR6. ;
    INFILE DATALINES4
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        key              : BEST32.
        ref_no           : BEST32.
        colour           : $CHAR6. ;
DATALINES4;
1,10,red
1,20,black
1,30,blue
1,50,red
2,70,pink
2,75,blue
3,100,black
3,110,blue
4,250,red
4,275,green
4,250,orange
4,275,purple
;;;;
DATA WORK.DS2;
    LENGTH
        key                8
        ref_no             8
        make             $ 8
        model            $ 1 ;
    FORMAT
        key              BEST12.
        ref_no           BEST12.
        make             $CHAR8.
        model            $CHAR1. ;
    INFORMAT
        key              BEST12.
        ref_no           BEST12.
        make             $CHAR8.
        model            $CHAR1. ;
    INFILE DATALINES4
        DLM=','
        MISSOVER
        DSD ;
    INPUT
        key              : BEST32.
        ref_no           : BEST32.
        make             : $CHAR8.
        model            : $CHAR1. ;
DATALINES4;
1,10,sony, 
1,50,casio, 
2,70,sony, 
2,75,samsung, 
4,200,phillips, 
4,225,sony, 
4,250,samsung, 
4,275,phillips, 
;;;;

code for the join

 

proc sql ;
     create table MS_OUTPUT as
     select L.KEY,
              L.REF_NO,
              L.Colour,
              R.Make,
              R.Model
              
              from DS1 L 
           left join
           DS2 R
           on L.KEY=R.KEY
           and L.REF_NO=R.REF_NO
           where L.KEY in (select KEY from DS2)
           order by KEY, REF_NO;
quit;

 

WANT

teelov_0-1591271211937.png

 

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20
data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10  red     
1 20  black   
1 30  blue    
1 50  red     
2 70  pink    
2 75  blue    
3 100 black   
3 110 blue    
4 200 red     
4 225 green   
4 250 orange  
4 275 purple  
;

data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10  sony     
1 50  casio    
2 70  sony     
2 75  samsung  
4 200 phillips
4 225 sony    
4 250 samsung 
4 275 phillips
;

data want(drop=rc);
    if _N_ = 1 then do;
        declare hash h1 (dataset : 'ds2');
        h1.definekey ('key');
        h1.definedone (); 
        declare hash h2 (dataset : 'ds2', ordered : 'A');
        h2.definekey ('key', 'ref_no');
        h2.definedata (all : 'Y');
        h2.definedone ();
        if 0 then set ds2;
    end;
    do _N_ = 1 by 1 until (last.key);
        set ds1;
        by key;
        if h2.check() then flag = 1;
    end;
    do _N_ = 1 to _N_;
        set ds1;
        call missing (Make, Model);
        rc = h2.find();
        if h1.check() = 0 then output;
    end;
run;

View solution in original post

11 REPLIES 11
PeterClemmensen
Tourmaline | Level 20

Is it a requirement to use Proc SQL?

teelov
Quartz | Level 8
No, but my thinking was these datasets are in the 10s of millions so sorting them before hand could be an overhead, but open to suggestions.
Kurt_Bremser
Super User

@teelov wrote:
No, but my thinking was these datasets are in the 10s of millions so sorting them before hand could be an overhead, but open to suggestions.

A join in SQL will need an implicit sort to be done anyway (unless you have indexes defined, or your libraries point to a DBMS where the sort can be done in-memory); very often a SQL join on big datasets is less performant than the sorts and a DATA step MERGE.

teelov
Quartz | Level 8
Sure thing, i will change the post
novinosrin
Tourmaline | Level 20

Hi @teelov  If I understand your requirement correctly, it's pretty straight forward SQL

 


data have1;
input KEY	REF_NO	Colour :$10.	Type	Dimensions;
cards;
1	10	red	.	.
1	20	black	.	.
1	30	blue	.	.
1	50	red	.	.
2	70	pink	.	.
2	75	blue	.	.
3	100	black	.	.
3	110	blue	.	.
4	200	red	.	.
4	225	green	.	.
4	250	orange	.	.
4	275	purple	.	.
;

data have2;
input KEY	REF_NO	Make :$10.	Model;
cards;
1	10	sony	.
1	50	casio	.
2	70	sony	.
2	75	samsung	.
4	200	phillips	.
4	225	sony	.
4	250	samsung	.
4	275	phillips	.
;

proc sql;
create table want as
select a.*,make, model,n(b.ref_no) and sum(a.ref_no ne b.ref_no) as flag
from have1 a left join have2 b
on a.key=b.key and a.ref_no=b.ref_no
group by a.key
having n(b.ref_no);
quit;
KEY REF_NO Colour Type Dimensions Make Model flag
1 10 red . . sony . 1
1 20 black . .   . 1
1 30 blue . .   . 1
1 50 red . . casio . 1
2 70 pink . . sony . 0
2 75 blue . . samsung . 0
4 200 red . . phillips . 0
4 225 green . . sony . 0
4 250 orange . . samsung . 0
4 275 purple . . phillips . 0
teelov
Quartz | Level 8

Thank you for engaging with me on this. really good example to help me using SQL more

PeterClemmensen
Tourmaline | Level 20
data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10  red     
1 20  black   
1 30  blue    
1 50  red     
2 70  pink    
2 75  blue    
3 100 black   
3 110 blue    
4 200 red     
4 225 green   
4 250 orange  
4 275 purple  
;

data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10  sony     
1 50  casio    
2 70  sony     
2 75  samsung  
4 200 phillips
4 225 sony    
4 250 samsung 
4 275 phillips
;

data want(drop=rc);
    if _N_ = 1 then do;
        declare hash h1 (dataset : 'ds2');
        h1.definekey ('key');
        h1.definedone (); 
        declare hash h2 (dataset : 'ds2', ordered : 'A');
        h2.definekey ('key', 'ref_no');
        h2.definedata (all : 'Y');
        h2.definedone ();
        if 0 then set ds2;
    end;
    do _N_ = 1 by 1 until (last.key);
        set ds1;
        by key;
        if h2.check() then flag = 1;
    end;
    do _N_ = 1 to _N_;
        set ds1;
        call missing (Make, Model);
        rc = h2.find();
        if h1.check() = 0 then output;
    end;
run;
teelov
Quartz | Level 8
Amazing - i didnt think of using Hash tables, but thinking about it me left table will always be 4 to 5 x bigger than migh right table
Kurt_Bremser
Super User

@teelov wrote:
Amazing - i didnt think of using Hash tables, but thinking about it me left table will always be 4 to 5 x bigger than migh right table

Anytime you can fit a hash object into memory, it will provide the fastest way to do a join or a lookup (unless the datasets are already sorted and can be read sequentially).

teelov
Quartz | Level 8

over 30M records this ran in under 10 seconds, thank you for your help, very easy to understand and something i can no implement in DI Studio in the Lookup transformation

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 1818 views
  • 1 like
  • 4 in conversation