Hello,
I'm hoping that someone can help me create a flag after joining two tables together and omitting the records which were not in B (IF A AND NOT B) - Identify and flag the KEY whereby there are additional REF_NOs in dataset 1 that do not exist in dataset 2 (but the KEY does exist in both) – e.g. ref 20 and 30. IF there are KEYS in dataset 1 that aren’t in 2 then I’m not interested in those (KEY 3 in this example)
DATA WORK.DS1;
LENGTH
key 8
ref_no 8
colour $ 6 ;
FORMAT
key BEST12.
ref_no BEST12.
colour $CHAR6. ;
INFORMAT
key BEST12.
ref_no BEST12.
colour $CHAR6. ;
INFILE DATALINES4
DLM=','
MISSOVER
DSD ;
INPUT
key : BEST32.
ref_no : BEST32.
colour : $CHAR6. ;
DATALINES4;
1,10,red
1,20,black
1,30,blue
1,50,red
2,70,pink
2,75,blue
3,100,black
3,110,blue
4,250,red
4,275,green
4,250,orange
4,275,purple
;;;;
DATA WORK.DS2;
LENGTH
key 8
ref_no 8
make $ 8
model $ 1 ;
FORMAT
key BEST12.
ref_no BEST12.
make $CHAR8.
model $CHAR1. ;
INFORMAT
key BEST12.
ref_no BEST12.
make $CHAR8.
model $CHAR1. ;
INFILE DATALINES4
DLM=','
MISSOVER
DSD ;
INPUT
key : BEST32.
ref_no : BEST32.
make : $CHAR8.
model : $CHAR1. ;
DATALINES4;
1,10,sony,
1,50,casio,
2,70,sony,
2,75,samsung,
4,200,phillips,
4,225,sony,
4,250,samsung,
4,275,phillips,
;;;;
code for the join
proc sql ;
create table MS_OUTPUT as
select L.KEY,
L.REF_NO,
L.Colour,
R.Make,
R.Model
from DS1 L
left join
DS2 R
on L.KEY=R.KEY
and L.REF_NO=R.REF_NO
where L.KEY in (select KEY from DS2)
order by KEY, REF_NO;
quit;
WANT
data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10 red
1 20 black
1 30 blue
1 50 red
2 70 pink
2 75 blue
3 100 black
3 110 blue
4 200 red
4 225 green
4 250 orange
4 275 purple
;
data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10 sony
1 50 casio
2 70 sony
2 75 samsung
4 200 phillips
4 225 sony
4 250 samsung
4 275 phillips
;
data want(drop=rc);
if _N_ = 1 then do;
declare hash h1 (dataset : 'ds2');
h1.definekey ('key');
h1.definedone ();
declare hash h2 (dataset : 'ds2', ordered : 'A');
h2.definekey ('key', 'ref_no');
h2.definedata (all : 'Y');
h2.definedone ();
if 0 then set ds2;
end;
do _N_ = 1 by 1 until (last.key);
set ds1;
by key;
if h2.check() then flag = 1;
end;
do _N_ = 1 to _N_;
set ds1;
call missing (Make, Model);
rc = h2.find();
if h1.check() = 0 then output;
end;
run;
Is it a requirement to use Proc SQL?
@teelov wrote:
No, but my thinking was these datasets are in the 10s of millions so sorting them before hand could be an overhead, but open to suggestions.
A join in SQL will need an implicit sort to be done anyway (unless you have indexes defined, or your libraries point to a DBMS where the sort can be done in-memory); very often a SQL join on big datasets is less performant than the sorts and a DATA step MERGE.
If you provide your datasets in usable form (data steps with datalines), I'll be happy to rub my brain against this.
Hi @teelov If I understand your requirement correctly, it's pretty straight forward SQL
data have1;
input KEY REF_NO Colour :$10. Type Dimensions;
cards;
1 10 red . .
1 20 black . .
1 30 blue . .
1 50 red . .
2 70 pink . .
2 75 blue . .
3 100 black . .
3 110 blue . .
4 200 red . .
4 225 green . .
4 250 orange . .
4 275 purple . .
;
data have2;
input KEY REF_NO Make :$10. Model;
cards;
1 10 sony .
1 50 casio .
2 70 sony .
2 75 samsung .
4 200 phillips .
4 225 sony .
4 250 samsung .
4 275 phillips .
;
proc sql;
create table want as
select a.*,make, model,n(b.ref_no) and sum(a.ref_no ne b.ref_no) as flag
from have1 a left join have2 b
on a.key=b.key and a.ref_no=b.ref_no
group by a.key
having n(b.ref_no);
quit;
KEY | REF_NO | Colour | Type | Dimensions | Make | Model | flag |
---|---|---|---|---|---|---|---|
1 | 10 | red | . | . | sony | . | 1 |
1 | 20 | black | . | . | . | 1 | |
1 | 30 | blue | . | . | . | 1 | |
1 | 50 | red | . | . | casio | . | 1 |
2 | 70 | pink | . | . | sony | . | 0 |
2 | 75 | blue | . | . | samsung | . | 0 |
4 | 200 | red | . | . | phillips | . | 0 |
4 | 225 | green | . | . | sony | . | 0 |
4 | 250 | orange | . | . | samsung | . | 0 |
4 | 275 | purple | . | . | phillips | . | 0 |
Thank you for engaging with me on this. really good example to help me using SQL more
data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10 red
1 20 black
1 30 blue
1 50 red
2 70 pink
2 75 blue
3 100 black
3 110 blue
4 200 red
4 225 green
4 250 orange
4 275 purple
;
data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10 sony
1 50 casio
2 70 sony
2 75 samsung
4 200 phillips
4 225 sony
4 250 samsung
4 275 phillips
;
data want(drop=rc);
if _N_ = 1 then do;
declare hash h1 (dataset : 'ds2');
h1.definekey ('key');
h1.definedone ();
declare hash h2 (dataset : 'ds2', ordered : 'A');
h2.definekey ('key', 'ref_no');
h2.definedata (all : 'Y');
h2.definedone ();
if 0 then set ds2;
end;
do _N_ = 1 by 1 until (last.key);
set ds1;
by key;
if h2.check() then flag = 1;
end;
do _N_ = 1 to _N_;
set ds1;
call missing (Make, Model);
rc = h2.find();
if h1.check() = 0 then output;
end;
run;
@teelov wrote:
Amazing - i didnt think of using Hash tables, but thinking about it me left table will always be 4 to 5 x bigger than migh right table
Anytime you can fit a hash object into memory, it will provide the fastest way to do a join or a lookup (unless the datasets are already sorted and can be read sequentially).
over 30M records this ran in under 10 seconds, thank you for your help, very easy to understand and something i can no implement in DI Studio in the Lookup transformation
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.