- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I'm hoping that someone can help me create a flag after joining two tables together and omitting the records which were not in B (IF A AND NOT B) - Identify and flag the KEY whereby there are additional REF_NOs in dataset 1 that do not exist in dataset 2 (but the KEY does exist in both) – e.g. ref 20 and 30. IF there are KEYS in dataset 1 that aren’t in 2 then I’m not interested in those (KEY 3 in this example)
DATA WORK.DS1;
LENGTH
key 8
ref_no 8
colour $ 6 ;
FORMAT
key BEST12.
ref_no BEST12.
colour $CHAR6. ;
INFORMAT
key BEST12.
ref_no BEST12.
colour $CHAR6. ;
INFILE DATALINES4
DLM=','
MISSOVER
DSD ;
INPUT
key : BEST32.
ref_no : BEST32.
colour : $CHAR6. ;
DATALINES4;
1,10,red
1,20,black
1,30,blue
1,50,red
2,70,pink
2,75,blue
3,100,black
3,110,blue
4,250,red
4,275,green
4,250,orange
4,275,purple
;;;;DATA WORK.DS2;
LENGTH
key 8
ref_no 8
make $ 8
model $ 1 ;
FORMAT
key BEST12.
ref_no BEST12.
make $CHAR8.
model $CHAR1. ;
INFORMAT
key BEST12.
ref_no BEST12.
make $CHAR8.
model $CHAR1. ;
INFILE DATALINES4
DLM=','
MISSOVER
DSD ;
INPUT
key : BEST32.
ref_no : BEST32.
make : $CHAR8.
model : $CHAR1. ;
DATALINES4;
1,10,sony,
1,50,casio,
2,70,sony,
2,75,samsung,
4,200,phillips,
4,225,sony,
4,250,samsung,
4,275,phillips,
;;;;code for the join
proc sql ;
create table MS_OUTPUT as
select L.KEY,
L.REF_NO,
L.Colour,
R.Make,
R.Model
from DS1 L
left join
DS2 R
on L.KEY=R.KEY
and L.REF_NO=R.REF_NO
where L.KEY in (select KEY from DS2)
order by KEY, REF_NO;
quit;
WANT
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10 red
1 20 black
1 30 blue
1 50 red
2 70 pink
2 75 blue
3 100 black
3 110 blue
4 200 red
4 225 green
4 250 orange
4 275 purple
;
data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10 sony
1 50 casio
2 70 sony
2 75 samsung
4 200 phillips
4 225 sony
4 250 samsung
4 275 phillips
;
data want(drop=rc);
if _N_ = 1 then do;
declare hash h1 (dataset : 'ds2');
h1.definekey ('key');
h1.definedone ();
declare hash h2 (dataset : 'ds2', ordered : 'A');
h2.definekey ('key', 'ref_no');
h2.definedata (all : 'Y');
h2.definedone ();
if 0 then set ds2;
end;
do _N_ = 1 by 1 until (last.key);
set ds1;
by key;
if h2.check() then flag = 1;
end;
do _N_ = 1 to _N_;
set ds1;
call missing (Make, Model);
rc = h2.find();
if h1.check() = 0 then output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Is it a requirement to use Proc SQL?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@teelov wrote:
No, but my thinking was these datasets are in the 10s of millions so sorting them before hand could be an overhead, but open to suggestions.
A join in SQL will need an implicit sort to be done anyway (unless you have indexes defined, or your libraries point to a DBMS where the sort can be done in-memory); very often a SQL join on big datasets is less performant than the sorts and a DATA step MERGE.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you provide your datasets in usable form (data steps with datalines), I'll be happy to rub my brain against this.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @teelov If I understand your requirement correctly, it's pretty straight forward SQL
data have1;
input KEY REF_NO Colour :$10. Type Dimensions;
cards;
1 10 red . .
1 20 black . .
1 30 blue . .
1 50 red . .
2 70 pink . .
2 75 blue . .
3 100 black . .
3 110 blue . .
4 200 red . .
4 225 green . .
4 250 orange . .
4 275 purple . .
;
data have2;
input KEY REF_NO Make :$10. Model;
cards;
1 10 sony .
1 50 casio .
2 70 sony .
2 75 samsung .
4 200 phillips .
4 225 sony .
4 250 samsung .
4 275 phillips .
;
proc sql;
create table want as
select a.*,make, model,n(b.ref_no) and sum(a.ref_no ne b.ref_no) as flag
from have1 a left join have2 b
on a.key=b.key and a.ref_no=b.ref_no
group by a.key
having n(b.ref_no);
quit;
| KEY | REF_NO | Colour | Type | Dimensions | Make | Model | flag |
|---|---|---|---|---|---|---|---|
| 1 | 10 | red | . | . | sony | . | 1 |
| 1 | 20 | black | . | . | . | 1 | |
| 1 | 30 | blue | . | . | . | 1 | |
| 1 | 50 | red | . | . | casio | . | 1 |
| 2 | 70 | pink | . | . | sony | . | 0 |
| 2 | 75 | blue | . | . | samsung | . | 0 |
| 4 | 200 | red | . | . | phillips | . | 0 |
| 4 | 225 | green | . | . | sony | . | 0 |
| 4 | 250 | orange | . | . | samsung | . | 0 |
| 4 | 275 | purple | . | . | phillips | . | 0 |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for engaging with me on this. really good example to help me using SQL more
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data ds1;
infile datalines missover;
input key ref_no Colour $ Type $ Dimensions;
datalines;
1 10 red
1 20 black
1 30 blue
1 50 red
2 70 pink
2 75 blue
3 100 black
3 110 blue
4 200 red
4 225 green
4 250 orange
4 275 purple
;
data ds2;
infile datalines missover;
input key ref_no Make $ Model $;
datalines;
1 10 sony
1 50 casio
2 70 sony
2 75 samsung
4 200 phillips
4 225 sony
4 250 samsung
4 275 phillips
;
data want(drop=rc);
if _N_ = 1 then do;
declare hash h1 (dataset : 'ds2');
h1.definekey ('key');
h1.definedone ();
declare hash h2 (dataset : 'ds2', ordered : 'A');
h2.definekey ('key', 'ref_no');
h2.definedata (all : 'Y');
h2.definedone ();
if 0 then set ds2;
end;
do _N_ = 1 by 1 until (last.key);
set ds1;
by key;
if h2.check() then flag = 1;
end;
do _N_ = 1 to _N_;
set ds1;
call missing (Make, Model);
rc = h2.find();
if h1.check() = 0 then output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@teelov wrote:
Amazing - i didnt think of using Hash tables, but thinking about it me left table will always be 4 to 5 x bigger than migh right table
Anytime you can fit a hash object into memory, it will provide the fastest way to do a join or a lookup (unless the datasets are already sorted and can be read sequentially).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
over 30M records this ran in under 10 seconds, thank you for your help, very easy to understand and something i can no implement in DI Studio in the Lookup transformation