Preface:
I've been working with SAS for 20 years but slightly longer with sql and databases. So when I need to join data, I use proc sql. But SAS is a multi-headed beast (or maybe a multi-function tool like a Leatherman) and I think there is a data step/merge technique that can help me. I'm reaching out before jumping right into the code because I'm afraid of being bitten in the backside by one of SAS's many heads. (It's happened before so once bitten, twice shy.)
Here's the problem: My data comes from a third party and is not well conditioned.
I have a dataset of orders that have an ID (a unique key) and some other fields that can identify a row (part number, order time, and quantity) but not uniquely. I also have a dataset of order cancellations, but they do not refer back to the order ID being cancelled, instead the cancellation contains some of the non-unique fields (part number and quantity). It is true, however, that an order can only be cancelled once and that a cancellation refers to only one order (i.e. cancellations get used up once they cancel their order).
You can see how sql (left) joining cancellations to orders on part number and qty won't work: the non-uniqueness will cause a multiplicity in the output. But you can also see how one could match a cancel to its order, say, by matching it with the oldest order that has the same part number and quantity and marking both the order and cancellation as "matched" so they will not match again. This requires a kind of loop that sql does not do, but that a crafty data step probably does well. But I've avoided crafty data steps (mostly because I used to work with data designed to work in sql databases) until now and lack the craft.
Would anyone like to enlighten me?
Use a many-many merge of the orders and cancellations. Reset the IN= flags.
data orders ;
ordernum +1 ;
input part qty @@ ;
cards;
A 10 B 4 B 4 B 5 C 3 C 12 D 5 D 6
run;
data cancels ;
cancelnum +1 ;
input part qty @@ ;
cards;
B 4 C 12 D 9
run;
data cancelled (keep=ordernum cancelnum);
call missing(in1,in2);
merge orders(in=in1) cancels(in=in2);
by part qty ;
if in2 and not in1 then put cancelnum= 'of ' part= qty= 'does not match any order.';
if in1 and in2 ;
put ordernum= 'of ' part= qty= 'was cancelled by ' cancelnum= ;
run;
ordernum=2 of part=B qty=4 was cancelled by cancelnum=1
ordernum=6 of part=C qty=12 was cancelled by cancelnum=2
cancelnum=3 of part=D qty=9 does not match any order.
so some sample data would help...but..
how do you know which order gets cancelled? There has to be something in the cancellation data that you can use to uniquely identify the record, even it it isn't id. Would it be sufficient to match on part#, order time, and quantity?
rereading your op...the question I would have is how would I know that I'm cancelling the correct order if I use the loop you describe?
>>> There has to be something in the cancellation data that you can use to uniquely identify the record, even it it isn't id.
Nope, there isn't. If there were, I could do a sql join on it.
This means I may incorrectly apply a cancellation to an order, but I don't think it matters because orders are fungible. All unfilled orders will be cancelled, so the a mismatch only means that I might mis-assign the cancellation time. That's a price I can pay given the ill-conditioned data.
Are you perhaps just trying to count the unfulfilled orders? if so, then you might could just sum the part# and Quantity from the orders and subtract the cancels to get to the # currently ordered...Also, do you have to worry about true duplicate orders in the cancellations (two separate orders for the same part/quantity get cancelled)?
Here's one approach that doesn't care about potential duplicate cancellations.
proc sql;
create table cancels as
select
t1.partno,
t1.quantity,
min(t2.id) as FirstCancelID
from
dsCancels t1
inner join dsOrders t2
on t1.partno=t2.partno
and t1.quantity=t2.quantity
group by t1.partno, t1.quantity;
create table non_cancelled_orders as
select
t1.*
from
dsOrders t1
left outer join cancels t2
on t1.id = t2.firstcancelid
where
t2.id is null;
quit;
>>> Also, do you have to worry about true duplicate orders in the cancellations (two separate orders for the same part/quantity get cancelled)?
I do care and this is the crux of the matter, and I think this is why the sql approach does not work. I tried your approach on my data and I get 2 cancellations applying to the same original order because "min(t2.id) as FirstCancelID" was the same for both of them. Any subsequent joining on FirstCancelID yields dups in the result.
BTW, the reason that I did not supply simplified test data is that I feel that it will induce simplified responses that work on the test case but not in real life. (I.e. I was trying to keep the solution abstract.) I know it's an imposition, but no one is forced to respond.
Thanks for your replies so far.
I would vote for using a hash table to implement this. Without sample data I can't really supply sample code, but you can look about on the web; Paul Dorfman has written a few hundred papers on the subject, among others. Basically load the cancellation table into a hash table, then as you go through the data step query the hash table, and if you find a cancellation then both mark that dataset record as cancelled, and remove the record in the hash table. You have to be a little careful here if you might have duplicate cancellations, which I assume is a possibility; you probably need 9.2 or newer to do this properly (or even 9.3, I forget when they added the more advanced treatment of duplicate records in a hash table). This requires the ability of the cancellation table to fit entirely into memory, by the way, so be aware of that if it's really big.
This is the kind of thing I was looking for. Thanks. I have lot's of RAM 😉
FYI, I spend time reading about updating/merging in data steps, but that's not the answer. DOW loops might be, but I find them opaque and the Dorfman paper I have is all about computing summary stats so I'm not sure the merge is DOW-able.
At the moment I've implemented a form of DBaily's approach, but removed the dups which also means some cancels get ignored, so I iterate to apply unused cancels. It's clunky.
DoW loop doesn't make any sense here, that's for iterating through one dataset by group usually (SQL equivalent would be doing a group by summarization then merging back to original table). Doesn't really have anything to do with merging. You really need some method of separately iterating through two datasets, which SAS really only gives you hash tables for. It's possible to do something with POINT= or something similar, but the querying is just far, far easier in hash than in a second set statement or any sort of merge.
Use a many-many merge of the orders and cancellations. Reset the IN= flags.
data orders ;
ordernum +1 ;
input part qty @@ ;
cards;
A 10 B 4 B 4 B 5 C 3 C 12 D 5 D 6
run;
data cancels ;
cancelnum +1 ;
input part qty @@ ;
cards;
B 4 C 12 D 9
run;
data cancelled (keep=ordernum cancelnum);
call missing(in1,in2);
merge orders(in=in1) cancels(in=in2);
by part qty ;
if in2 and not in1 then put cancelnum= 'of ' part= qty= 'does not match any order.';
if in1 and in2 ;
put ordernum= 'of ' part= qty= 'was cancelled by ' cancelnum= ;
run;
ordernum=2 of part=B qty=4 was cancelled by cancelnum=1
ordernum=6 of part=C qty=12 was cancelled by cancelnum=2
cancelnum=3 of part=D qty=9 does not match any order.
This is what I was looking for. I've implemented this on my full data and I think it's working.
hello all,
I am just wondering how call missing() works.As a sas learner I know just defination of this function as it assigns missing values to variables.I would like learn how it is useful in this programme .
Thank you
That is what is allowing it to make sure that each order and each cancellation is only used to match one time.
What need to understand is how the IN= variables are populated. You also need to understand what SAS does when merging where two or more of the datasets involved have multiple observations for the same BY group. You also need to realize that SAS will treat both IN1=0 and IN1=. as false. I could have used IN1=0; IN2=0; in place of CALL MISSING(IN1,IN2); and gotten the same results.
Normally when a dataset contributes ANY observations to that BY group the IN= variable will be 1 (TRUE) for all of the observations. By setting it to missing (or setting it 0) then SAS will not reset it to 1 until it reads another observation from that datasets.
Here is a simple example that creates by groups that have 1-1, 2-1, 3-2 and 1-0 matches. Run it with and without the CALL MISSING statement and see how it changes the value of the IN1 and IN2 variables. Look closely at by group where ID=2 or 3.
data one;
input id @@;
x+1;
cards;
1 2 2 3 3 3 4
run;
data two;
input id @@;
y+1;
cards;
1 2 3 3
run;
data both ;
call missing(in1,in2);
merge one(in=in1) two (in=in2);
by id;
if first.id then put / id=;
put (in1 x in2 y) (=);
run;
id=1
in1=1 x=1 in2=1 y=1
id=2
in1=1 x=2 in2=1 y=2
in1=1 x=3 in2=. y=2
id=3
in1=1 x=4 in2=1 y=3
in1=1 x=5 in2=1 y=4
in1=1 x=6 in2=. y=4
id=4
in1=1 x=7 in2=0 y=.
Thank you Tom. Appreciate your prompt response
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.