@lydiawawa:
What you are asking to do is easy: just assign an appropriate expression to the KEY argument tag when the CHECK method is called, for example:
data have ;
input @1 date $23. x y :$1. z :$3. ;
cards ;
2018-04-03 03:44:18.728 1 A A01
2018-04-03 07:40:02.221 2 B B02
2018-05-03 09:20:20.135 3 C C03
2018-06-03 14:50:11.752 4 D D04
2018-07-03 02:42:17.005 5 E E05
2018-08-05 01:22:20.264 6 F F06
2018-01-06 04:45:49.402 7 G G07
2018-11-06 04:09:50.710 8 H H08
2018-07-07 04:12:31.623 9 I I09
2018-12-11 04:11:01.528 10 J J10
;
run ;
data dates ;
input m_yy $4. ;
cards ;
4-03
1-06
;
run ;
data want (drop = m_yy) ;
if _n_ = 1 then do ;
if 0 then set dates ; * never executes - only makes M_YY known to the compiler ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
end ;
set have ;
if h.check (key: put (substr (date, 7), $4.)) = 0 ;
run ;
However, from what you've said I suspect that's not the real snag. If you run into a problem just applying a WHERE clause to the input, hash code akin to the above isn't going to help you much. Most likely, your input has a "tail" of satellite variables, much more numerous than the X Y Z I've included for the sake of a demo; and though WHERE moves only the records that qualify from the buffer to the PDV, it doesn't relieve the I/O burden created by the satellites enough. With hash code like the above, things are even worse, because every record gets moved from the buffer into the PDV before the unwanted ones get discarded by the subsetting IF.
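(For reference, the plain WHERE route mentioned above is just something like the minimal sketch below. The two month-day values from DATES are hard-coded purely for illustration; with a real list you would build it into a macro variable first or stick with the hash lookup.)
data want ;
set have (where = (substr (date, 7, 4) in ("4-03", "1-06"))) ;
run ;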
Hence, the strategy needs to be adjusted depending on the nature of your input data. Just to outline two extreme scenarios, suppose that the records you end up selecting constitute:
(1) but a fraction of the whole input, or
(2) the lion's share of the input.
In both cases, you have very few records to either (1) filter in or (2) filter out. And in both cases, it makes sense to first identify those records by observation number with as little computer resource pain as possible, and then (1) employ some tactic to get those you need or (2) mark those you don't want as deleted. Either way, at this stage it makes sense to drop all the satellite variables, read in only the key, and apply the subsetting criteria to it, so that you end up with a list of record IDs (i.e. the observation numbers) you either (1) want or (2) don't want.
Let's first look at #1:
data want (drop = m_yy) ;
* hash to lookup m_yy ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
dcl hash r () ;
* hash to store filtered-in RIDs ;
r.definekey ("rid") ;
r.definedone () ;
dcl hiter ir ("r") ;
* find needed RIDs ;
do rid = 1 by 1 until (lr) ;
* KEEP is critical in SET below ;
set have (keep = date) end = lr ;
if h.check (key: put (substr (date, 7), $4.)) = 0 then r.add() ;
end ;
* select only records with RIDs in hash R from HAVE ;
do while (ir.next() = 0) ;
set have point = rid ;
output ;
end ;
stop ;
* never executes - only makes M_YY known to the compiler ;
set dates ;
run ;
In the extreme scenario #2, you don't want any record whose M_YY is in the data set DATES:
data discard (keep = rid) ;
if _n_ = 1 then do ;
if 0 then set dates ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
end ;
* KEEP is critical in SET below ;
set have (keep = date) ;
* find UNneeded RIDs ;
if h.check (key: put (substr (date, 7), $4.)) = 0 ;
rid = _n_ ;
run ;
data have ;
set discard ;
modify have point = rid ;
remove ;
run ;
In this case, you just make a list of the unwanted RIDs in the first step (which reads nothing but the key) and, in the second step, use that list to mark the respective records in the data set HAVE itself as "deleted". This way, you (a) still have never read anything from HAVE except the key and (b) have never written out a huge data set with all the satellites and only a few records discarded. You've merely marked the unwanted records as "deleted" in HAVE. So, in your program downstream you will just read the data set HAVE, and all the records marked for deletion will be automatically ignored.
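Just to illustrate the downstream point with a quick sketch: any later step that reads HAVE skips the removed records automatically, and PROC CONTENTS reports how many are flagged under Deleted Observations.
* downstream step: the removed records are skipped automatically ;
data downstream ;
set have ;
run ;
* CONTENTS shows the count of records flagged as deleted ;
proc contents data = have ;
run ;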
Of course, there are other scenarios in between these two extremes, but you should be getting the drift. When you deal with voluminous data, things are not always clear-cut and one has to be inventive; it's an art as much as a science. At times, to engineer a successful ETL, one needs to first do a distribution analysis on the keys only and then write a dynamic program smart enough to choose a subsetting tactic based on the distribution.
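Just to give a flavor of what such a key-only pre-pass might look like, here's a rough sketch only - the 10 percent threshold, the macro variable HIT_SHARE, and the macro SUBSET are all made up for illustration:
* pre-pass: read only the key and count the qualifying records ;
data _null_ ;
if _n_ = 1 then do ;
if 0 then set dates ;
dcl hash h (dataset: "dates") ;
h.definekey ("m_yy") ;
h.definedone () ;
end ;
set have (keep = date) end = lr ;
if h.check (key: put (substr (date, 7), $4.)) = 0 then hits + 1 ;
if lr then call symputx ("hit_share", hits / _n_) ;
run ;
* choose a tactic based on the share of qualifying records ;
%macro subset ;
%if %sysevalf(&hit_share < 0.1) %then %do ;
* few hits: retrieve them by POINT= as in scenario #1 ... ;
%end ;
%else %do ;
* mostly hits: mark the rest as deleted as in scenario #2 ... ;
%end ;
%mend ;
%subset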
Kind regards
Paul D.