@davidsmarch:
To find an answer, let's ask a question first: Suppose you have a series of values:
1 3 7 11 19
and want to find a value closest to K=10. It's obvious that the answer is 11, but let's consider how we arrive at that, and if we did, we would realize that we actually insert 10 into the sorted list:
1 3 7 10 11 19
and then examine its neighbors, then select the one whose distance from K is shorter. If we had a tie (for example, if K=9), we would choose depending on an extra criterion. Now we could structure a program the same way by inserting K into a sorted array from left to right and shifting the items > K upwards after we have found where to stop. However, if the array were large (and in your case it would be), that would be costly.
A much better way is to perform a binary search on K=10. Of course, it will not be found; but at the end, the binary search midpoint index will settle either on either A[3]=7 or A[4]=11, depending on how you code the search. Now knowing both, we can easily decide which one is the closest to K=10 without doing a costly insertion.
In your case, the file from which an array has to be populated for each separate SUBJ is #1, so let's assume it is sorted by SUBJ and TIME_N. The other file has to be sorted only by SUBJ, as each ONLOADTIME_N will have to be examined against the sorted TIME_N for a given SUBJ. Having decided on that, let's translate it into SAS code. Note that below, the sample files contain much simpler values for the time variables to facilitate eyeball testing and also include cases where the two times match exactly - in which case, we can simply pick the item with the index where the binary search has landed. The array dimension 999999 is selected in hope that the number of records in your largest BY group by SUBJ never exceeds this value.
data trial ;
input subj trial onloadtime_n ;
cards ;
1 11 1
1 12 3
1 13 5
1 14 7
1 15 11
2 21 13
2 22 17
2 23 19
2 24 23
2 25 29
;
run ;
data col ;
input subj col time_n ;
cards ;
1 1 10
1 2 3
1 3 8
1 4 0
1 5 12
1 6 4
2 1 24
2 2 20
2 3 12
2 4 27
2 5 17
2 6 10
;
run ;
data want (drop = f i j k n) ;
array t [2,0:99999] _temporary_ ;
do n = 1 by 1 until (last.subj) ;
set trial ;
by subj ;
t[1,n] = onloadtime_n ;
t[2,n] = trial ;
end ;
do i = 1, 2 ;
t[i,1-1] = t[i,1] ;
t[i,n+1] = t[i,n] ;
end ;
do until (last.subj) ;
set col ;
by subj ;
i = 1 ;
j = n ;
do f = 0 by 0 until (i > j or f) ;
k = floor (divide (i + j, 2)) ;
if time_n < t[1,k] then j = k - 1 ;
else if time_n > t[1,k] then i = k + 1 ;
else f = 1 ;
end ;
if f or n = 1 then k = k ;
else if time_n < t[1,k] then k = ifn (t[1,k-1] + t[1,k+0] > 2 * time_n, k-1, k+0) ;
else k = ifn (t[1,k-0] + t[1,k+1] > 2 * time_n, k-0, k+1) ;
onloadtime_n = t[1,k] ;
trial = t[2,k] ;
output ;
end ;
run ;
The DO I=1,2 loop is used to set two sentinels at the bottom and top of the array to avoid writing extra logic for the case when at the end of the binary search K lands on the first or last item.
Now the idea of inserting K into a sorted list would look much brighter if instead of the array we used an ordered hash table, as in this case the insertion is done in O(1) time, rather than O(N) time, simply by calling the ADD method if K is not in the table (if it is, it's trivial since the two time values just match). So, the insertion is not a problem in this case. However, the logic of examining the neighbors of the inserted key, though not particularly complicated, is much less straightforward than in the case of the array. If someone should like to use this problem as a vehicle for a nice exercise at using a hash iterator to implement it, be my guest. A hint:
After inserting the key, you will need to call the SETCUR iterator method to latch onto the inserted item
Then make the iterator dance up and down from there using the PREV/NEXT iterator methods to compare the SETCUR item with its neighbors
Kind regards
Paul D.
... View more