Finding the closest value from two datasets

davidsmarch · Posted 11-17-2019 05:16 PM

I saw some previous solutions that don't quite fit, so I though I'd try to see if I could get some help.

I have two datasets. One dataset (time.dat) is very long (millions of rows) and contains three variables: subject, column, and time_n in this format:

Obs subj col time_n
1 2 1 1564623266038
2 2 2 1564623266044
3 2 3 1564623266058
4 2 4 1564623266073
5 2 5 1564623266090
6 2 6 1564623266106

Each subject has 1000's of columns that each have a unique time_n.

The second set (onloadtime.dat) has all of the same subjects and three variables, subject, trial, and onloadtime_n in this format:

Obs subj trial onloadtime_n
1 2 1 1564623265779
2 2 2 1564623272356
3 2 3 1564623275373
4 2 4 1564623278728
5 2 5 1564623282137

Each subject only has 180 onloadtime's. What I need to do is find the closest time_n to each onloadtime_n and to indicate which trial corresponds to that time. Such that the resultant data looks like:

Obs subj col time_n onloadtime_n trial
1 2 1 1564623266038 1564623265779 1
2 2 2 1564623266044
3 2 3 1564623266058
4 2 4 1564623266073
5 2 5 1564623266090
6 2 6 1564623266106

.

450 2 450 1564623272324 1564623272356 2

And to repeat this per subject. Essentially I need to know what trial in the shorter set (onloadtime) corresponds to what closest column in the long set (time), and to the indicate what time trial that actually is. I hope that makes sense. I'd love any support and would be happy to answer any questions.

hashman · Posted 11-19-2019 11:26 PM

@davidsmarch:

To find an answer, let's ask a question first: Suppose you have a series of values:

1 3 7 11 19

and want to find a value closest to K=10. It's obvious that the answer is 11, but let's consider how we arrive at that, and if we did, we would realize that we actually insert 10 into the sorted list:

1 3 7 10 11 19

and then examine its neighbors, then select the one whose distance from K is shorter. If we had a tie (for example, if K=9), we would choose depending on an extra criterion. Now we could structure a program the same way by inserting K into a sorted array from left to right and shifting the items > K upwards after we have found where to stop. However, if the array were large (and in your case it would be), that would be costly.

A much better way is to perform a binary search on K=10. Of course, it will not be found; but at the end, the binary search midpoint index will settle either on either A[3]=7 or A[4]=11, depending on how you code the search. Now knowing both, we can easily decide which one is the closest to K=10 without doing a costly insertion.

In your case, the file from which an array has to be populated for each separate SUBJ is #1, so let's assume it is sorted by SUBJ and TIME_N. The other file has to be sorted only by SUBJ, as each ONLOADTIME_N will have to be examined against the sorted TIME_N for a given SUBJ. Having decided on that, let's translate it into SAS code. Note that below, the sample files contain much simpler values for the time variables to facilitate eyeball testing and also include cases where the two times match exactly - in which case, we can simply pick the item with the index where the binary search has landed. The array dimension 999999 is selected in hope that the number of records in your largest BY group by SUBJ never exceeds this value.

data trial ;                                                                                                                            
  input subj trial onloadtime_n ;                                                                                                       
  cards ;                                                                                                                               
1  11   1                                                                                                                               
1  12   3                                                                                                                               
1  13   5                                                                                                                               
1  14   7                                                                                                                               
1  15  11                                                                                                                               
2  21  13                                                                                                                               
2  22  17                                                                                                                               
2  23  19                                                                                                                               
2  24  23                                                                                                                               
2  25  29                                                                                                                               
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data col ;                                                                                                                              
  input subj col time_n ;                                                                                                               
  cards ;                                                                                                                               
1  1  10                                                                                                                                
1  2   3                                                                                                                                
1  3   8                                                                                                                                
1  4   0                                                                                                                                
1  5  12                                                                                                                                
1  6   4                                                                                                                                
2  1  24                                                                                                                                
2  2  20                                                                                                                                
2  3  12                                                                                                                                
2  4  27                                                                                                                                
2  5  17                                                                                                                                
2  6  10                                                                                                                                
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = f i j k n) ;                                                                                                          
  array t [2,0:99999] _temporary_ ;                                                                                                     
  do n = 1 by 1 until (last.subj) ;                                                                                                     
    set trial ;                                                                                                                         
    by subj ;                                                                                                                           
    t[1,n] = onloadtime_n ;                                                                                                             
    t[2,n] = trial ;                                                                                                                    
  end ;                                                                                                                                 
  do i = 1, 2 ;                                                                                                                         
    t[i,1-1] = t[i,1] ;                                                                                                                 
    t[i,n+1] = t[i,n] ;                                                                                                                 
  end ;                                                                                                                                 
  do until (last.subj) ;                                                                                                                
    set col ;                                                                                                                           
    by subj ;                                                                                                                           
    i = 1 ;                                                                                                                             
    j = n ;                                                                                                                             
    do f = 0 by 0 until (i > j or f) ;                                                                                                  
      k = floor (divide (i + j, 2)) ;                                                                                                   
      if      time_n < t[1,k] then j = k - 1 ;                                                                                          
      else if time_n > t[1,k] then i = k + 1 ;                                                                                          
      else f = 1 ;                                                                                                                      
    end ;                                                                                                                               
    if f or n = 1 then k = k ;                                                                                                          
    else if time_n < t[1,k] then k = ifn (t[1,k-1] + t[1,k+0] > 2 * time_n, k-1, k+0) ;                                                 
    else                         k = ifn (t[1,k-0] + t[1,k+1] > 2 * time_n, k-0, k+1) ;                                                 
    onloadtime_n = t[1,k] ;                                                                                                             
    trial        = t[2,k] ;                                                                                                             
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;

The DO I=1,2 loop is used to set two sentinels at the bottom and top of the array to avoid writing extra logic for the case when at the end of the binary search K lands on the first or last item.

Now the idea of inserting K into a sorted list would look much brighter if instead of the array we used an ordered hash table, as in this case the insertion is done in O(1) time, rather than O(N) time, simply by calling the ADD method if K is not in the table (if it is, it's trivial since the two time values just match). So, the insertion is not a problem in this case. However, the logic of examining the neighbors of the inserted key, though not particularly complicated, is much less straightforward than in the case of the array. If someone should like to use this problem as a vehicle for a nice exercise at using a hash iterator to implement it, be my guest. A hint:

After inserting the key, you will need to call the SETCUR iterator method to latch onto the inserted item
Then make the iterator dance up and down from there using the PREV/NEXT iterator methods to compare the SETCUR item with its neighbors

Kind regards

Paul D.

Finding the closest value from two datasets

Re: Finding the closest value from two datasets

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!