Solved: Finding the closest value from two datasets

davidsmarch · Posted 11-17-2019 09:39 PM

I saw some previous solutions that don't quite fit, so I though I'd try to see if I could get some help.

I have two datasets. One dataset (time.dat) is very long (millions of rows) and contains three variables: subject, column, and time_n in this format:

subj col time_n
2 1 1564623266038
2 2 1564623266044
2 3 1564623266058
2 4 1564623266073
2 5 1564623266090
2 6 1564623266106

.

Each subject has 1000's of columns that each have a unique time_n. There are also many subjects.

The second set (onloadtime.dat) has all of the same subjects and three variables, subject, trial, and onloadtime_n in this format:

subj trial onloadtime_n
2 1 1564623265779
2 2 1564623272356
2 3 1564623275373
2 4 1564623278728

2 5 1564623282137

Each subject only has 180 onloadtime's. What I need to do is find the closest time_n to each onloadtime_n and to indicate which trial corresponds to that time. Such that the resultant data looks like:

subj col time_n onloadtime_n trial
2 1 1564623266038 1564623265779 1
2 2 1564623266044
2 3 1564623266058
2 4 1564623266073
2 5 1564623266090
2 6 1564623266106

.

2 450 1564623272324 1564623272356 2

And to repeat this per subject. Essentially I need to know what trial in the shorter set (onloadtime) corresponds to what closest column in the long set (time), and to the indicate what time trial that actually is. I hope that makes sense. I'd love any support and would be happy to answer any questions.

s_lassen · Posted 11-18-2019 08:55 AM

You can, of course, write an elegant solution in SQL, submit it, and wait a couple of days before getting the result.

An alternative may be a data step solution something like this (probably a lot faster):

data time;
input subj col   time_n;
cards;
2      1    1564623266038
2      2    1564623266044
2      3    1564623266058
2      4    1564623266073
2      5    1564623266090
2      6    1564623266106
;run;

data onloadtime;
  input subj trial onloadtime_n;
cards;
2      1    1564623265779
;run;

Data v_interval/view=v_interval;
  set time;
  by subj;
  if first.subj then do;
    next_time=time_n;
      time_n=-1E38;
      output;
      time_n=next_time;
      end;
  if not last.subj then do;
    _N_=_N_+1;
    set time(keep=time_n rename=(time_n=next_time)) point=_N_;
    end;
  else next_time=1E38;
  output;
run;



data want;
  set v_interval(rename=(time_n=onloadtime_n))  onloadtime(in=load);
  by subj onloadtime_n;
  if load then do;
    if onloadtime_n-_last_time<_next_time-onloadtime_n then
      time_n=_last_time;
    else 
      time_n=_next_time;
´   output;
    end;
  else do;
    _next_time=next_time;
    _last_time=onloadtime_n;
    _col=col;
    end;
  retain _:;
  keep subj time_n onloadtime_n trial _col;
  rename _col=col;
run;

Both tables have til be sorted by SUBJ and their time variable for this to work. The datastep view creates a table of time intervals using SET with POINT= for look-ahead. The values -1E38 and 1E38 are supposed to be "infinitely" lower resp. higher than any actual time values in the data.

Then it is just a matter of setting the view and the load-table in order, and assigning the right value to the TIME_N variable.

View solution in original post

s_lassen · Posted 11-18-2019 08:55 AM

You can, of course, write an elegant solution in SQL, submit it, and wait a couple of days before getting the result.

An alternative may be a data step solution something like this (probably a lot faster):

data time;
input subj col   time_n;
cards;
2      1    1564623266038
2      2    1564623266044
2      3    1564623266058
2      4    1564623266073
2      5    1564623266090
2      6    1564623266106
;run;

data onloadtime;
  input subj trial onloadtime_n;
cards;
2      1    1564623265779
;run;

Data v_interval/view=v_interval;
  set time;
  by subj;
  if first.subj then do;
    next_time=time_n;
      time_n=-1E38;
      output;
      time_n=next_time;
      end;
  if not last.subj then do;
    _N_=_N_+1;
    set time(keep=time_n rename=(time_n=next_time)) point=_N_;
    end;
  else next_time=1E38;
  output;
run;



data want;
  set v_interval(rename=(time_n=onloadtime_n))  onloadtime(in=load);
  by subj onloadtime_n;
  if load then do;
    if onloadtime_n-_last_time<_next_time-onloadtime_n then
      time_n=_last_time;
    else 
      time_n=_next_time;
´   output;
    end;
  else do;
    _next_time=next_time;
    _last_time=onloadtime_n;
    _col=col;
    end;
  retain _:;
  keep subj time_n onloadtime_n trial _col;
  rename _col=col;
run;

Both tables have til be sorted by SUBJ and their time variable for this to work. The datastep view creates a table of time intervals using SET with POINT= for look-ahead. The values -1E38 and 1E38 are supposed to be "infinitely" lower resp. higher than any actual time values in the data.

Then it is just a matter of setting the view and the load-table in order, and assigning the right value to the TIME_N variable.

davidsmarch · Posted 11-21-2019 09:36 AM

That's fantastic. Super quick process, thanks for the solution!

I'm trying to understand how it works and my ability there is limited. I follow about half of what's going on.

Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Catch up on SAS Innovate 2026

Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away