BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Calcite | Level 5

Finding the closest value from two datasets

I saw some previous solutions that don't quite fit, so I though I'd try to see if I could get some help.

I have two datasets. One dataset (time.dat) is very long (millions of rows) and contains three variables: subject, column, and time_n in this format:

subj col   time_n
2      1    1564623266038
2      2    1564623266044
2      3    1564623266058
2      4    1564623266073
2      5    1564623266090
2      6    1564623266106

.

.

.

Each subject has 1000's of columns that each have a unique time_n. There are also many subjects.

The second set (onloadtime.dat) has all of the same subjects and three variables, subject, trial, and onloadtime_n in this format:

subj trial onloadtime_n
2      1    1564623265779
2      2    1564623272356
2      3    1564623275373
2      4    1564623278728

2      5    1564623282137

Each subject only has 180 onloadtime's. What I need to do is find the closest time_n to each onloadtime_n and to indicate which trial corresponds to that time. Such that the resultant data looks like:

subj col    time_n                 onloadtime_n       trial
2      1   1564623266038    1564623265779     1
2      2   1564623266044
2      3   1564623266058
2      4   1564623266073
2      5   1564623266090
2      6   1564623266106

.

.

.

2    450  1564623272324    1564623272356   2

And to repeat this per subject. Essentially I need to know what trial in the shorter set (onloadtime) corresponds to what closest column in the long set (time), and to the indicate what time trial that actually is. I hope that makes sense. I'd love any support and would be happy to answer any questions.

1 ACCEPTED SOLUTION

Accepted Solutions
Meteorite | Level 14

Re: Finding the closest value from two datasets

You can, of course, write an elegant solution in SQL, submit it, and wait a couple of days before getting the result.

An alternative may be a data step solution something like this (probably a lot faster):

``````data time;
input subj col   time_n;
cards;
2      1    1564623266038
2      2    1564623266044
2      3    1564623266058
2      4    1564623266073
2      5    1564623266090
2      6    1564623266106
;run;

data onloadtime;
input subj trial onloadtime_n;
cards;
2      1    1564623265779
;run;

Data v_interval/view=v_interval;
set time;
by subj;
if first.subj then do;
next_time=time_n;
time_n=-1E38;
output;
time_n=next_time;
end;
if not last.subj then do;
_N_=_N_+1;
set time(keep=time_n rename=(time_n=next_time)) point=_N_;
end;
else next_time=1E38;
output;
run;

data want;
set v_interval(rename=(time_n=onloadtime_n))  onloadtime(in=load);
by subj onloadtime_n;
if load then do;
if onloadtime_n-_last_time<_next_time-onloadtime_n then
time_n=_last_time;
else
time_n=_next_time;
´   output;
end;
else do;
_next_time=next_time;
_last_time=onloadtime_n;
_col=col;
end;
retain _:;
keep subj time_n onloadtime_n trial _col;
rename _col=col;
run;``````

Both tables have til be sorted by SUBJ and their time variable for this to work. The datastep view creates a table of time intervals using SET with POINT= for look-ahead. The values -1E38 and 1E38 are supposed to be "infinitely" lower resp. higher than any actual time values in the data.

Then it is just a matter of setting the view and the load-table in order, and assigning the right value to the TIME_N variable.

2 REPLIES 2
Meteorite | Level 14

Re: Finding the closest value from two datasets

You can, of course, write an elegant solution in SQL, submit it, and wait a couple of days before getting the result.

An alternative may be a data step solution something like this (probably a lot faster):

``````data time;
input subj col   time_n;
cards;
2      1    1564623266038
2      2    1564623266044
2      3    1564623266058
2      4    1564623266073
2      5    1564623266090
2      6    1564623266106
;run;

data onloadtime;
input subj trial onloadtime_n;
cards;
2      1    1564623265779
;run;

Data v_interval/view=v_interval;
set time;
by subj;
if first.subj then do;
next_time=time_n;
time_n=-1E38;
output;
time_n=next_time;
end;
if not last.subj then do;
_N_=_N_+1;
set time(keep=time_n rename=(time_n=next_time)) point=_N_;
end;
else next_time=1E38;
output;
run;

data want;
set v_interval(rename=(time_n=onloadtime_n))  onloadtime(in=load);
by subj onloadtime_n;
if load then do;
if onloadtime_n-_last_time<_next_time-onloadtime_n then
time_n=_last_time;
else
time_n=_next_time;
´   output;
end;
else do;
_next_time=next_time;
_last_time=onloadtime_n;
_col=col;
end;
retain _:;
keep subj time_n onloadtime_n trial _col;
rename _col=col;
run;``````

Both tables have til be sorted by SUBJ and their time variable for this to work. The datastep view creates a table of time intervals using SET with POINT= for look-ahead. The values -1E38 and 1E38 are supposed to be "infinitely" lower resp. higher than any actual time values in the data.

Then it is just a matter of setting the view and the load-table in order, and assigning the right value to the TIME_N variable.

Calcite | Level 5

Re: Finding the closest value from two datasets

That's fantastic. Super quick process, thanks for the solution!

I'm trying to understand how it works and my ability there is limited. I follow about half of what's going on.

Discussion stats
• 2 replies
• 535 views
• 0 likes
• 2 in conversation