I have two datasets. One dataset (time.dat) is very long (millions of rows) and contains three variables: subject, column, and time_n in this format:
subj col time_n
2 1 1564623266038
2 2 1564623266044
2 3 1564623266058
2 4 1564623266073
2 5 1564623266090
2 6 1564623266106
.
.
.
Each subject has 1000's of columns that each have a unique time_n. There are also many subjects.
The second set (onloadtime.dat) has all of the same subjects and three variables, subject, trial, and onloadtime_n in this format:
subj trial onloadtime_n
2 1 1564623265779
2 2 1564623272356
2 3 1564623275373
2 4 1564623278728
2 5 1564623282137
Each subject only has 180 onloadtime's. What I need to do is find the closest time_n to each onloadtime_n and to indicate which trial corresponds to that time. Such that the resultant data looks like:
subj col time_n onloadtime_n trial
2 1 1564623266038 1564623265779 1
2 2 1564623266044
2 3 1564623266058
2 4 1564623266073
2 5 1564623266090
2 6 1564623266106
.
.
.
2 450 1564623272324 1564623272356 2
And to repeat this per subject. Essentially I need to know what trial in the shorter set (onloadtime) corresponds to what closest column in the long set (time), and to the indicate what time trial that actually is. I hope that makes sense. I'd love any support and would be happy to answer any questions.
You can, of course, write an elegant solution in SQL, submit it, and wait a couple of days before getting the result.
An alternative may be a data step solution something like this (probably a lot faster):
data time;
input subj col time_n;
cards;
2 1 1564623266038
2 2 1564623266044
2 3 1564623266058
2 4 1564623266073
2 5 1564623266090
2 6 1564623266106
;run;
data onloadtime;
input subj trial onloadtime_n;
cards;
2 1 1564623265779
;run;
Data v_interval/view=v_interval;
set time;
by subj;
if first.subj then do;
next_time=time_n;
time_n=-1E38;
output;
time_n=next_time;
end;
if not last.subj then do;
_N_=_N_+1;
set time(keep=time_n rename=(time_n=next_time)) point=_N_;
end;
else next_time=1E38;
output;
run;
data want;
set v_interval(rename=(time_n=onloadtime_n)) onloadtime(in=load);
by subj onloadtime_n;
if load then do;
if onloadtime_n-_last_time<_next_time-onloadtime_n then
time_n=_last_time;
else
time_n=_next_time;
´ output;
end;
else do;
_next_time=next_time;
_last_time=onloadtime_n;
_col=col;
end;
retain _:;
keep subj time_n onloadtime_n trial _col;
rename _col=col;
run;
Both tables have til be sorted by SUBJ and their time variable for this to work. The datastep view creates a table of time intervals using SET with POINT= for look-ahead. The values -1E38 and 1E38 are supposed to be "infinitely" lower resp. higher than any actual time values in the data.
Then it is just a matter of setting the view and the load-table in order, and assigning the right value to the TIME_N variable.
You can, of course, write an elegant solution in SQL, submit it, and wait a couple of days before getting the result.
An alternative may be a data step solution something like this (probably a lot faster):
data time;
input subj col time_n;
cards;
2 1 1564623266038
2 2 1564623266044
2 3 1564623266058
2 4 1564623266073
2 5 1564623266090
2 6 1564623266106
;run;
data onloadtime;
input subj trial onloadtime_n;
cards;
2 1 1564623265779
;run;
Data v_interval/view=v_interval;
set time;
by subj;
if first.subj then do;
next_time=time_n;
time_n=-1E38;
output;
time_n=next_time;
end;
if not last.subj then do;
_N_=_N_+1;
set time(keep=time_n rename=(time_n=next_time)) point=_N_;
end;
else next_time=1E38;
output;
run;
data want;
set v_interval(rename=(time_n=onloadtime_n)) onloadtime(in=load);
by subj onloadtime_n;
if load then do;
if onloadtime_n-_last_time<_next_time-onloadtime_n then
time_n=_last_time;
else
time_n=_next_time;
´ output;
end;
else do;
_next_time=next_time;
_last_time=onloadtime_n;
_col=col;
end;
retain _:;
keep subj time_n onloadtime_n trial _col;
rename _col=col;
run;
Both tables have til be sorted by SUBJ and their time variable for this to work. The datastep view creates a table of time intervals using SET with POINT= for look-ahead. The values -1E38 and 1E38 are supposed to be "infinitely" lower resp. higher than any actual time values in the data.
Then it is just a matter of setting the view and the load-table in order, and assigning the right value to the TIME_N variable.
That's fantastic. Super quick process, thanks for the solution!
I'm trying to understand how it works and my ability there is limited. I follow about half of what's going on.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.