SAS Programming

davidsmarch · Posted 11-17-2019 09:39 PM

I saw some previous solutions that don't quite fit, so I though I'd try to see if I could get some help.

I have two datasets. One dataset (time.dat) is very long (millions of rows) and contains three variables: subject, column, and time_n in this format:

subj col time_n
2 1 1564623266038
2 2 1564623266044
2 3 1564623266058
2 4 1564623266073
2 5 1564623266090
2 6 1564623266106

.

Each subject has 1000's of columns that each have a unique time_n. There are also many subjects.

The second set (onloadtime.dat) has all of the same subjects and three variables, subject, trial, and onloadtime_n in this format:

subj trial onloadtime_n
2 1 1564623265779
2 2 1564623272356
2 3 1564623275373
2 4 1564623278728

2 5 1564623282137

Each subject only has 180 onloadtime's. What I need to do is find the closest time_n to each onloadtime_n and to indicate which trial corresponds to that time. Such that the resultant data looks like:

subj col time_n onloadtime_n trial
2 1 1564623266038 1564623265779 1
2 2 1564623266044
2 3 1564623266058
2 4 1564623266073
2 5 1564623266090
2 6 1564623266106

.

2 450 1564623272324 1564623272356 2

And to repeat this per subject. Essentially I need to know what trial in the shorter set (onloadtime) corresponds to what closest column in the long set (time), and to the indicate what time trial that actually is. I hope that makes sense. I'd love any support and would be happy to answer any questions.

s_lassen · Posted 11-18-2019 08:55 AM

You can, of course, write an elegant solution in SQL, submit it, and wait a couple of days before getting the result.

An alternative may be a data step solution something like this (probably a lot faster):

data time;
input subj col   time_n;
cards;
2      1    1564623266038
2      2    1564623266044
2      3    1564623266058
2      4    1564623266073
2      5    1564623266090
2      6    1564623266106
;run;

data onloadtime;
  input subj trial onloadtime_n;
cards;
2      1    1564623265779
;run;

Data v_interval/view=v_interval;
  set time;
  by subj;
  if first.subj then do;
    next_time=time_n;
      time_n=-1E38;
      output;
      time_n=next_time;
      end;
  if not last.subj then do;
    _N_=_N_+1;
    set time(keep=time_n rename=(time_n=next_time)) point=_N_;
    end;
  else next_time=1E38;
  output;
run;



data want;
  set v_interval(rename=(time_n=onloadtime_n))  onloadtime(in=load);
  by subj onloadtime_n;
  if load then do;
    if onloadtime_n-_last_time<_next_time-onloadtime_n then
      time_n=_last_time;
    else 
      time_n=_next_time;
´   output;
    end;
  else do;
    _next_time=next_time;
    _last_time=onloadtime_n;
    _col=col;
    end;
  retain _:;
  keep subj time_n onloadtime_n trial _col;
  rename _col=col;
run;

Both tables have til be sorted by SUBJ and their time variable for this to work. The datastep view creates a table of time intervals using SET with POINT= for look-ahead. The values -1E38 and 1E38 are supposed to be "infinitely" lower resp. higher than any actual time values in the data.

Then it is just a matter of setting the view and the load-table in order, and assigning the right value to the TIME_N variable.

View solution in original post

s_lassen · Posted 11-18-2019 08:55 AM

You can, of course, write an elegant solution in SQL, submit it, and wait a couple of days before getting the result.

An alternative may be a data step solution something like this (probably a lot faster):

data time;
input subj col   time_n;
cards;
2      1    1564623266038
2      2    1564623266044
2      3    1564623266058
2      4    1564623266073
2      5    1564623266090
2      6    1564623266106
;run;

data onloadtime;
  input subj trial onloadtime_n;
cards;
2      1    1564623265779
;run;

Data v_interval/view=v_interval;
  set time;
  by subj;
  if first.subj then do;
    next_time=time_n;
      time_n=-1E38;
      output;
      time_n=next_time;
      end;
  if not last.subj then do;
    _N_=_N_+1;
    set time(keep=time_n rename=(time_n=next_time)) point=_N_;
    end;
  else next_time=1E38;
  output;
run;



data want;
  set v_interval(rename=(time_n=onloadtime_n))  onloadtime(in=load);
  by subj onloadtime_n;
  if load then do;
    if onloadtime_n-_last_time<_next_time-onloadtime_n then
      time_n=_last_time;
    else 
      time_n=_next_time;
´   output;
    end;
  else do;
    _next_time=next_time;
    _last_time=onloadtime_n;
    _col=col;
    end;
  retain _:;
  keep subj time_n onloadtime_n trial _col;
  rename _col=col;
run;

Both tables have til be sorted by SUBJ and their time variable for this to work. The datastep view creates a table of time intervals using SET with POINT= for look-ahead. The values -1E38 and 1E38 are supposed to be "infinitely" lower resp. higher than any actual time values in the data.

Then it is just a matter of setting the view and the load-table in order, and assigning the right value to the TIME_N variable.

davidsmarch · Posted 11-21-2019 09:36 AM

That's fantastic. Super quick process, thanks for the solution!

I'm trying to understand how it works and my ability there is limited. I follow about half of what's going on.

SAS Programming

Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Re: Finding the closest value from two datasets

Finding common value from two datasets

Finding the duplicate values between two datasets

ModelOps, MLOps, and Finding Value in Analytics

Array to find closest data

Find duplicated values in SAS EG

Follow Us

What is...

SAS Programming

Join us for our biggest event of the year!

SAS Training: Just a Click Away

Follow Us

What is...