Re: Count backward and distribute variable

dsm · Posted 05-14-2018 07:25 PM

I am having a hell of a time with this dataset.

What I have is data that looks like the below. The first column is time in minute values, the second column is skin conductance levels. This is a continuous stream of data collected at 250 samples a second, and critical trials (i.e., when an image was presented) are spread out, so much of the data is useless. Each participants file ends up being over a million lines long, so things in SAS go slow.

The next 4 columns are event markers that indicate when a trial occurred for 4 trials types corresponding to each column. As the software records only voltage levels (i.e., a 0 if no trial maker is occurring, or the 5 you see if a trial marker is occurring). I've converted that 5 into a string that indicates the trial type. Thr_trial is a count of those trial types so I know the order of trials within each type of trial. There will be 12 of these in each trial type, but so far I've only created it for thr_trial to simplify things. char_thr_trial is a string of the first thr_trial so that I could concatenate these two variables in the last column thr_trial_count. I'll do the same thing for each other trial type. I'm sure this is all not very eloquent and I'm open to any suggestions on this part of things.

What I need to do next is :

1. The data I need for each trial encompasses both the 1000 rows before and the 1000 rows after each marker, and I need to label those rows -1000 through 1000. A new variable indicating this information would be necessary.

2. The 1 which indicates the trial order number needs to be distributed to those 2000 rows. The same thing will need to occur for the next trial, which is labeled 2, and so on.

I've attached this truncated dataset in case that is any help.

Any and all help is very much appreciated!

Patrick · Posted 05-14-2018 08:47 PM

@dsm

Below code logic assumes that you've got always at least 1000 rows before and after your indicator record for the same subject and that indicator records are always at least 2000 rows apart from each other. If you can't rely on that then you'll have to add some additional logic.

data want;
  length thr_trial_row_cnt 8;
  if 0 then set have;

  set have(keep=time subj trial_type_thr thr_trial char_thr_trial thr_trial_count );
  by subj time;

  if not missing(thr_trial) then
    do;
      thr_trial_row_cnt=-1000;
      do i=_n_-1000 to _n_+1000 ;
        set have(drop=trial_type_thr thr_trial char_thr_trial thr_trial_count) point=i nobs=nobs;
        output;
        thr_trial_row_cnt+1;
      end;
    end;

run;

As for any additional variable you want to create: I'd do this on the result data set (sub-set of rows) and not on your source.

You could create such additional variables in the same data step where you select the +-1000 rows (where possible).

PGStats · Posted 05-15-2018 12:35 AM

I would also favour random access. You can avoid collisions with a bit of logic:

data want;
retain lastEvent 0;
set have nobs=nobs;
if not missing(trial_type_thr) then do;
    readFrom = max(lastEvent+1, _n_-1000);
    readTo = min(nobs, _n_+1000);
    do i = readFrom to readTo until(i > _n_ and not missing(trial_type_thr));
        set have(drop=thr_trial) point=i;
        order = i - _n_;
        if i <= _n_ or missing(trial_type_thr) then output;
        end;
    lastEvent = _n_;
    end;
drop i lastEvent readFrom readTo;
run;

PG

dsm · Posted 05-15-2018 11:35 AM

That seems to be working wonderfully! Thanks so much!! I know it's asking a lot, but would you mind explaining some of what is actually going on in the code so that I can learn something from this and not just take the code and run with it?

PGStats · Posted 05-15-2018 02:37 PM

I don't know where to start. The key in this small algorithm is the use of two set statements. One reads the dataset sequentially, the other uses random access ( with the point= option) to read in the desired window. The rest of the code is mostly about setting the limits of that window.

PG

dsm · Posted 05-15-2018 04:56 PM

I mostly ask because there is one more step that I need to do, and I can't figure out in your code how to accomplish it. You managed to distribute the repeating "thr_trial" forward and backward. The count also worked great. I realized I only needed -500 to +500 and adjusted accordingly. After a few more steps, the data now look like the below. The new variable, "Order" is a count of the total number of trials that ranges from 1-64, and tells me what order everything occurred in. Given that the length of each trial is the 1000 row (500 prior and after 0), I need to distribute that number along the length of each trial. Is this possible?

Patrick · Posted 05-15-2018 06:24 PM

@dsm

If I understand your requirement right wouldn't below (untested) amendment to @PGStats's code do?

data want;
  retain lastEvent 0;
  set have nobs=nobs;
  by subj time;
  if first.subje then order=0;
  if not missing(trial_type_thr) then do;
      readFrom = max(lastEvent+1, _n_-1000);
      readTo = min(nobs, _n_+1000);
      order+1;
      do i = readFrom to readTo until(i > _n_ and not missing(trial_type_thr));
          set have(drop=thr_trial) point=i;
          order = i - _n_;
          if i <= _n_ or missing(trial_type_thr) then output;
          end;
      lastEvent = _n_;
      end;
  drop i lastEvent readFrom readTo;
run;

mkeintz · Posted 05-15-2018 10:51 PM

There is a relatively straightforward sequential access approach as well.

data vneed / view=vneed;
  set have (keep=trial_type_thr);
  if not(missing(trial_type_thr));
  drop trial_type_thr;
  order+1;
  _start=_n_-500;
  _end=_n_+500;
run; 

data want (drop=_:);
  set have;
  if _n_ > _end then set vneed;
  if _n_ < _start then delete;
  else if _n_=_start then offset=-500;
  else offset+1;
run;

This also assumes that no two trial_type_thr have fewer than 1,000 records between them.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 05-15-2018 11:32 PM

This one is also based on sequential access, but is more compact:

data want2;
  merge have 
        have (firstobs=501 keep=trial_type_thr 
               rename=(trial_type_thr=ttt));

  retain offset 1e6;
  offset+1;


  if not(missing(ttt)) then do;
    order+1;
    offset=-500;
  end;

  if -500<=offset<=500;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

dsm · Posted 05-16-2018 04:46 PM

Thanks! Seems to be working great. I'm running into a hang-up when I try to do this with more than one subject (a variable called "subj", which has values such as 1, 2...100, 101. My data has many subjects, and so far I've been trying to get the process down for just one subject before scaling up to many subjects.

I can't figure out where to put the "by subj" statement such that the order count resets per participant. For one participant it correctly counts order up to 64, which is the total number of trials per subject. But after that, the next subject's order picks up at 65 instead of starting over at 1 and going to 64 again. Is there a way to do this by subject?

PGStats · Posted 05-16-2018 05:13 PM

Your initial question mentioned that you had one subject per dataset. Ths would handle multiple subjets per dataset:


data want;
retain lastEvent 0;
set have nobs=nobs;
by subj;
if first.subj then do;
    trial = 0;
    lastEvent = _n_ - 1;
    end;
if not missing(trial_type_thr) then do;
    readFrom = max(lastEvent+1, _n_-500);
    readTo = min(nobs, _n_+500);
    trial + 1;
    do i = readFrom to readTo 
      until(i > _n_ and not missing(trial_type_thr));
        set have(drop=thr_trial rename=subj=newSubj) point=i;
        order = i - _n_;
        if subj=newSubj and (i <= _n_ or missing(trial_type_thr)) then output;
        end;
    lastEvent = _n_;
    end;
drop i lastEvent readFrom readTo newSubj;
run;

PG

PGStats · Posted 05-15-2018 11:31 PM

You need an extra statement:

data want;
retain lastEvent 0;
set have nobs=nobs;
if not missing(trial_type_thr) then do;
    readFrom = max(lastEvent+1, _n_-500);
    readTo = min(nobs, _n_+500);
    trial + 1;
    do i = readFrom to readTo until(i > _n_ and not missing(trial_type_thr));
        set have(drop=thr_trial) point=i;
        order = i - _n_;
        if i <= _n_ or missing(trial_type_thr) then output;
        end;
    lastEvent = _n_;
    end;
drop i lastEvent readFrom readTo;
run;

PG

Classroom Training Available!