Solved: Re: Creating New Observations Via Multiple Output Statements for One C...

solfay243 · Posted 11-05-2019 08:25 PM

Hi all,

I've got some crude data shown here:

data have; 
input cond1 cond2 start end;
datalines; 
1 0 4075 4965
1 0 5025 5763
0 1 4333 5322
; 
run;

That I would like to process into the results shown here:

data want; 
input cond1 cond2 start end; 
datalines; 
1 0 4075 4333 
1 1 4333 4965
0 1 4965 5025
1 1 5025 5322
1 0 5322 5763 
; 
run;

I was wondering if I could use multiple output statements in 1 conditional (if-then-do) in order to create such an output. Otherwise I'm happy to hear other ideas about how to approach this. Thus far, I've tried using retain statements and sorting the data multiple ways in order to collapse observations and re-write values of cond1 or cond2, but it seems I'll have to add observations as well. The dataset is quite large and there are more conditions (cond#) that I'll be progressively adding to the dataset one at a time, but I'm just trying to get a feel for something I may not have thought of yet.

Many thanks!

FreelanceReinh · Posted 11-06-2019 06:45 AM

Hi @solfay243,

If the start and end values are discrete, e.g. integers (as in your example), you can define a temporary array (in a DATA step) whose index range is large enough to cover all start and end values. Example:

array _d[0:9999] _temporary_ (10000*0);

Then you go through the HAVE dataset (SET statement) and for each observation update the array entries between start and end in a DO loop

do _i=start to end;

so that after processing the last observation the array entry _d[_i] represents the combination of conditions met for value _i. For up to 52 conditions this combination could be captured in an integer whose binary digits indicate whether the corresponding condition is met (1) or not (0). The details of the "update" process depend on characteristics of the HAVE dataset: In your sample data only one condition is met (i.e., has value 1) in each observation and the start-end intervals for the same condition do not overlap. This may or may not be the case in your real data.

After the last observation of HAVE has been processed (use the END= option of the SET statement to detect this), you go through the array and take appropriate actions (e.g., set start, end and cond1, cond2, ... values, write an observation to dataset WANT) whenever the array value changes:

do _i=1 to dim(_d)-1;
  if _d[_i]~=_d[_i-1] then do;
    ...
  end;
end;

Again, the details ("...") depend on the specifications: In your sample WANT dataset, for instance, 4333 and 5763 are end values of intervals with cond1=1 & cond2=0 although dataset HAVE suggests that cond2 should be 1 for 4333. Also, your example doesn't show what should happen if there were gaps where none of the conditions is met (i.e., whether or not these should occur in dataset WANT).

If the array entries are "binary" integers as described above, you'll set variable condk (k=1, 2, ...) to the corresponding binary digit of the array value (e.g., use the BINARYw. format) and write an observation to dataset WANT (using the OUTPUT statement) once the end value has been determined.

View solution in original post

Patrick · Posted 11-05-2019 08:37 PM

You probably need to explain to us the logic for how you get from have to want.

Shmuel · Posted 11-05-2019 08:44 PM

It is absolutely not clear by what rules you want to create the results from your input dataset.

Please explain the logic of process.

FreelanceReinh · Posted 11-06-2019 06:45 AM

Hi @solfay243,

If the start and end values are discrete, e.g. integers (as in your example), you can define a temporary array (in a DATA step) whose index range is large enough to cover all start and end values. Example:

array _d[0:9999] _temporary_ (10000*0);

Then you go through the HAVE dataset (SET statement) and for each observation update the array entries between start and end in a DO loop

do _i=start to end;

so that after processing the last observation the array entry _d[_i] represents the combination of conditions met for value _i. For up to 52 conditions this combination could be captured in an integer whose binary digits indicate whether the corresponding condition is met (1) or not (0). The details of the "update" process depend on characteristics of the HAVE dataset: In your sample data only one condition is met (i.e., has value 1) in each observation and the start-end intervals for the same condition do not overlap. This may or may not be the case in your real data.

After the last observation of HAVE has been processed (use the END= option of the SET statement to detect this), you go through the array and take appropriate actions (e.g., set start, end and cond1, cond2, ... values, write an observation to dataset WANT) whenever the array value changes:

do _i=1 to dim(_d)-1;
  if _d[_i]~=_d[_i-1] then do;
    ...
  end;
end;

Again, the details ("...") depend on the specifications: In your sample WANT dataset, for instance, 4333 and 5763 are end values of intervals with cond1=1 & cond2=0 although dataset HAVE suggests that cond2 should be 1 for 4333. Also, your example doesn't show what should happen if there were gaps where none of the conditions is met (i.e., whether or not these should occur in dataset WANT).

If the array entries are "binary" integers as described above, you'll set variable condk (k=1, 2, ...) to the corresponding binary digit of the array value (e.g., use the BINARYw. format) and write an observation to dataset WANT (using the OUTPUT statement) once the end value has been determined.

solfay243 · Posted 11-06-2019 12:10 PM

@FreelanceReinhard I think you've outlined a possible solution in much more detail than I expected. Thank you for your attention to this problem!

solfay243 · Posted 11-06-2019 11:31 AM

Sorry all,

Let me explain further -

From Data Have to Data Want I want to review the time period (start/end) of when the two conditions overlap and when the two conditions do not overlap and then create observations that reflect those distinct time periods.

So if I were strictly looking at the first record of cond1=1 and the first record of cond2=1 then I would produce the following:

data overlap;
input cond1 cond2 start end;
datalines;
1 0 4075 4333
1 1 4333 4965
0 1 4965 5322
;
run;

However, as there is a second record of cond1=1 then I would also have to take that start/end into consideration, which would result in Data Want. All values are discrete as shown.

FreelanceReinh · Posted 11-06-2019 12:01 PM

Here's a full example of what I outlined in my previous post:

%let nc=2; /* number of conditions */

data test(drop=_:);
array _d[0:9999] _temporary_ (10000*0);
set have end=last;
array cond[&nc];
_c=2**(&nc-whichn(1, of cond[*]));
do _i=start to end;
  _d[_i]+_c;
end;
if last;
do _i=1 to dim(_d)-1;
  if _d[_i]~=_d[_i-1] then do;
    if _d[_i-1] then do;
      end=_i-1;
      _b=put(_d[end],binary&nc..);
      do _c=1 to &nc;
        cond[_c]=input(char(_b,_c),1.);
      end;
      output;
    end;
    start=_i;
  end;
end;
run;

This might not be the final solution because it makes several assumptions (mentioned in my previous post) and it produces a different (but consistent) output dataset from your sample dataset HAVE. (That's why I called it TEST, not WANT.) The table below shows the array values _d[_i] around the "change points" where a combination of conditions starts or ends.

                start         end  start              end  start    end  start              end  start         end 
_i     ... 4074 4075 4076 ... 4332 4333 4334 ... 4964 4965 4966 ... 5024 5025 5026 ... 5321 5322 5323 ... 5762 5763 5764 ...
cond1  ...   0    1    1  ...   1    1    1  ...   1    1    0  ...   0    1    1  ...   1    1    1  ...   1    1    0  ...
cond2  ...   0    0    0  ...   0    1    1  ...   1    1    1  ...   1    1    1  ...   1    1    0  ...   0    0    0  ...      
_d[_i] ...   0    2    2  ...   2    3    3  ...   3    3    1  ...   1    3    3  ...   3    3    2  ...   2    2    0  ...

Here's PROC PRINT output of the resulting dataset TEST:

cond1    cond2    start     end

  1        0       4075    4332
  1        1       4333    4965
  0        1       4966    5024
  1        1       5025    5322
  1        0       5323    5763

So, in this dataset "start" ("end") always means the first (last) value where the combination of conditions is met. [Edit: ...and this is also how I interpreted the start/end values in dataset HAVE.] There is no ambiguity as to which interval a value (e.g. at a change point) belongs to and intervals of length 1 (i.e. start=end) would be no problem. The DATA step works for more than two conditions without any change (up to 52, but I've only tested &nc=3 yet).

Creating New Observations Via Multiple Output Statements for One Conditional

Re: Creating New Observations Via Multiple Output Statements for One Conditional

Re: Creating New Observations Via Multiple Output Statements for One Conditional

Re: Creating New Observations Via Multiple Output Statements for One Conditional

Re: Creating New Observations Via Multiple Output Statements for One Conditional

Re: Creating New Observations Via Multiple Output Statements for One Conditional

Re: Creating New Observations Via Multiple Output Statements for One Conditional

Re: Creating New Observations Via Multiple Output Statements for One Conditional

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!