finans_sas
Quartz | Level 8

Hello Everyone:

I have a dataset similar to the one that I provided below:

obs_date event value
1/1/2020 0 0.003918
1/2/2020 0 0.294439
1/4/2020 1 0.048467
1/5/2020 0 0.718177
1/6/2020 0 0.001364
1/7/2020 0 0.440972
1/8/2020 0 0.559622
1/9/2020 0 0.821105
1/10/2020 1 0.332539
1/11/2020 0 0.289809
1/12/2020 0 0.911323
1/18/2020 0 0.896737
1/19/2020 0 0.333354
1/20/2020 0 0.19605
1/21/2020 0 0.606891
1/22/2020 1 0.439834
1/23/2020 0 0.896842
1/24/2020 0 0.147459

.....

My objective is to retain the next 180 observations of "value" every time event=1 (days do not have to be consecutive). At the end, I would like to have the following dataset:

obs_date eventday value day_count
1/4/2020 1/4/2020 0.048467 0
1/5/2020 1/4/2020 0.718177 1
1/6/2020 1/4/2020 0.001364 2
1/7/2020 1/4/2020 0.440972 3
1/8/2020 1/4/2020 0.559622 4
1/9/2020 1/4/2020 0.821105 5
1/10/2020 1/4/2020 0.332539 6
1/11/2020 1/4/2020 0.289809 7
1/12/2020 1/4/2020 0.911323 8
1/18/2020 1/4/2020 0.896737 9
1/19/2020 1/4/2020 0.333354 10
1/4/2020 ....
1/4/2020 .... 180
1/10/2020 1/10/2020 0.332539 0
1/11/2020 1/10/2020 0.289809 1
1/12/2020 1/10/2020 0.911323 2
1/18/2020 1/10/2020 0.896737 3
1/19/2020 1/10/2020 0.333354 4
1/20/2020 1/10/2020 0.19605 5
1/21/2020 1/10/2020 0.606891 6
1/22/2020 1/10/2020 0.439834 7
1/23/2020 1/10/2020 0.896842 8
1/24/2020 1/10/2020 0.147459 9
1/25/2020 1/10/2020 0.923238 10
1/10/2020 ...
1/10/2020 ... 180

 

I could not figure out how to do this in a data step. I would love to get your suggestions. Thank you so much in advance for your help.

 

ballardw
Super User

One way, maybe. The first step is to add something that can be used to determine the observation-number difference.

 

Please note the use of a data step to provide example data. Please do so in the future.

data have;
   input obs_date :mmddyy10. 	event 	value;
   format obs_date mmddyy10.;
datalines;
1/1/2020 	0 	0.003918
1/2/2020 	0 	0.294439
1/4/2020 	1 	0.048467
1/5/2020 	0 	0.718177
1/6/2020 	0 	0.001364
1/7/2020 	0 	0.440972
1/8/2020 	0 	0.559622
1/9/2020 	0 	0.821105
1/10/2020 	1 	0.332539
1/11/2020 	0 	0.289809
1/12/2020 	0 	0.911323
1/18/2020 	0 	0.896737
1/19/2020 	0 	0.333354
1/20/2020 	0 	0.19605
1/21/2020 	0 	0.606891
1/22/2020 	1 	0.439834
1/23/2020 	0 	0.896842
1/24/2020 	0 	0.147459
;

data need;
   set have;
   obsnum = _n_;
run;

The automatic variable _N_ is provided by SAS as a count of the iterations of a data step. If the data step is simple, as in this case, it also gives the observation's position. This assumes that your existing data set is already in the order you want when imposing the 180-observation rule.
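For illustration, a small sketch of how _N_ stops matching the physical observation number as soon as rows are filtered at read time (CHECK is just a scratch data set name here):

/* With a WHERE= data set option, _N_ counts only the rows that pass the filter, */
/* so it no longer equals the row's position in HAVE.                            */
data check;
   set have (where=(event=0));
   iter = _n_;   /* 1, 2, 3, ... over the filtered rows only */
run;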

 

This uses that data set to combine all of the values of your variable and select the groups. Caution: if you have large data sets, this can take a long time to run.

proc sql;
   create table need2 as
   select a.obs_date as EventDay,a.obsnum as baseobs
          ,b.obs_date,b.value
   from (select * from need where Event=1) as a
        , 
        need as b
   where b.obsnum between a.obsnum and (180+a.obsnum)
   order by a.obsnum,b.obs_date
   ;
quit;

If you really need that daycount variable:

data want;
   set need2;
   by baseobs;
   retain daycount;
   if first.baseobs then daycount=0;
   else daycount+1;
run;

Is 180 a surrogate for 6 months? If so, there are better methods than just counting observations.
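For example, if the window were really calendar-based, one sketch (NEED2_6MO is an assumed output name; INTNX does the date arithmetic) would join on dates instead of observation numbers:

proc sql;
   create table need2_6mo as
   select a.obs_date as EventDay format=mmddyy10.
          ,b.obs_date, b.value
   from (select * from need where Event=1) as a
        ,
        need as b
   where b.obs_date >= a.obs_date
     and b.obs_date <  intnx('month', a.obs_date, 6, 'same')
   order by a.obs_date, b.obs_date
   ;
quit;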

Also, if there are any other variables that go along with this process, such as a group id, then more code will be needed to make sure that the observations are processed within each group only.
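For instance, with a hypothetical group variable SECID and the data already sorted by it, the observation counter could be restarted within each group (sketch only):

data need;
   set have;
   by secid;
   if first.secid then obsnum = 0;
   obsnum + 1;   /* sum statement: obsnum is retained and restarts at 1 per group */
run;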

mkeintz
PROC Star

Piggybacking on @ballardw 's creation of the SAS sample data set HAVE, here is a data step solution:

 

data have;
   input obs_date :mmddyy10. 	event 	value;
   format obs_date mmddyy10.;
datalines;
1/1/2020 	0 	0.003918
1/2/2020 	0 	0.294439
1/4/2020 	1 	0.048467
1/5/2020 	0 	0.718177
1/6/2020 	0 	0.001364
1/7/2020 	0 	0.440972
1/8/2020 	0 	0.559622
1/9/2020 	0 	0.821105
1/10/2020 	1 	0.332539
1/11/2020 	0 	0.289809
1/12/2020 	0 	0.911323
1/18/2020 	0 	0.896737
1/19/2020 	0 	0.333354
1/20/2020 	0 	0.19605
1/21/2020 	0 	0.606891
1/22/2020 	1 	0.439834
1/23/2020 	0 	0.896842
1/24/2020 	0 	0.147459
;


data want (drop=event);
  set have (where=(event=1));

  if _n_=1 then do;
    declare hash h (dataset:'have (drop=event)',ordered:'a');
      h.definekey('obs_date');
      h.definedata(all:'Y');
      h.definedone();
    declare hiter hi ('h');
    h.output(dataset:'h');
  end;

  day=0;
  eventday=obs_date;
  format eventday date9.;
  output;
  hi.setcur();  /*Point hash iterator at hash dataitem for event date */
  do day=1 to 180 while (hi.next()=0);
    output;
  end;
run;

If you change the 180 to 10 for this sample data, you'll see that the window stop criterion is applied properly. But note that this program assumes there is never more than one observation per OBS_DATE. The program does NOT require dataset HAVE to be sorted.
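A quick way to test that one-observation-per-date assumption (a sketch that just lists any offending dates) is:

proc sql;
   title 'OBS_DATE values that occur more than once';
   select obs_date format=mmddyy10., count(*) as n_rows
   from have
   group by obs_date
   having count(*) > 1;
quit;
title;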

 

The "trick" here is the WHILE (hi.next()=0) condition, which forces the hash iterator to advance (i.e. to retrieve the next OBS_DATE) at the start of each iteration of the DO DAY=1 ... loop.  If the hi.next() fails (end of data), it returns a non-zero code, stopping the DO loop before the output statement is executed.

 

However, this program will NOT guarantee only complete 180-observation windows (181, actually, counting the event day). Will you be satisfied with the possibility of short windows at the end of the date range? It all depends on how many observations follow the last event=1 case.
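If only complete windows were acceptable, one sketch (WANT_COMPLETE is an assumed output name; it relies on PROC SQL's summary-remerging behavior, which writes a NOTE to the log) would keep only EVENTDAY groups with all 181 rows:

proc sql;
   create table want_complete as
   select *
   from want
   group by eventday
   having count(*) = 181;   /* the event day plus 180 following observations */
quit;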

finans_sas
Quartz | Level 8

Thank you so much @ballardw and @mkeintz for helping me with my inquiry. I really appreciate it. My data set is extremely large, so perhaps the data step solution may work better for me. Please accept my apologies: the data set does have IDs, and I need to ensure that these steps are executed within the same IDs only. I have not worked with hash objects before. How should I go about bringing in the IDs? I attempted it below (the bold line), but ended up getting something different.

 

Also: I opted for 180 days instead of 6 months since I need 180 observations for each event, and a 6-month cutoff would not necessarily produce 180 observations. If some events have fewer than 180 observations (due to data limitations), that is fine. I will first run the estimations with only the events that have 180 observations, and later I will include all of them (even those with fewer than 180 observations).

 

data have;
  input secid obs_date :mmddyy10. event value;
  format obs_date mmddyy10.;
datalines;
1 1/1/2020 0 0.003918297
1 1/2/2020 0 0.294439369
1 1/4/2020 1 0.048466828
1 1/5/2020 0 0.718177034
1 1/6/2020 0 0.001364211
1 1/7/2020 0 0.440971903
1 1/8/2020 0 0.55962246
1 1/9/2020 0 0.821105414
1 1/10/2020 1 0.332539126
1 1/11/2020 0 0.289808782
1 1/12/2020 0 0.911322784
1 1/18/2020 0 0.896737429
1 1/19/2020 0 0.333353538
1 1/20/2020 0 0.19605047
1 1/21/2020 0 0.60689124
1 1/22/2020 1 0.439833751
1 1/23/2020 0 0.896841702
1 1/24/2020 0 0.147458515
1 1/25/2020 0 0.923237859
1 1/26/2020 0 0.394894147
1 1/27/2020 0 0.935664763
1 1/28/2020 0 0.667981258
1 1/29/2020 1 0.872639458
1 1/30/2020 0 0.064826005
1 1/31/2020 0 0.109456422
1 2/1/2020 0 0.386348541
1 2/2/2020 0 0.732321068
1 2/3/2020 0 0.503449099
2 1/1/2020 0 0.003918297
2 1/2/2020 0 0.294439369
2 1/4/2020 1 0.048466828
2 1/5/2020 0 0.718177034
2 1/6/2020 0 0.001364211
2 1/7/2020 0 0.440971903
2 1/8/2020 0 0.55962246
2 1/9/2020 0 0.821105414
2 1/10/2020 1 0.332539126
2 1/11/2020 0 0.289808782
2 1/12/2020 0 0.911322784
2 1/18/2020 0 0.896737429
2 1/19/2020 0 0.333353538
2 1/20/2020 0 0.19605047
2 1/21/2020 0 0.60689124
2 1/22/2020 1 0.439833751
2 1/23/2020 0 0.896841702
2 1/24/2020 0 0.147458515
2 1/25/2020 0 0.923237859
2 1/26/2020 0 0.394894147
2 1/27/2020 0 0.935664763
2 1/28/2020 0 0.667981258
2 1/29/2020 1 0.872639458
2 1/30/2020 0 0.064826005
2 1/31/2020 0 0.109456422
2 2/1/2020 0 0.386348541
2 2/2/2020 0 0.732321068
2 2/3/2020 0 0.503449099
2 2/4/2020 0 0.60689124
2 2/5/2020 0 0.439833751
2 2/6/2020 0 0.896841702
2 2/7/2020 0 0.147458515
2 2/8/2020 0 0.923237859
;


data want (drop=event);
  set have (where=(event=1));
  by secid;   /* <-- the line I added */

  if _n_=1 then do;
    declare hash h (dataset:'have (drop=event)',ordered:'a');
      h.definekey('obs_date');
      h.definedata(all:'Y');
      h.definedone();
    declare hiter hi ('h');
    h.output(dataset:'h');
  end;

  day=0;
  eventday=obs_date;
  format eventday date9.;
  output;
  hi.setcur();  /*Point hash iterator at hash dataitem for event date */
  do day=1 to 180 while (hi.next()=0);
    output;
  end;
run;

 

Patrick
Opal | Level 21

For your case, don't use a hash object if you're dealing with big data volumes.

 

The data step approach below requires the source data to be sorted by secid and obs_date.

I've added three different selection options; pick the one you need and comment out or delete the other two OUTPUT conditions so rows are not written more than once.
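If HAVE is not already in that order, a preliminary sort along these lines (minimal sketch) would be needed first:

proc sort data=have;
   by secid obs_date;
run;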

data want;
  set have;
  by secid obs_date;

  retain day_count _start_dt;
  if first.secid then 
    do;
      day_count=0;
      _start_dt=obs_date;
    end;
  
  if event=0 then day_count+1;

  /* logic for picking the first 180 rows per secid where event=0 */
  if day_count<=180 then output;

  /* logic for picking the first 6 months of rows per secid where event=0 */
  if intck('month',_start_dt,obs_date)<6 then output;

  /* logic for picking the first 6 months of rows per secid where event=0; but never more than 180 rows*/
  if intck('month',_start_dt,obs_date)<6 and day_count<=180 then output;

  drop _start_dt;
run;
finans_sas
Quartz | Level 8

Thank you so much @Patrick for this suggestion. When I ran it, it did not quite produce the dataset I would like to have: day_count starts at the first observation in the sample and runs all the way to the end of the sample instead of restarting at each secid/eventday=1 pair. Should I create an ID that combines secid and event day to accomplish this task?

Thank you so much

Patrick
Opal | Level 21

I thought you asked for selection logic by secid. If that's not the case then the source data for the data want step would need to be sorted by date only.

If you just need a consecutive row_count then just don't reset the counter when a new secid starts (the bit under the if first.secid condition).

 

ballardw
Super User

@Patrick wrote:

I thought you asked for selection logic by secid. If that's not the case then the source data for the data want step would need to be sorted by date only.

If you just need a consecutive row_count then just don't reset the counter when a new secid starts (the bit under the if first.secid condition).

 


Part of the headache is that he wants overlapping output each time an "event" happens. So when an event occurs within the 180-day window, such as in the example below, he wants 180 observations starting on 1/4/2020 and then another sequence of 180 starting on 1/10/2020, and the two will have 175 observation dates in common.

1/4/2020 1 0.048467
1/5/2020 0 0.718177
1/6/2020 0 0.001364
1/7/2020 0 0.440972
1/8/2020 0 0.559622
1/9/2020 0 0.821105
1/10/2020 1 0.332539
Patrick
Opal | Level 21

Thanks @ballardw for pointing out where I missed what the OP asked for.

@finans_sas  Below should work.

data want;
  set have(keep=secid obs_date event rename=(secid=_secid obs_date=_obs_date));
  by _secid _obs_date;
  if event=1;

  day_count=0;

  do _i=_n_ to _nobs;
    set have point=_i  nobs=_nobs;
    if _secid ne secid then leave;

/*    if day_count>179 then leave;*/

    if intck('month',_obs_date,obs_date)>=6 then leave;

    output;
    day_count+1;
  end;
  drop _:;
run;
mkeintz
PROC Star

If your data are sorted by SECID, you can use hash objects for each secid, then clear them to conserve memory for the next SECID.  I say clear them (not clear it), because the program below uses a second hash object - for the event=1 observations:

 


data have;
  input secid obs_date :mmddyy10. event value;
  format obs_date mmddyy10.;
datalines;
1 1/1/2020 0 0.003918297
1 1/2/2020 0 0.294439369
1 1/4/2020 1 0.048466828
1 1/5/2020 0 0.718177034
1 1/6/2020 0 0.001364211
1 1/7/2020 0 0.440971903
1 1/8/2020 0 0.55962246
1 1/9/2020 0 0.821105414
1 1/10/2020 1 0.332539126
1 1/11/2020 0 0.289808782
1 1/12/2020 0 0.911322784
1 1/18/2020 0 0.896737429
1 1/19/2020 0 0.333353538
1 1/20/2020 0 0.19605047
1 1/21/2020 0 0.60689124
1 1/22/2020 1 0.439833751
1 1/23/2020 0 0.896841702
1 1/24/2020 0 0.147458515
1 1/25/2020 0 0.923237859
1 1/26/2020 0 0.394894147
1 1/27/2020 0 0.935664763
1 1/28/2020 0 0.667981258
1 1/29/2020 1 0.872639458
1 1/30/2020 0 0.064826005
1 1/31/2020 0 0.109456422
1 2/1/2020 0 0.386348541
1 2/2/2020 0 0.732321068
1 2/3/2020 0 0.503449099
2 1/1/2020 0 0.003918297
2 1/2/2020 0 0.294439369
2 1/4/2020 1 0.048466828
2 1/5/2020 0 0.718177034
2 1/6/2020 0 0.001364211
2 1/7/2020 0 0.440971903
2 1/8/2020 0 0.55962246
2 1/9/2020 0 0.821105414
2 1/10/2020 1 0.332539126
2 1/11/2020 0 0.289808782
2 1/12/2020 0 0.911322784
2 1/18/2020 0 0.896737429
2 1/19/2020 0 0.333353538
2 1/20/2020 0 0.19605047
2 1/21/2020 0 0.60689124
2 1/22/2020 1 0.439833751
2 1/23/2020 0 0.896841702
2 1/24/2020 0 0.147458515
2 1/25/2020 0 0.923237859
2 1/26/2020 0 0.394894147
2 1/27/2020 0 0.935664763
2 1/28/2020 0 0.667981258
2 1/29/2020 1 0.872639458
2 1/30/2020 0 0.064826005
2 1/31/2020 0 0.109456422
2 2/1/2020 0 0.386348541
2 2/2/2020 0 0.732321068
2 2/3/2020 0 0.503449099
2 2/4/2020 0 0.60689124
2 2/5/2020 0 0.439833751
2 2/6/2020 0 0.896841702
2 2/7/2020 0 0.147458515
2 2/8/2020 0 0.923237859
;



data want (drop=event);
  if _n_=1 then do;
    if 0 then set have;
    declare hash h (dataset:'have (obs=0 drop=event)',ordered:'a');
      h.definekey('obs_date');
      h.definedata(all:'Y');
      h.definedone();
    declare hiter hi ('h');

    declare hash ev (dataset:'have (obs=0 drop=event)',ordered:'a');
      ev.definekey('obs_date');
      ev.definedata(all:'Y');
      ev.definedone();
    declare hiter evi ('ev');
  end;

  do until (last.secid);  /*populate hash objects for a single SECID*/
    set have;
    by secid;
    h.add();
    if event=1 then ev.add();
  end;

  do while (evi.next()=0);
    day=0;
    eventday=obs_date;   format eventday date9.;
    output;

    hi.setcur();  /*Point hash iterator at hash dataitem for event date */
    do day=1 to 180 while (hi.next()=0);
      output;
    end;
  end;
  ev.clear();
  h.clear();
run;

 

The above needs the data to be sorted by SECID, but the data don't have to be sorted by obs_date within SECID.

ballardw
Super User

@finans_sas wrote:

Thank you so much @ballardw and @mkeintz for helping me with my inquiry. I really appreciate it. My data set is extremely large, so perhaps the data step solution may work better for me. Please accept my apologies: the data set does have IDs, and I need to ensure that these steps are executed within the same IDs only. I have not worked with hash objects before. How should I go about bringing in the IDs? I attempted it below (the bold line), but ended up getting something different.


This modification to my code may work to limit the matches to within a Secid (security id, perhaps, as in market trading?):

proc sql;
   create table need2 as
   select a.secid, a.obs_date as EventDay,a.obsnum as baseobs
          ,b.obs_date,b.value
   from (select * from need where Event=1) as a
        , 
        need as b
   where a.secid=b.secid and ( b.obsnum between a.obsnum and (180+a.obsnum) )
   order by a.secid, a.obsnum,b.obs_date
   ;
quit;

And the data step to number:

data want;
   set need2;
   by secid baseobs;
   retain daycount;
   if first.baseobs then daycount=0;
   else daycount+1;
run;

We usually recommend not splitting data, but since the SQL logic you want (a join on ranges) ends up as a Cartesian join, and the hash approach might run into memory issues, you might consider creating a data set for each Secid value and using either approach on the reduced set.

Or schedule the job to run at night and see if it finishes before you come into work the next day...
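A rough sketch of that splitting idea (the BY_SECID macro name, the PIECE and NEED2 data set names, and the 999999 upper bound are all placeholders; untested at scale):

/* Run the range join one SECID at a time and append the pieces */
%macro by_secid;
   proc sql noprint;
      select distinct secid into :sid1-:sid999999 from need;
   quit;
   %let nsid = &sqlobs;   /* number of distinct SECID values */

   %do i = 1 %to &nsid;
      proc sql;
         create table piece as
         select a.secid, a.obs_date as EventDay, a.obsnum as baseobs
                ,b.obs_date, b.value
         from (select * from need where Event=1 and secid = &&sid&i) as a
              ,
              (select * from need where secid = &&sid&i) as b
         where b.obsnum between a.obsnum and (180 + a.obsnum)
         order by a.obsnum, b.obs_date
         ;
      quit;

      proc append base=need2 data=piece force;
      run;
   %end;
%mend by_secid;

%by_secid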

 

finans_sas
Quartz | Level 8

Thank you so much @ballardw, @Patrick, and @mkeintz for helping me with my question. I sincerely appreciate all your help.

Have a wonderful weekend!
