Solved: Count number of times within a timeframe by id, multiple occurrences b...

andyec · Posted 04-16-2020 11:53 AM

I'm working with a dataset of lab values that occurred within 7 days of each other. Essentially, I'm trying to figure out what values are "duplicates" that couldn't be removed with a proc sort because the lab value and date are close but not the same (unfortunately a result of some reporting issues of date and value). I'm having issues because there can be multiple occurrences by id. For example, id 1 in the simulated dataset below has 3 labs within 7 days of 5/30/19 and also 2 labs within 7 days of 7/7/19.

obs	id	date	value	psource
1	1	5/30/19	20	1
2	1	5/31/19	20	0
3	1	6/1/19	19	0
4	1	7/7/19	350	1
5	1	7/10/19	200	1
6	2	1/3/19	100	0
7	2	1/4/19	99	1
8	2	3/6/19	19	1
9	2	3/8/19	20	0
10	3	12/8/19	1500	1
11	3	12/10/19	1500	1
12	3	12/11/19	1500	0
13	3	1/15/20	19	1
14	3	1/16/20	20	0

As a result, I would like to create a count column that looks like the one below.

obs	id	date	value	psource	count
1	1	5/30/19	20	1	1
2	1	5/31/19	20	0	2
3	1	6/1/19	19	0	3
4	1	7/7/19	350	1	1
5	1	7/10/19	200	1	2
6	2	1/3/19	100	0	1
7	2	1/4/19	99	1	2
8	2	3/6/19	19	1	1
9	2	3/8/19	20	0	2
10	3	12/8/19	1500	1	1
11	3	12/10/19	1500	1	2
12	3	12/11/19	1500	0	3
13	3	1/15/20	19	1	1
14	3	1/16/20	20	0	2

Lastly, I would like to apply some rules like if the labs occurred within 7 days of each other for a given id and are close in value (within 11), then extract only the first one that comes from the preferred source (psource=1). For example, keep row 10 and get rid of rows 11 and 12. Also, if the labs occurred within 7 days of each other and aren't close in value then create a dataset of those so that those can undergo further review. For example, output rows 4 and 5.

Thank you in advance for any help on this.

novinosrin · Posted 04-16-2020 07:19 PM

Hi @andyec I was using remove method as an additional check to remove the hash contents for failed records of Notwithin11 logic. You wouldn't need that. However that error shouldn't have occured unless your sample data is different to the real that you are testing against.

I have modified the code slightly below



data have;
input obs	id	date :mmddyy10.	value	psource;
format date mmddyy10.;
cards;
1	1	5/30/19	20	1
2	1	5/31/19	20	0
3	1	6/1/19	19	0
4	1	7/7/19	350	1
5	1	7/10/19	200	1
6	2	1/3/19	100	0
7	2	1/4/19	99	1
8	2	3/6/19	19	1
9	2	3/8/19	20	0
10	3	12/8/19	1500	1
11	3	12/10/19	1500	1
12	3	12/11/19	1500	0
13	3	1/15/20	19	1
14	3	1/16/20	20	0
;


data temp;
 do until(last.id);
  set have;
  by id;
  if first.id then do;
   d=date;
   count=1;
   _cg=1;
  end;
  else if intck('days',d,date)>7 then do;
   d=date;
   count=1;
   _cg=sum(_cg,1);
  end;
  else count=sum(count,1);
  output;
 end;
 drop d;
run;

data want;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("id","_cg") ;
   h.definedata("_d");
   h.definedone();
 end;
 do until(last._cg);
  set temp;
  by id _cg;
  if psource then do;
    _d=date;
    h.ref();
  end;
  if first._cg then _c=1;
  else if abs(_v-value)<=11 then _c=sum(_c,1);
  _v=value;
 end;  
 _f=_c=count;
 do until(last._cg);
  set temp;
  by id _cg;
  if _f then do;
    if  first._cg  then rc=h.find();
    if rc=0 then if _d=date then output;
  end;
  else do;
    Notwithin11=1;
    output;
  end;
 end;
 h.clear();
 drop _: rc;
run;

View solution in original post

novinosrin · Posted 04-16-2020 01:03 PM

Hi @andyec for the count column. I'm afraid you need to explain the "close" thing logic more clearly. Thank you!


data have;
input obs	id	date :mmddyy10.	value	psource;
format date mmddyy10.;
cards;
1	1	5/30/19	20	1
2	1	5/31/19	20	0
3	1	6/1/19	19	0
4	1	7/7/19	350	1
5	1	7/10/19	200	1
6	2	1/3/19	100	0
7	2	1/4/19	99	1
8	2	3/6/19	19	1
9	2	3/8/19	20	0
10	3	12/8/19	1500	1
11	3	12/10/19	1500	1
12	3	12/11/19	1500	0
13	3	1/15/20	19	1
14	3	1/16/20	20	0
;


data want;
 do until(last.id);
  set have;
  by id;
  if first.id then do;
   d=date;
   count=1;
  end;
  else if intck('days',d,date)>7 then do;
   d=date;
   count=1;
  end;
  else count=sum(count,1);
  output;
 end;
 drop d;
run;

proc print noobs;run;

obs	id	date	value	psource	count
1	1	05/30/2019	20	1	1
2	1	05/31/2019	20	0	2
3	1	06/01/2019	19	0	3
4	1	07/07/2019	350	1	1
5	1	07/10/2019	200	1	2
6	2	01/03/2019	100	0	1
7	2	01/04/2019	99	1	2
8	2	03/06/2019	19	1	1
9	2	03/08/2019	20	0	2
10	3	12/08/2019	1500	1	1
11	3	12/10/2019	1500	1	2
12	3	12/11/2019	1500	0	3
13	3	01/15/2020	19	1	1
14	3	01/16/2020	20	0	2

andyec · Posted 04-16-2020 01:49 PM

Hi @novinosrin thank you for the count column code.

For the other part, I want values that occurred within 7 days and are within 11 points of each other to ultimately become one value (with a preference for the one where psource=1 and if multiple have psource=1, then pick the first one). For example, I want the table below to become the one below that.

obs	id	date	value	psource
6	2	1/3/19	100	0
7	2	1/4/19	99	1
8	2	3/6/19	19	1
9	2	3/8/19	20	0

obs	id	date	value	psource
7	2	1/4/19	99	1
8	2	3/6/19	19	1

If the values occurred within 7 days but aren't within 11 points of each other, I want them to be output like the table below.

obs	id	date	value	psource
4	1	7/7/19	350	1
5	1	7/10/19	200	1

Is that explained better?

novinosrin · Posted 04-16-2020 02:10 PM

HI @andyec Hmm I think I am getting it. One more clarification plz. What do you mean as "Points".

Is it the values column? To wit, taking the difference between the values -->absolute value(19-20)<=11 is something what you mean as points?

andyec · Posted 04-16-2020 02:15 PM

@novinosrin yes I mean the values column and abs(19-20) <= 11 between values that occurred within 7 days of each other.

novinosrin · Posted 04-16-2020 04:04 PM

Hi @andyec Sorry I got busy for my boss's requirement at my work place. Okay here is what I understand.

Logic:

1. Within an ID, <=7 interval days series of counts for one set giving rise to the variable COUNT. There could be N number of sets

2. Within each set(count1-countN) in a given ID, if the absolute difference between the values of each value that is nth value-previous nth-1 aka previous) value <=11 then a record will qualify

3. However the number of record that qualifies should be equal to the COUNT_ N i.e. if for example,count=7 and the number of records with abs(diff)<=11 should also be 7.

4. I added a NotWithin11 flag to make it convenient your review

If the above makes sense, the following should be close, albeit let me know where the discrepancy is, we shall fix it


data have;
input obs	id	date :mmddyy10.	value	psource;
format date mmddyy10.;
cards;
1	1	5/30/19	20	1
2	1	5/31/19	20	0
3	1	6/1/19	19	0
4	1	7/7/19	350	1
5	1	7/10/19	200	1
6	2	1/3/19	100	0
7	2	1/4/19	99	1
8	2	3/6/19	19	1
9	2	3/8/19	20	0
10	3	12/8/19	1500	1
11	3	12/10/19	1500	1
12	3	12/11/19	1500	0
13	3	1/15/20	19	1
14	3	1/16/20	20	0
;


data temp;
 do until(last.id);
  set have;
  by id;
  if first.id then do;
   d=date;
   count=1;
   _cg=1;
  end;
  else if intck('days',d,date)>7 then do;
   d=date;
   count=1;
   _cg=sum(_cg,1);
  end;
  else count=sum(count,1);
  output;
 end;
 drop d;
run;

data want;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("id","_cg") ;
   h.definedata("_d");
   h.definedone();
 end;
 do until(last._cg);
  set temp;
  by id _cg;
  if h.check() ne 0 and psource then do;
    _d=date;
    h.add();
  end;
  if first._cg then do; _v=value;_c=1;end;
  else if abs(_v-value)<=11 then _c=sum(_c,1);
 end;
 _f=_c=count;
 if not _f then h.remove();
 do until(last._cg);
  set temp;
  by id _cg;
  if _f then do;
    if  first._cg  then rc=h.find();
    if rc=0 then if _d=date then output;
  end;
  else do;
    Notwithin11=1;
    output;
  end;
 end;
 h.clear();
 drop _: rc;
run;

obs	id	date	value	psource	count	Notwithin11
1	1	05/30/2019	20	1	1	.
4	1	07/07/2019	350	1	1	1
5	1	07/10/2019	200	1	2	1
7	2	01/04/2019	99	1	2	.
8	2	03/06/2019	19	1	1	.
10	3	12/08/2019	1500	1	1	.
13	3	01/15/2020	19	1	1	.

andyec · Posted 04-16-2020 05:47 PM

@novinosrin thank you very much for this. I am getting the following error: An error has occurred during instance method OM_Remove(179) of "DATASTEP.HASH." Any idea of what this may be? I will have to try and troubleshoot in a little bit.

novinosrin · Posted 04-16-2020 07:19 PM

Hi @andyec I was using remove method as an additional check to remove the hash contents for failed records of Notwithin11 logic. You wouldn't need that. However that error shouldn't have occured unless your sample data is different to the real that you are testing against.

I have modified the code slightly below



data have;
input obs	id	date :mmddyy10.	value	psource;
format date mmddyy10.;
cards;
1	1	5/30/19	20	1
2	1	5/31/19	20	0
3	1	6/1/19	19	0
4	1	7/7/19	350	1
5	1	7/10/19	200	1
6	2	1/3/19	100	0
7	2	1/4/19	99	1
8	2	3/6/19	19	1
9	2	3/8/19	20	0
10	3	12/8/19	1500	1
11	3	12/10/19	1500	1
12	3	12/11/19	1500	0
13	3	1/15/20	19	1
14	3	1/16/20	20	0
;


data temp;
 do until(last.id);
  set have;
  by id;
  if first.id then do;
   d=date;
   count=1;
   _cg=1;
  end;
  else if intck('days',d,date)>7 then do;
   d=date;
   count=1;
   _cg=sum(_cg,1);
  end;
  else count=sum(count,1);
  output;
 end;
 drop d;
run;

data want;
 if _n_=1 then do;
   dcl hash H () ;
   h.definekey  ("id","_cg") ;
   h.definedata("_d");
   h.definedone();
 end;
 do until(last._cg);
  set temp;
  by id _cg;
  if psource then do;
    _d=date;
    h.ref();
  end;
  if first._cg then _c=1;
  else if abs(_v-value)<=11 then _c=sum(_c,1);
  _v=value;
 end;  
 _f=_c=count;
 do until(last._cg);
  set temp;
  by id _cg;
  if _f then do;
    if  first._cg  then rc=h.find();
    if rc=0 then if _d=date then output;
  end;
  else do;
    Notwithin11=1;
    output;
  end;
 end;
 h.clear();
 drop _: rc;
run;

andyec · Posted 04-16-2020 08:12 PM

@novinosrin that appeared to work very well - thank you!

novinosrin · Posted 04-16-2020 08:19 PM

Hi @andyec Do feel free to come back to us in this thread or a new one should you need help. The hardest part in a tech forum in my humble opinion is the communication. The apparent best approach to get quick and precise answers is by offering the following:

1. A "representative sample" data of what you HAVE

2. An expected output sample(WANT) corresponding to the input sample(HAVE)

3. Briefly explaning the convert/transformation or reporting logic/purpose

and finally additional notes/comments if any.

Once this is communicated properly, the likelihood is that you will get the much needed solution at the soonest. Thank you!

andyec · Posted 04-17-2020 08:57 AM

Hi @novinosrin sorry one more thing. How would this change if multiple have psource=1, then pick the last one (instead of the first one)?

novinosrin · Posted 04-17-2020 09:10 AM

Try changing the the current

  if psource then do;
    _d=date;
    h.ref();
  end;

to

  if psource then do;
    _d=date;
    h.replace();
  end;

and let me know 🙂 😉

mkeintz · Posted 04-16-2020 05:59 PM

I presume by "within 7 days of each other" you are referring to the number of days between consecutive records (so if you have 7 records spaced 2 days apart, you could have a single range of 14 days).

data have;
  input bs	id	date  :mmddyy10.	value	psource;
  format date date9.;
datalines;
1	1	5/30/2019	20	1
2	1	5/31/2019	20	0
3	1	6/1/2019	19	0
4	1	7/7/2019	350	1
5	1	7/10/2019	200	1
6	2	1/3/2019	100	0
7	2	1/4/2019	99	1
8	2	3/6/2019	19	1
9	2	3/8/2019	20	0
10	3	12/8/2019	1500	1
11	3	12/10/2019	1500	1
12	3	12/11/2019	1500	0
13	3	1/15/2020	19	1
14	3	1/16/2020	20	0
run;

data want;
  set have;
  by id;
  if first.id=1 or dif(date)>7 then count=1;
  else count+1;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Re: Count number of times within a timeframe by id, multiple occurrences by id possible

Classroom Training Available!