I'm working with a dataset of lab values that occurred within 7 days of each other. Essentially, I'm trying to figure out what values are "duplicates" that couldn't be removed with a proc sort because the lab value and date are close but not the same (unfortunately a result of some reporting issues of date and value). I'm having issues because there can be multiple occurrences by id. For example, id 1 in the simulated dataset below has 3 labs within 7 days of 5/30/19 and also 2 labs within 7 days of 7/7/19.
obs | id | date | value | psource |
1 | 1 | 5/30/19 | 20 | 1 |
2 | 1 | 5/31/19 | 20 | 0 |
3 | 1 | 6/1/19 | 19 | 0 |
4 | 1 | 7/7/19 | 350 | 1 |
5 | 1 | 7/10/19 | 200 | 1 |
6 | 2 | 1/3/19 | 100 | 0 |
7 | 2 | 1/4/19 | 99 | 1 |
8 | 2 | 3/6/19 | 19 | 1 |
9 | 2 | 3/8/19 | 20 | 0 |
10 | 3 | 12/8/19 | 1500 | 1 |
11 | 3 | 12/10/19 | 1500 | 1 |
12 | 3 | 12/11/19 | 1500 | 0 |
13 | 3 | 1/15/20 | 19 | 1 |
14 | 3 | 1/16/20 | 20 | 0 |
As a result, I would like to create a count column that looks like the one below.
obs | id | date | value | psource | count |
1 | 1 | 5/30/19 | 20 | 1 | 1 |
2 | 1 | 5/31/19 | 20 | 0 | 2 |
3 | 1 | 6/1/19 | 19 | 0 | 3 |
4 | 1 | 7/7/19 | 350 | 1 | 1 |
5 | 1 | 7/10/19 | 200 | 1 | 2 |
6 | 2 | 1/3/19 | 100 | 0 | 1 |
7 | 2 | 1/4/19 | 99 | 1 | 2 |
8 | 2 | 3/6/19 | 19 | 1 | 1 |
9 | 2 | 3/8/19 | 20 | 0 | 2 |
10 | 3 | 12/8/19 | 1500 | 1 | 1 |
11 | 3 | 12/10/19 | 1500 | 1 | 2 |
12 | 3 | 12/11/19 | 1500 | 0 | 3 |
13 | 3 | 1/15/20 | 19 | 1 | 1 |
14 | 3 | 1/16/20 | 20 | 0 | 2 |
Lastly, I would like to apply some rules like if the labs occurred within 7 days of each other for a given id and are close in value (within 11), then extract only the first one that comes from the preferred source (psource=1). For example, keep row 10 and get rid of rows 11 and 12. Also, if the labs occurred within 7 days of each other and aren't close in value then create a dataset of those so that those can undergo further review. For example, output rows 4 and 5.
Thank you in advance for any help on this.
Hi @andyec I was using remove method as an additional check to remove the hash contents for failed records of Notwithin11 logic. You wouldn't need that. However that error shouldn't have occured unless your sample data is different to the real that you are testing against.
I have modified the code slightly below
data have;
input obs id date :mmddyy10. value psource;
format date mmddyy10.;
cards;
1 1 5/30/19 20 1
2 1 5/31/19 20 0
3 1 6/1/19 19 0
4 1 7/7/19 350 1
5 1 7/10/19 200 1
6 2 1/3/19 100 0
7 2 1/4/19 99 1
8 2 3/6/19 19 1
9 2 3/8/19 20 0
10 3 12/8/19 1500 1
11 3 12/10/19 1500 1
12 3 12/11/19 1500 0
13 3 1/15/20 19 1
14 3 1/16/20 20 0
;
data temp;
do until(last.id);
set have;
by id;
if first.id then do;
d=date;
count=1;
_cg=1;
end;
else if intck('days',d,date)>7 then do;
d=date;
count=1;
_cg=sum(_cg,1);
end;
else count=sum(count,1);
output;
end;
drop d;
run;
data want;
if _n_=1 then do;
dcl hash H () ;
h.definekey ("id","_cg") ;
h.definedata("_d");
h.definedone();
end;
do until(last._cg);
set temp;
by id _cg;
if psource then do;
_d=date;
h.ref();
end;
if first._cg then _c=1;
else if abs(_v-value)<=11 then _c=sum(_c,1);
_v=value;
end;
_f=_c=count;
do until(last._cg);
set temp;
by id _cg;
if _f then do;
if first._cg then rc=h.find();
if rc=0 then if _d=date then output;
end;
else do;
Notwithin11=1;
output;
end;
end;
h.clear();
drop _: rc;
run;
Hi @andyec for the count column. I'm afraid you need to explain the "close" thing logic more clearly. Thank you!
data have;
input obs id date :mmddyy10. value psource;
format date mmddyy10.;
cards;
1 1 5/30/19 20 1
2 1 5/31/19 20 0
3 1 6/1/19 19 0
4 1 7/7/19 350 1
5 1 7/10/19 200 1
6 2 1/3/19 100 0
7 2 1/4/19 99 1
8 2 3/6/19 19 1
9 2 3/8/19 20 0
10 3 12/8/19 1500 1
11 3 12/10/19 1500 1
12 3 12/11/19 1500 0
13 3 1/15/20 19 1
14 3 1/16/20 20 0
;
data want;
do until(last.id);
set have;
by id;
if first.id then do;
d=date;
count=1;
end;
else if intck('days',d,date)>7 then do;
d=date;
count=1;
end;
else count=sum(count,1);
output;
end;
drop d;
run;
proc print noobs;run;
obs | id | date | value | psource | count |
---|---|---|---|---|---|
1 | 1 | 05/30/2019 | 20 | 1 | 1 |
2 | 1 | 05/31/2019 | 20 | 0 | 2 |
3 | 1 | 06/01/2019 | 19 | 0 | 3 |
4 | 1 | 07/07/2019 | 350 | 1 | 1 |
5 | 1 | 07/10/2019 | 200 | 1 | 2 |
6 | 2 | 01/03/2019 | 100 | 0 | 1 |
7 | 2 | 01/04/2019 | 99 | 1 | 2 |
8 | 2 | 03/06/2019 | 19 | 1 | 1 |
9 | 2 | 03/08/2019 | 20 | 0 | 2 |
10 | 3 | 12/08/2019 | 1500 | 1 | 1 |
11 | 3 | 12/10/2019 | 1500 | 1 | 2 |
12 | 3 | 12/11/2019 | 1500 | 0 | 3 |
13 | 3 | 01/15/2020 | 19 | 1 | 1 |
14 | 3 | 01/16/2020 | 20 | 0 | 2 |
Hi @novinosrin thank you for the count column code.
For the other part, I want values that occurred within 7 days and are within 11 points of each other to ultimately become one value (with a preference for the one where psource=1 and if multiple have psource=1, then pick the first one). For example, I want the table below to become the one below that.
obs | id | date | value | psource |
6 | 2 | 1/3/19 | 100 | 0 |
7 | 2 | 1/4/19 | 99 | 1 |
8 | 2 | 3/6/19 | 19 | 1 |
9 | 2 | 3/8/19 | 20 | 0 |
obs | id | date | value | psource |
7 | 2 | 1/4/19 | 99 | 1 |
8 | 2 | 3/6/19 | 19 | 1 |
If the values occurred within 7 days but aren't within 11 points of each other, I want them to be output like the table below.
obs | id | date | value | psource |
4 | 1 | 7/7/19 | 350 | 1 |
5 | 1 | 7/10/19 | 200 | 1 |
Is that explained better?
HI @andyec Hmm I think I am getting it. One more clarification plz. What do you mean as "Points".
Is it the values column? To wit, taking the difference between the values -->absolute value(19-20)<=11 is something what you mean as points?
@novinosrin yes I mean the values column and abs(19-20) <= 11 between values that occurred within 7 days of each other.
Hi @andyec Sorry I got busy for my boss's requirement at my work place. Okay here is what I understand.
Logic:
1. Within an ID, <=7 interval days series of counts for one set giving rise to the variable COUNT. There could be N number of sets
2. Within each set(count1-countN) in a given ID, if the absolute difference between the values of each value that is nth value-previous nth-1 aka previous) value <=11 then a record will qualify
3. However the number of record that qualifies should be equal to the COUNT_ N i.e. if for example,count=7 and the number of records with abs(diff)<=11 should also be 7.
4. I added a NotWithin11 flag to make it convenient your review
If the above makes sense, the following should be close, albeit let me know where the discrepancy is, we shall fix it
data have;
input obs id date :mmddyy10. value psource;
format date mmddyy10.;
cards;
1 1 5/30/19 20 1
2 1 5/31/19 20 0
3 1 6/1/19 19 0
4 1 7/7/19 350 1
5 1 7/10/19 200 1
6 2 1/3/19 100 0
7 2 1/4/19 99 1
8 2 3/6/19 19 1
9 2 3/8/19 20 0
10 3 12/8/19 1500 1
11 3 12/10/19 1500 1
12 3 12/11/19 1500 0
13 3 1/15/20 19 1
14 3 1/16/20 20 0
;
data temp;
do until(last.id);
set have;
by id;
if first.id then do;
d=date;
count=1;
_cg=1;
end;
else if intck('days',d,date)>7 then do;
d=date;
count=1;
_cg=sum(_cg,1);
end;
else count=sum(count,1);
output;
end;
drop d;
run;
data want;
if _n_=1 then do;
dcl hash H () ;
h.definekey ("id","_cg") ;
h.definedata("_d");
h.definedone();
end;
do until(last._cg);
set temp;
by id _cg;
if h.check() ne 0 and psource then do;
_d=date;
h.add();
end;
if first._cg then do; _v=value;_c=1;end;
else if abs(_v-value)<=11 then _c=sum(_c,1);
end;
_f=_c=count;
if not _f then h.remove();
do until(last._cg);
set temp;
by id _cg;
if _f then do;
if first._cg then rc=h.find();
if rc=0 then if _d=date then output;
end;
else do;
Notwithin11=1;
output;
end;
end;
h.clear();
drop _: rc;
run;
obs | id | date | value | psource | count | Notwithin11 |
---|---|---|---|---|---|---|
1 | 1 | 05/30/2019 | 20 | 1 | 1 | . |
4 | 1 | 07/07/2019 | 350 | 1 | 1 | 1 |
5 | 1 | 07/10/2019 | 200 | 1 | 2 | 1 |
7 | 2 | 01/04/2019 | 99 | 1 | 2 | . |
8 | 2 | 03/06/2019 | 19 | 1 | 1 | . |
10 | 3 | 12/08/2019 | 1500 | 1 | 1 | . |
13 | 3 | 01/15/2020 | 19 | 1 | 1 | . |
@novinosrin thank you very much for this. I am getting the following error: An error has occurred during instance method OM_Remove(179) of "DATASTEP.HASH." Any idea of what this may be? I will have to try and troubleshoot in a little bit.
Hi @andyec I was using remove method as an additional check to remove the hash contents for failed records of Notwithin11 logic. You wouldn't need that. However that error shouldn't have occured unless your sample data is different to the real that you are testing against.
I have modified the code slightly below
data have;
input obs id date :mmddyy10. value psource;
format date mmddyy10.;
cards;
1 1 5/30/19 20 1
2 1 5/31/19 20 0
3 1 6/1/19 19 0
4 1 7/7/19 350 1
5 1 7/10/19 200 1
6 2 1/3/19 100 0
7 2 1/4/19 99 1
8 2 3/6/19 19 1
9 2 3/8/19 20 0
10 3 12/8/19 1500 1
11 3 12/10/19 1500 1
12 3 12/11/19 1500 0
13 3 1/15/20 19 1
14 3 1/16/20 20 0
;
data temp;
do until(last.id);
set have;
by id;
if first.id then do;
d=date;
count=1;
_cg=1;
end;
else if intck('days',d,date)>7 then do;
d=date;
count=1;
_cg=sum(_cg,1);
end;
else count=sum(count,1);
output;
end;
drop d;
run;
data want;
if _n_=1 then do;
dcl hash H () ;
h.definekey ("id","_cg") ;
h.definedata("_d");
h.definedone();
end;
do until(last._cg);
set temp;
by id _cg;
if psource then do;
_d=date;
h.ref();
end;
if first._cg then _c=1;
else if abs(_v-value)<=11 then _c=sum(_c,1);
_v=value;
end;
_f=_c=count;
do until(last._cg);
set temp;
by id _cg;
if _f then do;
if first._cg then rc=h.find();
if rc=0 then if _d=date then output;
end;
else do;
Notwithin11=1;
output;
end;
end;
h.clear();
drop _: rc;
run;
@novinosrin that appeared to work very well - thank you!
Hi @andyec Do feel free to come back to us in this thread or a new one should you need help. The hardest part in a tech forum in my humble opinion is the communication. The apparent best approach to get quick and precise answers is by offering the following:
1. A "representative sample" data of what you HAVE
2. An expected output sample(WANT) corresponding to the input sample(HAVE)
3. Briefly explaning the convert/transformation or reporting logic/purpose
and finally additional notes/comments if any.
Once this is communicated properly, the likelihood is that you will get the much needed solution at the soonest. Thank you!
Hi @novinosrin sorry one more thing. How would this change if multiple have psource=1, then pick the last one (instead of the first one)?
Try changing the the current
if psource then do;
_d=date;
h.ref();
end;
to
if psource then do;
_d=date;
h.replace();
end;
and let me know 🙂 😉
I presume by "within 7 days of each other" you are referring to the number of days between consecutive records (so if you have 7 records spaced 2 days apart, you could have a single range of 14 days).
data have;
input bs id date :mmddyy10. value psource;
format date date9.;
datalines;
1 1 5/30/2019 20 1
2 1 5/31/2019 20 0
3 1 6/1/2019 19 0
4 1 7/7/2019 350 1
5 1 7/10/2019 200 1
6 2 1/3/2019 100 0
7 2 1/4/2019 99 1
8 2 3/6/2019 19 1
9 2 3/8/2019 20 0
10 3 12/8/2019 1500 1
11 3 12/10/2019 1500 1
12 3 12/11/2019 1500 0
13 3 1/15/2020 19 1
14 3 1/16/2020 20 0
run;
data want;
set have;
by id;
if first.id=1 or dif(date)>7 then count=1;
else count+1;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.