Hi @righcoastmike Thank your for your patience and feedback. Can you post a more comprehensive sample of your HAVE and the WANT. I have a feeling, I am failing to understand the requirement somewhere.
Objective : To find study id's have 3 or more ed_visits within 12 months ( we'll call them repeat users)
My understanding: If interval between "any 2 dates is <=12", that qualifies. So if month interval between first record of studyid and 3rd record or nth record of studyid is <=12, that still qualifies.
So once we have the dates extracted
md=input(compress(mth_date,'-'),yymmn6.);
Here is how the HASH logic works:
1. Dataset temp with extracted date is sorted by studyid for By group processing
2. For every observation of By group, the look up happens for all obs for it's equivalent. So if studyid has 3 obs as in the sample, the look up happens 3*3=9 times i.e 1 obs of studyid1 with all 3 obs of studyid 1, then 2nd obs of studyid1 with all 3 obs of studyid 1 etc.
3. Since it is a look ahead(next set of records i.e future dates) the look up(hash table) md variable should be greater than the current md variable of the table from which the look up is taking place.
4. Should 3 be true, we check whether number of months between the two dates is <=12. If 3 and 4 happen to be true, we sum ed_visit of current +ed_visits of the future(look up table)
4. The sum is a running total across each by group
5. At the end of each by group should the cumulative sum is >=3 then that entire group should be flagged as 1 else 0
6. In instances, where there is only one record for a by group, should the value of ed_visits is >=3 that should also be flagged as 1.
If the above is correct, I think where i perhaps was an idiot is likely this part that I am noticing now, that misses it is
The logic requires change to-->
do rc=h.find() by 0 while(rc=0); if md>_md and intck('month',_md,md)<=12 then do; k=sum(ed_visits,_ed_visits); _k+k; end;
rc=h.find_next();
end;
from the earlier below
do rc=h.find() by 0 while(rc=0); if md>_md and intck('month',_md,md)<=12 then k=sum(ed_visits,_ed_visits); rc=h.find_next(); end; _k+k;
So the changed full version is
data want;
if _n_=1 then do;
if 0 then set have;
dcl hash H (dataset:'have',ordered: "A",multidata:'y') ;
h.definekey ("studyID") ;
h.definedata ("md","ed_visits") ;
h.definedone () ;
end;
_k=0;
do until(last.studyid);
set have(rename=(md=_md ed_visits=_ed_visits));
by studyID;
do rc=h.find() by 0 while(rc=0);
if md>_md and intck('month',_md,md)<=12 then do;
k=sum(ed_visits,_ed_visits);
_k+k;
end;
rc=h.find_next();
end;
end;
repeat_user=first.studyid and last.studyid and _ed_visits>=3 or k>=3 ;
do until(last.studyid);
set have;
by studyID;
output;
end;
keep studyid mth_date ed_visits year repeat_user;
run;
Let me know. Sorry and Thank you for your patience.
Hi @novinosrin and @PaigeMiller
First off, I really appreciate the time you are spending on this. It's really helping me learn a lot.
Secondly, @novinosrin you have no need to apologize, you are both doing me a huge service helping me out on this, and anyone who would be impatient when someone is volunteering to help them probably doesn't deserve your help. Thank you.
Since things have changed a bit after a day of working on this, I thought a general update and summary of where we are at would be helpful.
1. I've managed to convert the date variables in the dataset to SAS dates, so my dataset now looks something like this:
data have;
input studyID date :anydtdte10. ed_visits year ;
cards;
1 2011-04 0 2011
1 2011-05 1 2011
1 2012-01 3 2012
1 2012-04 0 2012
1 2013-05 2 2013
1 2013-06 0 2013
2 2011-03 4 2011
3 2011-04 2 2011
3 2011-06 0 2011
3 2013-08 2 2013
4 2011-04 1 2011
4 2012-05 1 2012
4 2013-06 1 2013
4 2014-07 1 2014
5 2011-03 0 2011
5 2012-05 0 2012
5 2013-08 4 2013
5 2014-02 0 2013
run;
Ideally, I would like to end up with a dataset that looks like this
data want;
input studyID date :anydtdte10. ed_visits year repeat_user ;
cards;
1 2011-04 0 2011 1
1 2011-05 1 2011 1
1 2012-01 2 2012 1
1 2012-04 0 2012 1
1 2013-05 2 2013 1
1 2013-06 0 2013 1
2 2011-03 4 2011 1
3 2011-04 2 2011 0
3 2011-06 0 2011 0
3 2013-08 2 2013 0
4 2011-04 1 2011 0
4 2012-05 1 2012 0
4 2013-06 1 2013 0
4 2014-07 1 2014 0
5 2011-03 0 2011 1
5 2012-05 0 2012 1
5 2013-08 4 2013 1
5 2014-02 0 2013 1
run;
where 1 indicates a repeat user.
Our criteria for a repeat user is 3 or more ed visits within a 12 month period.
So in this case:
#1 is a repeat user because they had 3 visits between 2011-05 and 2012-01 (less than 12 months apart). The 2 visits in 2013-05 don't affect anything
#2 is a repeat user, because even though we only have one month of data for them, they had 4 visits in the month
#3 is not a repeat user because even though they had 4 visits, 2 were in 2011 and 2 were in 2013 (more than 12 months apart)
#4 is not a repeat user because they only had 1 visit in each year, with more than 12 months between them.
#5 is a repeat user for the same reason as #2, we just have more data because they were in the study longer.
Thus far we've tried:
Proc expand - looked promising but ran into trouble because of overlapping dates. for example, both #1 and #2 and #4 have records for 2011-04. proc throws up an error because of the duplicate values in the ID field (date)
@novinosrin 's hash method - This one feel like we're close, but while it does a good job of identifying cases like #1, it misses cases like #2 and #5 - I've tried the updated code and get the same results as before.
@novinosrin 's sql method - This one is also close, but has the opposite problem of the Hash method, it's capturing many records that have less than three ed visits within 12 months, I haven't spotted any particular pattern yet, but I'll let you know if I find anything.
and that's where we are at. I'm going to keep playing with the code I have to see if I can't figure out where things are going wrong, but I hope this clarifies things. Any further insights would be very much appreciated.
Thank you both so much
Mike
Hi and Good evening @righcoastmike A silly change that i missed to make
The k>3 below
repeat_user=first.studyid and last.studyid and _ed_visits>=3 or k>=3 ;
should have been _k>3 with a prefix _ underscore coz that's the accumlating var
So I tested with the changed below
repeat_user=first.studyid and last.studyid and _ed_visits>=3 or _k>=3 ;
Here -->
/*Again I converted to SAS dates, this is no big deal*/
data have;
input studyID date $ ed_visits year ;
md=input(compress(date,'-'),yymmn6.); /*notice here the converted SAS date variable*/
format md date9.;
cards;
1 2011-04 0 2011
1 2011-05 1 2011
1 2012-01 3 2012
1 2012-04 0 2012
1 2013-05 2 2013
1 2013-06 0 2013
2 2011-03 4 2011
3 2011-04 2 2011
3 2011-06 0 2011
3 2013-08 2 2013
4 2011-04 1 2011
4 2012-05 1 2012
4 2013-06 1 2013
4 2014-07 1 2014
5 2011-03 0 2011
5 2012-05 0 2012
5 2013-08 4 2013
5 2014-02 0 2013
run;
/*the same code with _k>3 lol hahaha*/
data want;
if _n_=1 then do;
if 0 then set have;
dcl hash H (dataset:'have',ordered: "A",multidata:'y') ;
h.definekey ("studyID") ;
h.definedata ("md","ed_visits") ;
h.definedone () ;
end;
_k=0;
do until(last.studyid);
set have(rename=(md=_md ed_visits=_ed_visits));
by studyID;
do rc=h.find() by 0 while(rc=0);
if md>_md and intck('month',_md,md)<=12 then do;
k=sum(ed_visits,_ed_visits);
_k+k;
end;
rc=h.find_next();
end;
end;
repeat_user=first.studyid and last.studyid and _ed_visits>=3 or _k>=3 ;
do until(last.studyid);
set have;
by studyID;
output;
end;
keep studyid date md ed_visits year repeat_user;
run;
Proc print noobs;run;
Here are the test results;
study repeat_
ID date ed_visits year md user
1 2011-04 0 2011 01APR2011 1
1 2011-05 1 2011 01MAY2011 1
1 2012-01 3 2012 01JAN2012 1
1 2012-04 0 2012 01APR2012 1
1 2013-05 2 2013 01MAY2013 1
1 2013-06 0 2013 01JUN2013 1
2 2011-03 4 2011 01MAR2011 1
3 2011-04 2 2011 01APR2011 0
3 2011-06 0 2011 01JUN2011 0
3 2013-08 2 2013 01AUG2013 0
4 2011-04 1 2011 01APR2011 0
4 2012-05 1 2012 01MAY2012 0
4 2013-06 1 2013 01JUN2013 0
4 2014-07 1 2014 01JUL2014 0
5 2011-03 0 2011 01MAR2011 1
5 2012-05 0 2012 01MAY2012 1
5 2013-08 4 2013 01AUG2013 1
5 2014-02 0 2013 01FEB2014 1
Please let me know at your earliest convenience.
Hi @novinosrin
Ok we're getting somewhere now, the latest code made it so the hash and the sql are producing the same results, they aren't quite correct but at least it could help to identify the problem. Here is a complete example of a study id that has been misidentified as a 1 when it should be a zero:
data have;
input studyID date $ ed_visits year ;
md=input(compress(date,'-'),yymmn6.); /*notice here the converted SAS date variable*/
format md date9.;
cards;
1 2011-04 0 2011
1 2011-05 0 2011
1 2011-06 0 2011
1 2011-07 0 2011
1 2011-08 0 2011
1 2011-09 0 2011
1 2011-10 0 2011
1 2011-11 0 2011
1 2011-12 0 2011
1 2012-01 0 2012
1 2012-02 0 2012
1 2012-03 2 2012
1 2012-04 0 2012
1 2012-05 0 2012
1 2012-06 0 2012
1 2012-07 0 2012
1 2012-08 0 2012
1 2012-09 0 2012
1 2012-10 0 2012
1 2012-11 0 2012
1 2012-12 0 2012
1 2013-01 0 2013
1 2013-02 0 2013
1 2013-03 0 2013
1 2013-04 0 2013
1 2013-05 0 2013
1 2013-06 0 2013
run;
I ran the code on just this studyid alone, and the misclassification persists. The only thing I can think of is that the fact that there are 27 records associated with the one studyid is throwing it off.
Thoughts?
Mike
sorry for the weird code, I'm not sure what happened there.
here's a better version.
mike
data have;
input studyID date $ ed_visits year ;
md=input(compress(date,'-'),yymmn6.); /*notice here the converted SAS date variable*/
format md date9.;
cards;
1 2011-04 0 2011
1 2011-05 0 2011
1 2011-06 0 2011
1 2011-07 0 2011
1 2011-08 0 2011
1 2011-09 0 2011
1 2011-10 0 2011
1 2011-11 0 2011
1 2011-12 0 2011
1 2012-01 0 2012
1 2012-02 0 2012
1 2012-03 2 2012
1 2012-04 0 2012
1 2012-05 0 2012
1 2012-06 0 2012
1 2012-07 0 2012
1 2012-08 0 2012
1 2012-09 0 2012
1 2012-10 0 2012
1 2012-11 0 2012
1 2012-12 0 2012
1 2013-01 0 2013
1 2013-02 0 2013
1 2013-03 0 2013
1 2013-04 0 2013
1 2013-05 0 2013
1 2013-06 0 2013
run;
Ok i think I am able to get the sense of the requirement after all. Let me do some due diligence and come back. Thank you again
Hi @righcoastmike Just before hitting the bed, I think I made it. Well I truly and sincerely hope. Basically the problem was I didn't understand the requirement thoroughly and hence the ding dong(back and forth) but the coding part only took about 10-12 mins . 🙂
I guess I will have to improve my reading comprehension skills in the first place.
Now, it's time for you to test and let me at the earliest and I wanna party
data have;
input studyID date $ ed_visits year ;
md=input(compress(date,'-'),yymmn6.); /*notice here the converted SAS date variable*/
format md date9.;
cards;
1 2011-04 0 2011
1 2011-05 0 2011
1 2011-06 0 2011
1 2011-07 0 2011
1 2011-08 0 2011
1 2011-09 0 2011
1 2011-10 0 2011
1 2011-11 0 2011
1 2011-12 0 2011
1 2012-01 0 2012
1 2012-02 0 2012
1 2012-03 2 2012
1 2012-04 0 2012
1 2012-05 0 2012
1 2012-06 0 2012
1 2012-07 0 2012
1 2012-08 0 2012
1 2012-09 0 2012
1 2012-10 0 2012
1 2012-11 0 2012
1 2012-12 0 2012
1 2013-01 0 2013
1 2013-02 0 2013
1 2013-03 0 2013
1 2013-04 0 2013
1 2013-05 0 2013
1 2013-06 0 2013
;
run;
data have;
input studyID date $ ed_visits year ;
md=input(compress(date,'-'),yymmn6.); /*notice here the converted SAS date variable*/
format md date9.;
cards;
1 2011-04 0 2011
1 2011-05 1 2011
1 2012-01 3 2012
1 2012-04 0 2012
1 2013-05 2 2013
1 2013-06 0 2013
2 2011-03 4 2011
3 2011-04 2 2011
3 2011-06 0 2011
3 2013-08 2 2013
4 2011-04 1 2011
4 2012-05 1 2012
4 2013-06 1 2013
4 2014-07 1 2014
5 2011-03 0 2011
5 2012-05 0 2012
5 2013-08 4 2013
5 2014-02 0 2013
run;
data want;
if _n_=1 then do;
if 0 then set have;
dcl hash H (dataset:'have',ordered: "A",multidata:'y') ;
h.definekey ("studyID") ;
h.definedata ("md","ed_visits") ;
h.definedone () ;
end;
k=0;
do until(last.studyid);
set have(rename=(md=_md ed_visits=_ed_visits));
by studyID;
k=0;
if _md>j then do;
j=intnx('month',_md,12);
put j= _md=;
do rc=h.find() by 0 while(rc=0);
if _md<=md<=j then do; put md= ed_visits=; k+ed_visits; end;
rc=h.find_next();
end;
end;
if k>=3 then repeat_user=1;
end;
do until(last.studyid);
set have;
by studyID;
repeat_user=repeat_user=1;
output;
end;
keep studyid date ed_visits year repeat_user ;
run;
And do not worry. We're super close. Chill and will look forward to you earnest communication. Good night!
Good morning @novinosrin
I ran the new code as soon as I woke up this morning, and I've got good news and bad news.
The good news is the code works. I tried it on several smaller datasets and it looks like it's doing exactly what we need it to do. Hurray!
The bad news is that when I try to run it on the full dataset I keep getting a "SAS has disconnected from the server" error. I did a little bit of research and it looks like it might be because the code generates a very large log file. Apparently this can happen in SAS enterprise guide (which is what I'm using).
here's what I've tested so far
100 000 observations - works great
200 000 observations - works but I get a warning that the log is to large
500 000 observations - doesn't run and disconnects from the server
I'm in the process of coming up with a workaround, but as always, any thoughts are much appreciated.
I'll keep you posted if I find anything
Mike
Mike
Hi @righcoastmike I was thinking you may have removed the "put statements" that i used for debugging. I was having my laptop leaning in my bed and soon as I messaged you, I fell asleep. So I didn't post a clean code as I was tired. Anyyays
Here is a cleaned code-->
data have;
input studyID date $ ed_visits year ;
md=input(compress(date,'-'),yymmn6.); /*notice here the converted SAS date variable*/
format md date9.;
cards;
1 2011-04 0 2011
1 2011-05 1 2011
1 2012-01 3 2012
1 2012-04 0 2012
1 2013-05 2 2013
1 2013-06 0 2013
2 2011-03 4 2011
3 2011-04 2 2011
3 2011-06 0 2011
3 2013-08 2 2013
4 2011-04 1 2011
4 2012-05 1 2012
4 2013-06 1 2013
4 2014-07 1 2014
5 2011-03 0 2011
5 2012-05 0 2012
5 2013-08 4 2013
5 2014-02 0 2013
run;
data want;
if _n_=1 then do;
if 0 then set have;
dcl hash H (dataset:'have',ordered: "A",multidata:'y') ;
h.definekey ("studyID") ;
h.definedata ("md","ed_visits") ;
h.definedone () ;
end;
k=0;
do until(last.studyid);
set have(rename=(md=_md ed_visits=_ed_visits));
by studyID;
k=0;
if _md>j then do;
j=intnx('month',_md,12);
do rc=h.find() by 0 while(rc=0);
if _md<=md<=j then k+ed_visits;
rc=h.find_next();
end;
end;
if k>=3 then repeat_user=1;
end;
do until(last.studyid);
set have;
by studyID;
repeat_user=repeat_user=1;
output;
end;
keep studyid date ed_visits year repeat_user ;
run;
Let me know
We did it! The code works perfectly.
Thanks so much @novinosrin, if we ever cross paths I owe you a drink :-).
I hope that one day I can get good enough at coding to help other people in the same way you helped me.
Thanks again!
Mike
FYI for others looking to do something similar:
This code does exactly what we need. The only issue that I ran into with it is that it creates a very large log file when dealing with many records, and can cause SAS EG to disconnect from the server. By adding the following line at the top of the code @novinosrin provided I redirected the log to my desktop and the problem was solved! :
proc printto log="c:\path-on-server\EG.log"; run;
I am pleased. Thank you so much. I might be hung over tomorrow lol
And feel free to reach out anytime. We are all a SAS family
@righcoastmike wrote:
sorry for the weird code, I'm not sure what happened there.
here's a better version.
mike
data have; input studyID date $ ed_visits year ; md=input(compress(date,'-'),yymmn6.); /*notice here the converted SAS date variable*/ format md date9.; cards; 1 2011-04 0 2011 1 2011-05 0 2011 1 2011-06 0 2011 1 2011-07 0 2011 1 2011-08 0 2011 1 2011-09 0 2011 1 2011-10 0 2011 1 2011-11 0 2011 1 2011-12 0 2011 1 2012-01 0 2012 1 2012-02 0 2012 1 2012-03 2 2012 1 2012-04 0 2012 1 2012-05 0 2012 1 2012-06 0 2012 1 2012-07 0 2012 1 2012-08 0 2012 1 2012-09 0 2012 1 2012-10 0 2012 1 2012-11 0 2012 1 2012-12 0 2012 1 2013-01 0 2013 1 2013-02 0 2013 1 2013-03 0 2013 1 2013-04 0 2013 1 2013-05 0 2013 1 2013-06 0 2013 run;
Now I'm confused. You have all these consecutive months, which is good, but only 2 hospital visits in the entire time. So none of these records would be flagged, because at no time the 12-month running total is 3 or more. Am I understanding this properly?
Can we get a more realistic example? Can we get an example where some patients actually do visit 3 times in a 12 month time period, and some patients do not visit 3 times in a 12 month time period?
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.