- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello SAS community,
I have a question regarding how to capture certain rows that fall under a certain criteria.
For example I have a database that has clinical information on clients. (Client 1 and 2 in this example). I also have which provider they saw and what their admission (Admit Date) and Dishcarge (dsch date) dates are. I want to only keep the first rowe where the difference between the discharge date and the next admit date (for the same client) is less than 15.
As such I would keep only row 1 (for client1) and rows 4 and 5 for Client 2.
Suggestions on SAS code that can get this output?
Row # | Client | Provider | Admit Date | Dsch Date |
1 | 1 | 3 | 2/8/2016 | 2/10/2016 |
2 | 1 | 3 | 2/26/2016 | 2/22/2016 |
3 | 1 | 4 | 2/23/2016 | 2/24/2016 |
4 | 2 | 1 | 1/2/2016 | 1/3/2016 |
5 | 2 | 3 | 3/20/2016 | 3/23/2016 |
6 | 2 | 4 | 4/25/2016 | 4/26/2016
|
Thank you in advance!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi. Did you mean "greater than or equal to 15 days" since that waht your dsired output indicates, yes/no? Also, row #2 in your data has a discharge date prior to the admit date. Nionetheless (and presuming GE 15 days, not LT 15 days) ...
data x;
input client provider (admit disch) (:mmddyy.);
format admit disch mmddyy10.;
datalines;
1 3 2/8/2016 2/10/2016
1 3 2/26/2016 2/22/2016
1 4 2/23/2016 2/24/2016
2 1 1/2/2016 1/3/2016
2 3 3/20/2016 3/23/2016
2 4 4/25/2016 4/26/2016
;
data y;
retain rec 1;
do j=1 by 1;
set x;
by client;
rec + 1;
if last.client then leave;
set x (keep=admit rename=(admit=next_admit)) point=rec;
if next_admit - disch ge 15 then output;
end;
drop j;
run;
DATA SET: y
client provider admit disch next_admit
1 3 02/08/2016 02/10/2016 02/26/2016
2 1 01/02/2016 01/03/2016 03/20/2016
2 3 03/20/2016 03/23/2016 04/25/2016
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This is a good program for the "set/by plus merge-with-offset" technique. Assuming your data are sorted by client_id/disch_date:
data want;
set have (keep=client);
by client;
merge have
have (firstobs=2 keep=admit_date rename=(admit_date=next_admit));
if last.client=0 and (next_admit-dsch_date)>=15; /*Changed admit_date to next_admit*/
run;
Correction made above thanks to @MikeZdeb
Notes:
- "SET/BY": The SET and BY statement are there generate the automatirc variables first.client and last.client, indicating whether the record-in-hand is at the beginning and/or end of data for a given client
- "Merge with offset": the MERGE statement merges dataset HAVE with itself, but the 2nd reference to have is offset by one record ("firstobs=2"). And the 2nd reference keeps only one variable - admit_date, which is renamed to next_admit.
- The result is that it is easy to test for the conditions you require., using the subsetting if statement above.
In general this is a good technique for looking ahead for a condition to control the status of the record-in-hand.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Nice.
ps I think that this ... if last.client=0 and (admit_date - dsch_date)>=15;
was supoosed to be ... if last.client=0 and (next_admit - dsch_date)>=15;
Yes/no?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have follow-up and clarifcaiton to my question,
Apologies for being confusing.
To clarify, what I want to do is if a client has multiple visits to a provider any visits that are within 15 days of each other to only take the first visit. So if a client is discharged on Jan 1 and then seen again within 15 days (so between Jan 2 and Jan 15) then only to include the Jan 1 visit. Then if a client is seen on Jan 16 then to include the Jan 16 visit as well.
My example table was not the best. Actaully according to the above description then from my example table I should be expecting row 1 for client 1 and all three rows for client 2.
I did try both ways that were suggested and it kind of worked.
So when I have the data set as follows:
(note that row 2 the admit date is before the discharge date which is wrong as zdeb had correctly pointed out)
data x;
input client provider (admit disch) (:mmddyy.);
format admit disch mmddyy10.;
datalines;
1 3 2/8/2016 2/10/2016
1 3 2/26/2016 2/22/2016
1 4 2/23/2016 2/24/2016
2 1 1/2/2016 1/3/2016
2 3 3/20/2016 3/23/2016
2 4 4/25/2016 4/26/2016
;
run;
I get the output where it gives me back rows 1, 4 and 5. NOTE: I also need row 6 for client 2 to also output.
However, if I change the admit date on row two from '2/26/16' to '2/11/16' then I only get rows 4 and 5 from client 2 and I dont get any rows from client 1 back. I would want to get row 1 from client 1 still.
To clarify, there were mistakes in my example. I would want row 1 from client 1 and all three rows from client 2, since what I am trying to do is that if there are multiple visits of a cilent and those multiple visits are all within 15 days of each other then to only give me the first vist record, however if the visits are greater than 15 days from each other then to give me each visit that is greater than 15 days. Hope this makes more sense and apologies for any confusion.
Suggestions to the code that you had shared would be greatly appreciated!
Thank you very very much! Appreciate it!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Does the following do what you want?
data have; input client provider (admit disch) (:mmddyy.); format admit disch mmddyy10.; datalines; 1 3 2/8/2016 2/10/2016 1 3 2/11/2016 2/22/2016 1 4 2/23/2016 2/24/2016 2 1 1/2/2016 1/3/2016 2 3 3/20/2016 3/23/2016 2 4 4/25/2016 4/26/2016 ; run; data want (drop=last: next:); set have (keep=client); by client; retain last_disch; merge have have (firstobs=2 keep=admit rename=(admit=next_admit)); last_disch=lag(disch); if first.client then output; else if last.client=0 then do; if (next_admit - disch)>=15 then output; end; else if last.client then do; if admit-last_disch >=15 then output; end; run;
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
OK, for all pair of consecutive records, you want only the first if it is 15 days or less prior to the latter record. And you also want all records that are more the 15 days prior.
data want;
set have (keep=client);
by client;
merge have
have (firstobs=2 keep=admit rename=(admit_date=next_admit));
if admit_date>lag(discharge_date)+15 and first.client=0 then output;
else if (last.client=0 and next_admit<=dsch_date+15) then output;
run;
But let's rephrase your request - basically you want all records EXCEPT those that are within 15 days of the prior record, right? If so, then the program can be much more compact:
data want;
set have;
by client;
if admit_date-15 < lag(dsch_date) and first.client=0 then delete;
run;
Editted addition, after @art297's endorsement:
I should add an important caveat to the 2nd option above. Make sure the "admit_date-15<lag(dsch_date)" is always the FIRST condition in the "if ... then delete" statement. I.e. don't use "if first.client=0 and admit_date-15 < lag(dsch_date) then delete".
Why? Because, regardless of the result of the IF test you always want to update the (rather short) queue maintained by the lag function (i.e. you always wnat to replace the retrieved dsch_date value with the new value of dsch_date). And since there is an "AND" connecting two conditions in the if test, SAS is smart enough not to waste resources testing the 2nd condition when the first is not true. Putting the "lag()" part as the 2nd condition would mean the queue is not always updated. Once that happens erroneous results are guaranteed.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
data have;
input client provider (admit disch) (:mmddyy.);
format admit disch mmddyy10.;
datalines;
1 3 2/8/2016 2/10/2016
1 3 2/26/2016 2/22/2016
1 4 2/23/2016 2/24/2016
2 1 1/2/2016 1/3/2016
2 3 3/20/2016 3/23/2016
2 4 4/25/2016 4/26/2016
;
data want;
merge have
have (firstobs=2 keep=client admit
rename=(admit=next_admit client=_client));
if client=_client and (next_admit-disch)>=15;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you Ksharp for your reply.
I tried this and it didnt get the result I needed.
I ran the following (Note I fixed row 2 admit date as it was after the disch date originally which was incorrect on my part)
data have;
input client provider (admit disch) (:mmddyy.);
format admit disch mmddyy10.;
datalines;
1 3 2/8/2016 2/10/2016
1 3 2/16/2016 2/22/2016
1 4 2/23/2016 2/24/2016
2 1 1/2/2016 1/3/2016
2 3 3/20/2016 3/23/2016
2 4 4/25/2016 4/26/2016
;
run;
data want;
merge have
have (firstobs=2 keep=client admit
rename=(admit=next_admit client=_client));
if client=_client and (next_admit-disch)>=15;
run;
I get the following:
client | provider | admit | disch | _client | next_admit |
2 | 1 | 1/2/2016 | 1/3/2016 | 2 | 3/20/2016 |
2 | 3 | 3/20/2016 | 3/23/2016 | 2 | 4/25/2016 |
which is not what I need. I need rows 1 for client 1 and all three rows for client 3.
But a big thank you for replying!
I think mkeintz and possibly art297s suggestions may have worked!
Testing it on my actual and bigger database!
Thank you ALL for your help, I love asking questions here and getting many useful and helpful responses!!!
I really appreciate it! You are all fantastic!!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
FWIW: I would go with @mkeintz's 2nd suggested solution:
data want; set have; by client; if admit-15 < lag(disch) and first.client=0 then delete; run;
It does the same thing as my suggested code, but will run a lot faster than the code I suggested.
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @art297,
I was just sending you a note when I noticed your reply.
You are correct @mkeintz 2nd suggestion actually yielded the result I needed.
Actually I was trying your code too (which was cool and i wanted to try it) but it didnt quite give me the results I needed.
For example: if I have a table of admit and disch of the client below
client | admit | disch |
1 | 3/5/2016 | 3/8/2016 |
1 | 3/10/2016 | 3/12/2016 |
1 | 4/3/2016 | 4/8/2016 |
1 | 4/19/2016 | 4/21/2016 |
1 | 4/23/2016 | 4/24/2016 |
1 | 5/23/2016 | 5/24/2016 |
1 | 6/1/2016 | 6/6/2016 |
1 | 12/23/2016 | 12/29/2016 |
I need the following rows to be output:
client | admit | disch |
1 | 3/5/2016 | 3/8/2016 |
1 | 4/3/2016 | 4/8/2016 |
1 | 5/23/2016 | 5/24/2016 |
1 | 12/23/2016 | 12/29/2016 |
@mkeintz 2nd suggestion does that however when I run the code you provided I dont get the exact rows I get:
client | admit | disch |
1 | 3/5/2016 | 3/8/2016 |
1 | 3/10/2016 | 3/12/2016 |
1 | 4/23/2016 | 4/24/2016 |
1 | 6/1/2016 | 6/6/2016 |
1 | 12/23/2016 | 12/29/2016 |
here is the full tables with the coloumns "needed", "mkeintz" and "art", the x under "mkeintz" and "art" indicate rows that were outputed using the respective codes and the x under the "needed" coloumn are as you may have guessed the rows that were needed:
client | admit | disch | needed | mkeintz | art |
1 | 3/5/2016 | 3/8/2016 | x | x | x |
1 | 3/10/2016 | 3/12/2016 | x | ||
1 | 4/3/2016 | 4/8/2016 | x | x | |
1 | 4/19/2016 | 4/21/2016 | |||
1 | 4/23/2016 | 4/24/2016 | x | ||
1 | 5/23/2016 | 5/24/2016 | x | x | |
1 | 6/1/2016 | 6/6/2016 | x | ||
1 | 12/23/2016 | 12/29/2016 | x | x | x |
trying to figure out where the code needs to be tweaked. maybe in the first. last.
Thanks again to all
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
My code wouldn't work because the logic wasn't correct. Unlike @mkeintz's code, mine was designed to keep rather than delete records. However, it's design was to compare current with next and sometimes previous values. The latter part wasn't in line with your specs.
The corrected version to keep rather than delete would be:
data want (drop=last:); set have; by client; retain last_disch; last_disch=lag(disch); if first.client then output; else if (admit - last_disch)>=15 then output; run;
That should make the same selections as @mkeintz's code.
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Or, in the interest of ever-more-compact code:
data want;
set have;
by client;
if (admit-15)>lag(disch) or first.client;
run;
Like my other note, make sure the test on lag(disch) is the first condition in the subsetting IF.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------