Hi,
I am using SAS 9.4 TS level 1M3. I have a data set where each ID can have multiple records. It looks like this:
ID | Epi_start | Epi_end | Epi-Number |
1 | 21-Aug-10 | 25-Aug-10 | 1 |
2 | 27-Aug-10 | 10-Sep-10 | 1 |
3 | 13-Sep-10 | 18-Sep-10 | 1 |
3 | 01-Oct-10 | 11-Oct-10 | 2 |
3 | 14-Oct-10 | 19-Oct-10 | 3 |
3 | 15-Dec-12 | 18-Dec-10 | 1 |
4 | 18-Oct-10 | 22-Oct-10 | 1 |
4 | 18-Oct-10 | 22-Oct-10 | 2 |
5 | 19-Oct-10 | 29-Oct-10 | 1 |
5 | 13-Dec-10 | 01-Jan-11 | 2 |
5 | 13-Dec-10 | 01-Jan-11 | 3 |
5 | 16-Feb-11 | 28-Feb-11 | 4 |
6 | 01-Dec-10 | 09-Dec-10 | 1 |
7 | 11-Dec-10 | 17-Dec-10 | 1 |
7 | 15-Oct-11 | 22-Oct-11 | 1 |
7 | 01-Dec-11 | 10-Dec-11 | 2 |
7 | 20-Jun-12 | 29-Jun-12 | 1 |
7 | 29-Jul-12 | 08-Aug-12 | 2 |
7 | 07-Sep-12 | 16-Sep-12 | 3 |
8 | 19-Dec-10 | 25-Dec-10 | 1 |
8 | 19-Dec-10 | 25-Dec-10 | 2 |
8 | 19-Dec-10 | 25-Dec-10 | 3 |
8 | 17-Jan-13 | 25-Jan-13 | 1 |
9 | 01-Mar-11 | 10-Mar-11 | 1 |
9 | 25-Mar-11 | 30-Mar-11 | 2 |
I want to transfer all the records of individuals who have even one identical Epi_start and Epi_end to a new file. Basically I want to get this file:
ID | Epi_start | Epi_end | Epi-Number |
4 | 18-Oct-10 | 22-Oct-10 | 1 |
4 | 18-Oct-10 | 22-Oct-10 | 2 |
5 | 19-Oct-10 | 29-Oct-10 | 1 |
5 | 13-Dec-10 | 01-Jan-11 | 2 |
5 | 13-Dec-10 | 01-Jan-11 | 3 |
5 | 16-Feb-11 | 28-Feb-11 | 4 |
8 | 19-Dec-10 | 25-Dec-10 | 1 |
8 | 19-Dec-10 | 25-Dec-10 | 2 |
8 | 19-Dec-10 | 25-Dec-10 | 3 |
8 | 17-Jan-13 | 25-Jan-13 | 1 |
I tried a number of ways myself but being new to SAS I did not get the correct results. Thanks for the help.
Slight tweak all that you need is
and max(hospital));
Full version
data have;
input ID (Epi_start Epi_end) (:Date9.) Epi_Number Hospital;
cards;
1 21-Aug-10 25-Aug-10 1 .
2 27-Aug-10 10-Sep-10 1 1
3 13-Sep-10 18-Sep-10 1 .
3 01-Oct-10 11-Oct-10 2 .
3 14-Oct-10 19-Oct-10 3 .
3 15-Dec-12 18-Dec-10 1 1
4 18-Oct-10 22-Oct-10 1 .
4 18-Oct-10 22-Oct-10 2 .
5 19-Oct-10 29-Oct-10 1 .
5 13-Dec-10 01-Jan-11 2 .
5 13-Dec-10 01-Jan-11 3 1
5 16-Feb-11 28-Feb-11 4 .
6 01-Dec-10 09-Dec-10 1 .
7 11-Dec-10 17-Dec-10 1 .
7 15-Oct-11 22-Oct-11 1 .
7 01-Dec-11 10-Dec-11 2 1
7 20-Jun-12 29-Jun-12 1 1
7 29-Jul-12 08-Aug-12 2 .
7 07-Sep-12 16-Sep-12 3 .
8 19-Dec-10 25-Dec-10 1 .
8 19-Dec-10 25-Dec-10 2 1
8 19-Dec-10 25-Dec-10 3 .
8 17-Jan-13 25-Jan-13 1 .
9 01-Mar-11 10-Mar-11 1 1
9 25-Mar-11 30-Mar-11 2 1
;
proc sql;
create table want as
select *
from have
where id in (select id from have group by id,Epi_start,Epi_end having count(*)>1 and max(hospital));
quit;
SQL is so easy 🙂
Easy in SQL
data have;
input ID (Epi_start Epi_end) (:date9.) Epi_Number;
format Epi_start Epi_end date9.;
cards;
1 21-Aug-10 25-Aug-10 1
2 27-Aug-10 10-Sep-10 1
3 13-Sep-10 18-Sep-10 1
3 01-Oct-10 11-Oct-10 2
3 14-Oct-10 19-Oct-10 3
3 15-Dec-12 18-Dec-10 1
4 18-Oct-10 22-Oct-10 1
4 18-Oct-10 22-Oct-10 2
5 19-Oct-10 29-Oct-10 1
5 13-Dec-10 01-Jan-11 2
5 13-Dec-10 01-Jan-11 3
5 16-Feb-11 28-Feb-11 4
6 01-Dec-10 09-Dec-10 1
7 11-Dec-10 17-Dec-10 1
7 15-Oct-11 22-Oct-11 1
7 01-Dec-11 10-Dec-11 2
7 20-Jun-12 29-Jun-12 1
7 29-Jul-12 08-Aug-12 2
7 07-Sep-12 16-Sep-12 3
8 19-Dec-10 25-Dec-10 1
8 19-Dec-10 25-Dec-10 2
8 19-Dec-10 25-Dec-10 3
8 17-Jan-13 25-Jan-13 1
9 01-Mar-11 10-Mar-11 1
9 25-Mar-11 30-Mar-11 2
;
proc sql;
create table want as
select *
from have
where id in (select id from have group by id,Epi_start,Epi_end having count(*)>1);
quit;
Sorry, I failed to mention that the data set that I am working with is not identical to the one I posted. My original data set is a SAS data set with about 300000 records and about 20 variables. I am not sure if I can create the have table like your answer.
Also I am particularly bad at SQL sadly. Can you please explain the last line of your code (particularly count(*)>1 part of it)?
Many thanks,
The logic is very simple:
1. Subset ID's , whose records contains more than one occurrence of identical Epi_start Epi_end which has to make a count that is greater than one.
2. LooK up ID's from the full table to the ID's the resulting subset mentioned in 1
The output will have only those matching ID's . Does that explain enough to help you comprehend?
It doesn't matter how many records or variables you have as long as the variables of interest for the subsetting logic is ID, Epi_start Epi_end
Thank you so much. It worked and I am beginning to understand it. May I ask one more question? Now assume that you have a hospital flag that is either 1 or missing. Like this:
ID | Epi_start | Epi_end | Epi-Number | Hospital |
1 | 21-Aug-10 | 25-Aug-10 | 1 | . |
2 | 27-Aug-10 | 10-Sep-10 | 1 | 1 |
3 | 13-Sep-10 | 18-Sep-10 | 1 | . |
3 | 01-Oct-10 | 11-Oct-10 | 2 | . |
3 | 14-Oct-10 | 19-Oct-10 | 3 | . |
3 | 15-Dec-12 | 18-Dec-10 | 1 | 1 |
4 | 18-Oct-10 | 22-Oct-10 | 1 | . |
4 | 18-Oct-10 | 22-Oct-10 | 2 | . |
5 | 19-Oct-10 | 29-Oct-10 | 1 | . |
5 | 13-Dec-10 | 01-Jan-11 | 2 | . |
5 | 13-Dec-10 | 01-Jan-11 | 3 | 1 |
5 | 16-Feb-11 | 28-Feb-11 | 4 | . |
6 | 01-Dec-10 | 09-Dec-10 | 1 | . |
7 | 11-Dec-10 | 17-Dec-10 | 1 | . |
7 | 15-Oct-11 | 22-Oct-11 | 1 | . |
7 | 01-Dec-11 | 10-Dec-11 | 2 | 1 |
7 | 20-Jun-12 | 29-Jun-12 | 1 | 1 |
7 | 29-Jul-12 | 08-Aug-12 | 2 | . |
7 | 07-Sep-12 | 16-Sep-12 | 3 | . |
8 | 19-Dec-10 | 25-Dec-10 | 1 | . |
8 | 19-Dec-10 | 25-Dec-10 | 2 | 1 |
8 | 19-Dec-10 | 25-Dec-10 | 3 | . |
8 | 17-Jan-13 | 25-Jan-13 | 1 | . |
9 | 01-Mar-11 | 10-Mar-11 | 1 | 1 |
9 | 25-Mar-11 | 30-Mar-11 | 2 | 1 |
Now lets say I want the records where at least one of the identical epi_start and epi_end has a hospital flag of 1. Like table below:
ID | Epi_start | Epi_end | Epi-Number | Hospital |
5 | 19-Oct-10 | 29-Oct-10 | 1 | . |
5 | 13-Dec-10 | 01-Jan-11 | 2 | . |
5 | 13-Dec-10 | 01-Jan-11 | 3 | 1 |
5 | 16-Feb-11 | 28-Feb-11 | 4 | . |
8 | 19-Dec-10 | 25-Dec-10 | 1 | . |
8 | 19-Dec-10 | 25-Dec-10 | 2 | 1 |
8 | 19-Dec-10 | 25-Dec-10 | 3 | . |
8 | 17-Jan-13 | 25-Jan-13 | 1 | . |
How would you add that to your code? Thank you so much again.
Slight tweak all that you need is
and max(hospital));
Full version
data have;
input ID (Epi_start Epi_end) (:Date9.) Epi_Number Hospital;
cards;
1 21-Aug-10 25-Aug-10 1 .
2 27-Aug-10 10-Sep-10 1 1
3 13-Sep-10 18-Sep-10 1 .
3 01-Oct-10 11-Oct-10 2 .
3 14-Oct-10 19-Oct-10 3 .
3 15-Dec-12 18-Dec-10 1 1
4 18-Oct-10 22-Oct-10 1 .
4 18-Oct-10 22-Oct-10 2 .
5 19-Oct-10 29-Oct-10 1 .
5 13-Dec-10 01-Jan-11 2 .
5 13-Dec-10 01-Jan-11 3 1
5 16-Feb-11 28-Feb-11 4 .
6 01-Dec-10 09-Dec-10 1 .
7 11-Dec-10 17-Dec-10 1 .
7 15-Oct-11 22-Oct-11 1 .
7 01-Dec-11 10-Dec-11 2 1
7 20-Jun-12 29-Jun-12 1 1
7 29-Jul-12 08-Aug-12 2 .
7 07-Sep-12 16-Sep-12 3 .
8 19-Dec-10 25-Dec-10 1 .
8 19-Dec-10 25-Dec-10 2 1
8 19-Dec-10 25-Dec-10 3 .
8 17-Jan-13 25-Jan-13 1 .
9 01-Mar-11 10-Mar-11 1 1
9 25-Mar-11 30-Mar-11 2 1
;
proc sql;
create table want as
select *
from have
where id in (select id from have group by id,Epi_start,Epi_end having count(*)>1 and max(hospital));
quit;
SQL is so easy 🙂
Thank you for replying again. I think there might be a problem with this additional code. If we have identical Epi_start and Epi_end that are not hospitalizations in one individual's records but then we also have one separate occasion of hospitalization (hospital=1) then wouldn't that set of records make it to our final list? I mean something like this:
ID | Epi_start | Epi_end | Epi-Number | Hospital |
7 | 11-Dec-10 | 17-Dec-10 | 1 | . |
7 | 15-Oct-11 | 22-Oct-11 | 1 | . |
7 | 01-Dec-11 | 10-Dec-11 | 2 | 1 |
7 | 20-Jun-12 | 29-Jun-12 | 1 | 1 |
7 | 29-Jul-12 | 08-Aug-12 | 2 | . |
7 | 29-Jul-12 | 08-Aug-12 | 3 | . |
I might be wrong though. I have used SQL only once before (and it did not work that time). I am still not sure if I am getting your logic (both the count(*)>1 and max(hospital) sections). Thanks.
Hi @Primavera Yes, the current logic/logic so far will not pick ID7. The current logic approach is very simple:
1. Identify the number of identical occurrences of Epi_start and Epi_end for a given ID. If it is greater than one, it passes the 1st test
2. Use the result of 1 and determine if any of the records in the subset has hospitalization flag.
If the check passes both the tests, output all the records of those IDs. For ID7 in your latest post, though it passes the 1st test, it fails the 2nd, hence 7 gets dropped .
Please review your requirement thoroughly and let us know what you want in/as final. It's cool to go back and forth as long we hit the much needed target.
@Primavera wrote:
Thank you for replying again. I think there might be a problem with this additional code. If we have identical Epi_start and Epi_end that are not hospitalizations in one individual's records but then we also have one separate occasion of hospitalization (hospital=1) then wouldn't that set of records make it to our final list? I mean something like this:
ID Epi_start Epi_end Epi-Number Hospital 7 11-Dec-10 17-Dec-10 1 . 7 15-Oct-11 22-Oct-11 1 . 7 01-Dec-11 10-Dec-11 2 1 7 20-Jun-12 29-Jun-12 1 1 7 29-Jul-12 08-Aug-12 2 . 7 29-Jul-12 08-Aug-12 3 .
I might be wrong though. I have used SQL only once before (and it did not work that time). I am still not sure if I am getting your logic (both the count(*)>1 and max(hospital) sections). Thanks.
Hi again,
May I take a bit of your time again?
I am trying to modify your code for another need. I want to gather all the records of an ID if that individual has had an Epi_start before a certain date (say 01Jan2011). I don't care about any of the conditions that I had previously. This time I only care about Epi_start date.
I have modified the last line to :
where ID in (select ID from have group by ID having count(*)>1 and Epi_start LT '01JAN2011'd );
But it does not work and giving me a table with 0 rows and this note in the log:
NOTE: The query requires remerging summary statistics back with the original data.
Can you please tell me where I have gone wrong?
Thank you,
HI @Primavera
Since the requirement is much too simple considering your notes
1. I want to gather all the records of an ID if that individual has had an Epi_start before a certain date (say 01Jan2011).
2. I don't care about any of the conditions that I had previously. This time I only care about Epi_start date.
The logic becomes merely a filter, and no need for a summary statistics. Therefore, essentially how you would build is
1. where Epi_start LT '01JAN2011'd - This will filter with ID's satisfying the condition
2. Now, get other corresponding records for the filtered ID's.
So in essence, the mind should now think 1 as a subquery( or in simple terms 1st filter query) and 2 An outer query or a full query to get all the corresponding records for those ID's. That's as simple or as terse the build logic would become.
Now, let's turn English words to SQL words
proc sql;
create table want as
select *
from have
where id in (select id from have where Epi_start LT '01JAN2011'd);
quit;
Hope the explanation gives that spark to help readers build the logic. Cheers! Have fun!
Hi @novinosrin ,
Thank you very much for your help and I am very sorry for my belated message of gratitude. I was occupied with something else.
I tried something very very similar to your code when I was trying to modify your previous code and it did not work. Now I plugged in your simple code and voila! Worked like a charm. So frustrating when my own code does not work and a master comes in and solves it so easily.
Thank you again for coming to my help again and thank you for your perfect explanations.
Regards,
Hi again,
I hope you are safe and well.
I am very sorry for taking your time once more. I have a problem that at first I thought I can code in less than 30 minutes but now 2 days later I have given up. It kind of follows the same trail as my previous questions. Let's say I have a data set with about 100 K participants. The records of one of these people could look like this:
ID | dispensation | Epi_start | Epi_end | Dispen-Number | Hospital | epi_duration |
5 | 21-May-10 | 21-May-10 | 29-May-10 | 1 | . | 9 |
5 | 27-Aug-10 | 27-Aug-10 | 28-Aug-10 | 1 | 1 | 2 |
5 | 13-Sep-10 | 13-Sep-10 | 30-Sep-10 | 1 | . | 18 |
5 | 11-Oct-10 | 13-Sep-10 | 25-Oct-10 | 2 | . | 43 |
5 | 05-Nov-10 | 13-Sep-10 | 20-Dec-10 | 3 | . | 99 |
5 | 15-Dec-12 | 15-Dec-12 | 17-Dec-12 | 1 | 1 | 3 |
5 | 18-Oct-13 | 18-Oct-13 | 22-Oct-13 | 1 | . | 5 |
5 | 01-Nov-13 | 18-Oct-13 | 09-Nov-13 | 2 | . | 23 |
5 | 03-Mar-14 | 03-Mar-14 | 15-Mar-14 | 1 | . | 13 |
various people can have various records of course but it mainly looks like above. I want to gather all the records of drug dispensations in an episode if that episode has lasted more than 2 months. For example in the case of the subject with ID=5 that would be:
ID | dispensation | Epi_start | Epi_end | Dispen-Number | Hospital | epi_duration |
5 | 13-Sep-10 | 13-Sep-10 | 30-Sep-10 | 1 | . | 18 |
5 | 11-Oct-10 | 13-Sep-10 | 25-Oct-10 | 2 | . | 43 |
5 | 05-Nov-10 | 13-Sep-10 | 20-Dec-10 | 3 | . | 99 |
Thank you once more,
Hi @Primavera We are okay. Thank you. I hope the same is the case where you are at. Okay, I vaguely recall you wanting Proc SQL based solutions? Is that correct or you have no preference?
Since our discussions have been proc sql thus far, I'd go with the same-
/*I want to gather all the records of drug dispensations in an episode if that episode has lasted more than 2 months.*/
data have;
input ID (dispensation Epi_start Epi_end) (:date9.) Dispen_Number Hospital epi_duration;
format dispensation Epi_start Epi_end date9.;
cards;
5 21-May-10 21-May-10 29-May-10 1 . 9
5 27-Aug-10 27-Aug-10 28-Aug-10 1 1 2
5 13-Sep-10 13-Sep-10 30-Sep-10 1 . 18
5 11-Oct-10 13-Sep-10 25-Oct-10 2 . 43
5 05-Nov-10 13-Sep-10 20-Dec-10 3 . 99
5 15-Dec-12 15-Dec-12 17-Dec-12 1 1 3
5 18-Oct-13 18-Oct-13 22-Oct-13 1 . 5
5 01-Nov-13 18-Oct-13 09-Nov-13 2 . 23
5 03-Mar-14 03-Mar-14 15-Mar-14 1 . 13
;
proc sql;
create table want as
select *
from have
group by id,Epi_start
having intck('mon',Epi_start,max(Epi_end),'c')>2
order by id,dispensation,Epi_start,Epi_end;
quit;
Hi,
Thank you very much once again. I tried your code and it works perfectly. I figured out most of the logic for the code. Just a few question:
1. Why did you group by both ID and Epi_start and not just by ID?
2. 'c' stands for 'continuous', right?
3. How would you modify this code to just grab the longest episode or the first qualifying episode of each participant (if some participants have more than 1 qualifying episodes)?
I actually don't prefer SQL as I am very bad at it. I am slowly reading and learning about it. Do you know how to code for this without using SQL? I tried several methods including reversing the data set and trying to flag the desired dispensations and none of them worked. I am very sorry for taking your time; it's just that your codes are very professional and educational for me.
Thanks again,
Need not be so formal. It's fun to be around here. I am about to go home. I might login from home and will offer an alternative tonight or tomorrow morning. It's been a long day.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.