Hello! I need to calculate the median VF value (my outcome outcome) for each visit (date) for each Patient_ID.
Here is an example of the data:
Patient_ID Date VF
1 08AUG18 6.0
1 08AUG18 6.6
1 08AUG18 4.2
1 08AUG18 6.8
1 22SEP18 6.9
1 22SEP18 7.7
1 22SEP18 6.3
1 22SEP18 6.8
2 08AUG18 7.9
2 08AUG18 7.3
2 08AUG18 7.4
2 29OCT18 7.2
2 29OCT18 7.2
3 ....etc
I only want to calculate the median when there are at least observations (ex. I do not want to calculate the median for ID #2 who only had 2 measures on 29OCT18. Therefore, I am trying to select the top 3 observations for each date and ID. Note: different ID's may have VF measures done on the same date so I cannot just select the top 3 observations for each date.
I know how to select the top 3 observations for a given ID, but do not how to select the top 3 observations for each date for each person.
Can anyone help with this?
Here's what I came up with. It sorts by VF descending in the proc sort, then a data step with a by statement. It basically establishes the "median" when the cnt = 2 and retains that until the last iteration for that ID & date and outputs it.
proc sort data=have2;
by patient_id date2 descending VF;
run;
data want (drop=cnt vf);
set have2 (rename=(date2=date));
retain cnt median;
by patient_id date;
if first.date then do;
cnt=1;
median=.;
end;
else cnt=cnt+1;
if cnt=2 then median=VF;
if last.date and cnt ge 3 then output;
run;
So what does your desired result look like?
Do you still want the 29OCT18 for ID #2 to appear in your final data set even though you dont want to calculate the median based on them?
6.0 is not the median of the top 3 observations for ID 1 on 8/8/18. the top 3 are 6, 6.6, and 6.8. Are you referring to the original sort order when you say "top 3" rather than the nominal value of VF?
If that's the case, then use data step + By statement:
data dat;
set yourdataset;
by patientid;
if first.patientid then id_count=1;
else id_count + 1;
run;
You can use that to then determine if something has at least 3 observations (do a max of id_count, or a count of distinct combination of patientid and date and join back to the DAT data set), and then keep id_count=2 for each of the patient id's that appears at least 3 times.
I would count the instances for a given patient_id and date, join to this data, and delete anything less than your threshold.
eg,
proc sql;
create table counts as select patient_id,date,n(VF) as count
from yourdataset
group by patient_id,date
;
quit;
Join back on patient_id and date, and then eliminate the combinations for which you do not want the median.
Alternately, in a single step, you could use proc means to calculate both the median and count statistics and then in a subsequent step get rid of the medians you don't want.
How are you defining top 3?
It seems weird to take the highest measurements when I suspect time is more important with this type of data.
@abc44 wrote:
Hello! I need to calculate the median VF value (my outcome outcome) for each visit (date) for each Patient_ID.
Here is an example of the data:
Patient_ID Date VF
1 08AUG18 6.0
1 08AUG18 6.6
1 08AUG18 4.2
1 08AUG18 6.8
1 22SEP18 6.9
1 22SEP18 7.7
1 22SEP18 6.3
1 22SEP18 6.8
2 08AUG18 7.9
2 08AUG18 7.3
2 08AUG18 7.4
2 29OCT18 7.2
2 29OCT18 7.2
3 ....etc
I only want to calculate the median when there are at least observations (ex. I do not want to calculate the median for ID #2 who only had 2 measures on 29OCT18. Therefore, I am trying to select the top 3 observations for each date and ID. Note: different ID's may have VF measures done on the same date so I cannot just select the top 3 observations for each date.
I know how to select the top 3 observations for a given ID, but do not how to select the top 3 observations for each date for each person.
Can anyone help with this?
Sorry that I was not clear about that. I want to calculate the median of the first 3 measures on each date.
Add an enumeration variable then and keep only ones that are less than 3.
https://stats.idre.ucla.edu/sas/faq/how-can-i-create-an-enumeration-variable-by-groups/
@abc44 wrote:
Sorry that I was not clear about that. I want to calculate the median of the first 3 measures on each date.
You want the value of the 2nd highest then? You could use PROC RANK, or a sort + data step with BY statement.
Here's what I came up with. It sorts by VF descending in the proc sort, then a data step with a by statement. It basically establishes the "median" when the cnt = 2 and retains that until the last iteration for that ID & date and outputs it.
proc sort data=have2;
by patient_id date2 descending VF;
run;
data want (drop=cnt vf);
set have2 (rename=(date2=date));
retain cnt median;
by patient_id date;
if first.date then do;
cnt=1;
median=.;
end;
else cnt=cnt+1;
if cnt=2 then median=VF;
if last.date and cnt ge 3 then output;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.