Solved: Re: Calculate weighted average across rows with conditions

Cruise · Posted 11-13-2018 08:00 PM

I have two different datasets which represents patients who have been diagnosed with colon or rectum cancer (colorectal cancer).

The first data (attached to this post) has following four variables (date of diagnosis known thus interval variable DOD_DIFF was calculated):

Date: ‘date of visit’
DOD_DIFF: the difference between the ‘date of visit’ and the date of diagnosis confirmed, ranges 1 thru 30
CODE: 12 distinct diagnosis code
WGT_DX: arbitrary weight to indicate how close the diagnosis is to a true diagnosis, a colorectal cancer. For example, 153 takes value 1 because it is an exact code for a colon cancer. 154 takes value 1 because it is an exact code for a rectum cancer. 211 takes value 6 because it’s a code for a benign cancer thus a weaker evidence of a cancer.

The second data:

Same variables but DOD_DIFF is not known due to missing in date of diagnosis.

My goal is to answer a question: if you pick the first diagnosis associated with the highest weight ‘WGT_DX’ how much error will you produce based on DOD_DIFF variable?

Therefore, I want to estimate this error using the first data with known information by picking the first diagnosis associated with the highest weight ‘WGT_DX’, before applying this approach to the data with unknown date of diagnosis.

To serve this purpose, I’d like to calculate weighted average score for each patient and assess the distribution of WEIGHTED_AVERAGE_SCORE for the entire population afterwards.

The logic used to calculate a weighted average error for each patient is following:

MIN(DOD_DIFF) AS MIN_DIFF

MIN(WGT_DX) AS MIN_WGT,

MIN_DIFF*MIN_WGT AS DIFF_WEIGHT,

SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVERAGE_SCORE PER PATIENT

I really hope this makes sense. Please let me know if not!

The SAS codes used is below which led me to an error:

DATA HAVE(KEEP=WGT_DX ID CODE DOD_DIFF SER_DATE_DX); SET HAVE; /*N_ROWS=142,598, N_DISTINCT_PATIENTS=6,375*/
IF CODE IN ('153') THEN WGT_DX=1; ELSE    
IF CODE IN ('154') THEN WGT_DX=1; ELSE   
IF CODE IN ('159') THEN WGT_DX=2; ELSE  
IF CODE IN ('197') THEN WGT_DX=3; ELSE  
IF CODE IN ('199') THEN WGT_DX=3; ELSE  
IF CODE IN ('209') THEN WGT_DX=5; ELSE  
IF CODE IN ('211') THEN WGT_DX=6; ELSE  
IF CODE IN ('230') THEN WGT_DX=2; ELSE  
IF CODE IN ('235') THEN WGT_DX=3; ELSE  
IF CODE IN ('239') THEN WGT_DX=3; ELSE  
IF CODE IN ('V10') THEN WGT_DX=4; ELSE  
IF CODE IN ('V76') THEN WGT_DX=4; 
RUN;

PROC SORT DATA=HAVE;
BY ID CODE SER_DATE_DX;
FORMAT SER_DATE_DX DATE9.; 
RUN; 

PROC SQL;
create table ERROR_FIRST_DIAGNOSIS as
select CODE, ID, MIN(DOD_DIFF) AS MIN_DIFF,
                 MIN(WGT_DX) AS MIN_WGT,
				 MIN_DIFF*MIN_WGT AS DIFF_WEIGHT,
   SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVER
from HAVE
GROUP BY ID;
QUIT; 

PROC UNIVARIATE DATA=ERROR_FIRST_DIAGNOSIS;
VAR WEIGHTED_AVER;
HISTOGRAM;
RUN;

817 PROC SQL;

818 create table P.ERROR_FIRST_DIAGNOSIS as

819 select CODE, ID, MIN(DOD_DIFF) AS MIN_DIFF,

820 MIN(WGT_DX) AS MIN_WGT,

821 MIN_DIFF*MIN_WGT AS DIFF_WEIGHT,

822 SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVER

823 from P.HAVE

824 GROUP BY ID;

ERROR: The SUM summary function requires a numeric argument.

ERROR: The following columns were not found in the contributing tables: DIFF_WEIGHT, MIN_DIFF,

MIN_WGT.

825 QUIT;

picking first diagnosis blindly can cause serious error or none. but the range or error need to be estimated using the date with known information

mkeintz · Posted 11-14-2018 02:44 PM

@Cruise

Here's what I understand you to be saying:

In the absence of DOD_DIFF, then for each ID:
1. choose the earliest data of visit
2. If there are multiple records for that data, choose the highest WGT_DX, call is WGT_DX_PARTIAL
3. You want be one such value per ID. (There seems to be no reference to the variable CODE in your numeric example).
In the presence of DOD_DIFF, for each ID:
1. choose the minimum DOD_DIFF
2. If there are multiple records for that value of DOD_DIFF< choose the highest WGT_DX, call it WGT_DX_FULL
3. Again you want one such value per ID
I presume your measurement of error, is just WGT_DX_ERROR= WGT_DX_FULL - WGT_DX_PARTIAL. Again this is one value per ID, correct?
So dos this mean you simply want to regress WGT_DX_FULL on WGT_DX_PARTIAL? If so, you don't need table 2, just table 1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Cruise · Posted 11-13-2018 08:02 PM

This post was inspired by my previous post that you guys had kindly responded.

@novinosrin @Reeza @mkeintz

novinosrin · Posted 11-13-2018 08:15 PM

Hi @Cruise Good evening, I am afraid I am about to run to catch red line CTA to get back home(Chicago). I will certainly look into it tomorrow soon as I get back to my lab, but I am pretty sure somebody else will respond tonight. Have a good one!

mkeintz · Posted 11-13-2018 11:37 PM

@Cruise wrote:

To serve this purpose, I’d like to calculate weighted average score for each patient and assess the distribution of WEIGHTED_AVERAGE_SCORE for the entire population afterwards.

The logic used to calculate a weighted average error for each patient is following:

MIN(DOD_DIFF) AS MIN_DIFF

MIN(WGT_DX) AS MIN_WGT,

MIN_DIFF*MIN_WGT AS DIFF_WEIGHT,

SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVERAGE_SCORE PER PATIENT

I really hope this makes sense. Please let me know if not!

I'm a little confused. Is this correct?

MIN(DOD_DIFF) AS MIN_DIFF is a constant within a given ID

MIN(WGT_DX) AS MIN_WGT is also a constant within a given ID

MIN_DIFF*MIN_WGT AS DIFF_WEIGHT is therefore also a constant within a given ID

So wouldn't

SUM(DIFF_WEIGHT)/SUM(MIN_WGT) be same ratio as DIFF_WEIGHT/MIN_WGT?
which by definition would end up being
(MIN_DIFF*MIN_WGT)/MIN_WGT = MIN_DIFF?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Cruise · Posted 11-14-2018 12:33 AM

Good catch, I think the proc sql needs to have group by: Id and code. Because I was trying to deduplicate the data first by Id and code then estimate the weighted average per patient.
After dedup do: Min-diff*min-wgt and sum this for each patient and divide by sum of min-wgt for each patient.
Does that make sense?
Any alternative approach to estimate the error associated with the picks of the first high evidence diagnoses regardless of dod-diff? Please see the image for the best and worst scenario where dod-diff varies depending on when high evidence diagnosis happened in relation to dod-diff.

novinosrin · Posted 11-14-2018 10:56 AM

Good morning @Cruise Not familiar with the subject, so let's take the piecemeal approach and you will have to lead us

I amended Proc sql section in your code and tested for ID=1(just for one ID)

Substitute group by id,code; for group by id; if you want to group by both making it composite

(where=(id=1))

PROC SQL;
create table ERROR_FIRST_DIAGNOSIS as
select *, MIN(DOD_DIFF) AS MIN_DIFF,
                 MIN(WGT_DX) AS MIN_WGT,
				calculated MIN_DIFF* calculated MIN_WGT AS DIFF_WEIGHT
   /*SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVER*/
from HAVE(where=(id=1))
GROUP BY ID;
QUIT;

Can you confirm the below one is what you intended

SAS Output

The SAS System

SER_DATE_DX	DOD_DIFF	CODE	ID	WGT_DX	MIN_DIFF	MIN_WGT	DIFF_WEIGHT
17006	1	197	1	3	1	1	1
17028	21	197	1	3	1	1	1
17032	25	153	1	1	1	1	1
17031	24	153	1	1	1	1	1
17028	21	153	1	1	1	1	1
17006	1	153	1	1	1	1	1
17013	6	197	1	3	1	1	1
17013	6	153	1	1	1	1	1
17006	1	197	1	3	1	1	1
17028	21	197	1	3	1	1	1

novinosrin · Posted 11-14-2018 10:58 AM

Next one:

/*WEIGHTED_AVER*/
PROC SQL;
create table ERROR_FIRST_DIAGNOSIS1 as
select *,SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVER
from
(select *, MIN(DOD_DIFF) AS MIN_DIFF,
                 MIN(WGT_DX) AS MIN_WGT,
				calculated MIN_DIFF* calculated MIN_WGT AS DIFF_WEIGHT
from HAVE(where=(id=1))
GROUP BY ID)
group by id;
QUIT;

SAS Output

The SAS System

SER_DATE_DX	DOD_DIFF	CODE	ID	WGT_DX	MIN_DIFF	MIN_WGT	DIFF_WEIGHT	WEIGHTED_AVER
17006	1	197	1	3	1	1	1	1
17028	21	197	1	3	1	1	1	1
17032	25	153	1	1	1	1	1	1
17031	24	153	1	1	1	1	1	1
17028	21	153	1	1	1	1	1	1
17006	1	153	1	1	1	1	1	1
17013	6	197	1	3	1	1	1	1
17013	6	153	1	1	1	1	1	1
17006	1	197	1	3	1	1	1	1
17028	21	197	1	3	1	1	1	1

All 1s being 10/10=1

PaigeMiller · Posted 11-14-2018 11:03 AM

Here's my opinion:

Computing weighted averages in PROC SUMMARY is an awful lot easier than doing it in PROC SQL. In this case, you would have to rearrange the data so that each score is its own record. But it often happens that the data shown is the result of taking a data set where it is already in the desired format and then for presentation purposes making long rows for each patient/observation. But I admit that I haven't gone through the information provided in the original post in detail to see if PROC SUMMARY is really better than PROC SQL for this specific application.

In addition, this SQL code

SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVERAGE_SCORE

in certain situations with missing data will not produce the proper weighted average. If the variable in the numerator is missing and the variable in the denominator is not missing, or vice versa, then you get an incorrect weighted average. PROC SUMMARY does not have this deficiency.

--
Paige Miller

Cruise · Posted 11-14-2018 01:07 PM

Thanks a lot! @novinosrin, @PaigeMiller, @mkeintz

I'm working on it with all your comments back in mind.

From the interactions and the questions posted here, I admit that that I'm getting to know what I want to achieve better. This is a simulation exercise using known and unknown data. In the known data, I have both 'date of visit' and 'date of diagnosis' variables available. In the unknown data, I have only 'date of visit' (ser_date_dx) but no 'date of diagnosis'. therefore, the goal is to estimate the error using the known data and apply the algorithm associated with the least error to the unknown data.

DOD_DIFF in known data is the difference between the date of visit and the date of diagnosis , attached to this forum.

The error that I'm trying to estimate is:

- If pick the row associated with the earliest 'date of visit' (min(ser_date_dx)) and the highest weight diagnosis code (min(wgt_dx)) how much error will be produced to compare that you would have used (DOD_DIFF) which is available only in known data but not available in unknown data. Sorry for repeating seemingly simple concept here. But it helps myself to understand the problem better 🙂 to be honest.

RED: WHAT YOU CHOOSE WITH UNKNOWN DATA. BLUE: WITH KNOWN DATA

So the code below cover the part to calculate the weighted average using the date of visit (ser_date_dx). But the missing piece now is the error estimate compare to the scenario using DOD_DIFF.

PROC SQL;
create table ERROR_FIRST_DIAGNOSIS as
select *, MIN(SER_DATE_DX) AS MIN_DIFF,
          MIN(WGT_DX) AS MIN_WGT,
		  calculated MIN_DIFF*calculated MIN_WGT AS DIFF_WEIGHT
from HAVE(where=(id=4350))
GROUP BY ID;
QUIT;

Please keep questioning me. This post is only evolving with your critical thinking! I truly greatly appreciate your time!

novinosrin · Posted 11-14-2018 01:26 PM

" But the missing pieace now is the relative error to the scenario using DOD_DIFF. "

Do you have the logic for that?

Cruise · Posted 11-14-2018 01:30 PM

@novinosrin, would you please rephrase your question? It sounds important aspect that I have to understand.

novinosrin · Posted 11-14-2018 01:35 PM

Honestly, I am the one who is lagging here although it is so very interesting. I was basically trying to see where and how we can progress to the next step in code development. I assumed the logic for calculating relative error as you mentioned seemed the next. I see an explanation here

"The error that I'm trying to estimate is:

- If pick the row associated with the earliest 'date of visit' (min(ser_date_dx)) and the highest weight diagnosis code (min(wgt_dx)) how much error will be produced to compare that you would have used (DOD_DIFF) which is available only in known data but not available in unknown data. Sorry for repeating seemingly simple concept here. But it helps myself to understand the problem better 🙂 to be honest. "

However I haven't comprehended that enough to translate to a code. If Paige, Mark or somebody could push that understanding further, my fingers are waiting

Cruise · Posted 11-14-2018 01:47 PM

I attempted to visualize the logic where error is produced in the image attached. The error is produced between two scenarios: red and blue. Most often times, you end up picking the same row based on the date of visit (ser_date_dx) and the highest evidence diagnosis (min(WGT_DX)) even if you don't know the difference between the date of visit and date of diagnosis which is DOD_DIFF. However, you're potential to choose different row if your decision was based on the DOD_DIFF and the highest weight diagnosis (min(WGT_DX)) in some cases. So, the question becomes, how much error are you making if you make your decision based on the best available information to you which are data of visit (ser_date_dx) and min(WGT_DX) because you did not know the true value of DOD_DIFF. Unfortunately, my understanding is limited to a conceptualization but not coding. I hope the experts you mentioned would reach out to this post and share their insights too. Thanks for asking! Bear with me please.

mkeintz · Posted 11-14-2018 02:44 PM

@Cruise

Here's what I understand you to be saying:

In the absence of DOD_DIFF, then for each ID:
1. choose the earliest data of visit
2. If there are multiple records for that data, choose the highest WGT_DX, call is WGT_DX_PARTIAL
3. You want be one such value per ID. (There seems to be no reference to the variable CODE in your numeric example).
In the presence of DOD_DIFF, for each ID:
1. choose the minimum DOD_DIFF
2. If there are multiple records for that value of DOD_DIFF< choose the highest WGT_DX, call it WGT_DX_FULL
3. Again you want one such value per ID
I presume your measurement of error, is just WGT_DX_ERROR= WGT_DX_FULL - WGT_DX_PARTIAL. Again this is one value per ID, correct?
So dos this mean you simply want to regress WGT_DX_FULL on WGT_DX_PARTIAL? If so, you don't need table 2, just table 1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Cruise · Posted 11-14-2018 03:49 PM

@mkeintz, @novinosrin, @PaigeMiller

Mkeintz, Thank you for your amazingly systematic presentation of the problem in clear subsequent logics. You also had coined in a new idea that I have not seen for possibility. That is to use two parallel scenario: absence vs presence of DOD_DIFF using the 'known' data. And also suggesting an alternative way to look at a measurement error. Please, hoping your time allows, review my responses in red in the text below.

In the absence of DOD_DIFF, then for each ID:
1. Choose the earliest data of visit
2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
3. You want be one such value per ID
4. (There seems to be no reference to the variable CODE in your numeric example) The derivation of WGT_DX is following:
  IF CODE IN ('153') THEN WGT_DX=1; ELSE
  
  IF CODE IN ('154') THEN WGT_DX=1; ELSE
  
  IF CODE IN ('159') THEN WGT_DX=2; ELSE
  
  IF CODE IN ('197') THEN WGT_DX=3; ELSE
  
  IF CODE IN ('199') THEN WGT_DX=3; ELSE
  
  IF CODE IN ('209') THEN WGT_DX=5; ELSE
  
  IF CODE IN ('211') THEN WGT_DX=6; ELSE
  
  IF CODE IN ('230') THEN WGT_DX=2; ELSE
  
  IF CODE IN ('235') THEN WGT_DX=3; ELSE
  
  IF CODE IN ('239') THEN WGT_DX=3; ELSE
  
  IF CODE IN ('V10') THEN WGT_DX=4; ELSE
  
  IF CODE IN ('V76') THEN WGT_DX=4;
In the presence of DOD_DIFF, for each ID:
1. Choose the minimum DOD_DIFF
2. If there are multiple records for that value of DOD_DIFF< choose the lowest WGT_DX, call it WGT_DX_FULL
3. Again you want one such value per ID
I presume your measurement of error, is just WGT_DX_ERROR= WGT_DX_FULL - WGT_DX_PARTIAL. Again this is one value per ID, correct? CORRECT. However, for the next step, I will have to quantify the error associated with each diagnostic CODE (diagnosis, in this case, 12 distinct diagnosis codes, as listed above. For example, 153 (colon cancer) and 154(rectum cancer) are the strongest evidence to suggest a colorectal cancer. However, how much error would be associated with each codes in terms of time factor regardless the level of evidence to lead to a true diagnosis. But for now, I'm focusing on the error per patient due to DOD_DIFF known vs unknown scenarios.
- Regarding WGT_DX_ERROR= WGT_DX_FULL - WGT_DX_PARTIAL. I agree, the difference between weights could be a proxy for an error. However, would it be considered weighted? Also, WGT_DX is monotonic arbitrary numbers that I have assigned to each diagnosis CODE myself based on my personal judgment as to how far/close the given diagnosis CODE for a colorectal cancer, where 153 and 154 being the correct reference. With that said, how about using the difference between 'date of visit' that is corresponded with absence vs presence of DOD_DIFF scenarios respectively and multiply by the WGT_DX_ERROR you suggested? to have time factor weighed in the error estimate? Please correct me, if it doesn't make sense. The error produced here is not only the compromise to the correct diagnosis but also a time-factor. At the end of this project, we're aiming for the best value to replace a missing 'date of diagnosis' in the unknown data.
So this mean you simply want to regress WGT_DX_FULL on WGT_DX_PARTIAL? Are you suggesting a linear regression between these two continuous variables and look at the estimates? It sounds very interesting, do you have time to elaborate on this a little more? If so, you don't need table 2, just table 1. Do you mind to clarify which table are you referring to?

This is a rewarding learning experience. I'm so excited to continuously learning from you guys!

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away