I have two different datasets which represents patients who have been diagnosed with colon or rectum cancer (colorectal cancer).
The first data (attached to this post) has following four variables (date of diagnosis known thus interval variable DOD_DIFF was calculated):
The second data:
Same variables but DOD_DIFF is not known due to missing in date of diagnosis.
My goal is to answer a question: if you pick the first diagnosis associated with the highest weight ‘WGT_DX’ how much error will you produce based on DOD_DIFF variable?
Therefore, I want to estimate this error using the first data with known information by picking the first diagnosis associated with the highest weight ‘WGT_DX’, before applying this approach to the data with unknown date of diagnosis.
To serve this purpose, I’d like to calculate weighted average score for each patient and assess the distribution of WEIGHTED_AVERAGE_SCORE for the entire population afterwards.
The logic used to calculate a weighted average error for each patient is following:
MIN(DOD_DIFF) AS MIN_DIFF
MIN(WGT_DX) AS MIN_WGT,
MIN_DIFF*MIN_WGT AS DIFF_WEIGHT,
SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVERAGE_SCORE PER PATIENT
I really hope this makes sense. Please let me know if not!
The SAS codes used is below which led me to an error:
DATA HAVE(KEEP=WGT_DX ID CODE DOD_DIFF SER_DATE_DX); SET HAVE; /*N_ROWS=142,598, N_DISTINCT_PATIENTS=6,375*/
IF CODE IN ('153') THEN WGT_DX=1; ELSE
IF CODE IN ('154') THEN WGT_DX=1; ELSE
IF CODE IN ('159') THEN WGT_DX=2; ELSE
IF CODE IN ('197') THEN WGT_DX=3; ELSE
IF CODE IN ('199') THEN WGT_DX=3; ELSE
IF CODE IN ('209') THEN WGT_DX=5; ELSE
IF CODE IN ('211') THEN WGT_DX=6; ELSE
IF CODE IN ('230') THEN WGT_DX=2; ELSE
IF CODE IN ('235') THEN WGT_DX=3; ELSE
IF CODE IN ('239') THEN WGT_DX=3; ELSE
IF CODE IN ('V10') THEN WGT_DX=4; ELSE
IF CODE IN ('V76') THEN WGT_DX=4;
RUN;
PROC SORT DATA=HAVE;
BY ID CODE SER_DATE_DX;
FORMAT SER_DATE_DX DATE9.;
RUN;
PROC SQL;
create table ERROR_FIRST_DIAGNOSIS as
select CODE, ID, MIN(DOD_DIFF) AS MIN_DIFF,
MIN(WGT_DX) AS MIN_WGT,
MIN_DIFF*MIN_WGT AS DIFF_WEIGHT,
SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVER
from HAVE
GROUP BY ID;
QUIT;
PROC UNIVARIATE DATA=ERROR_FIRST_DIAGNOSIS;
VAR WEIGHTED_AVER;
HISTOGRAM;
RUN;
817 PROC SQL;
818 create table P.ERROR_FIRST_DIAGNOSIS as
819 select CODE, ID, MIN(DOD_DIFF) AS MIN_DIFF,
820 MIN(WGT_DX) AS MIN_WGT,
821 MIN_DIFF*MIN_WGT AS DIFF_WEIGHT,
822 SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVER
823 from P.HAVE
824 GROUP BY ID;
ERROR: The SUM summary function requires a numeric argument.
ERROR: The SUM summary function requires a numeric argument.
ERROR: The following columns were not found in the contributing tables: DIFF_WEIGHT, MIN_DIFF,
MIN_WGT.
825 QUIT;
Here's what I understand you to be saying:
This post was inspired by my previous post that you guys had kindly responded.
Hi @Cruise Good evening, I am afraid I am about to run to catch red line CTA to get back home(Chicago). I will certainly look into it tomorrow soon as I get back to my lab, but I am pretty sure somebody else will respond tonight. Have a good one!
@Cruise wrote:
To serve this purpose, I’d like to calculate weighted average score for each patient and assess the distribution of WEIGHTED_AVERAGE_SCORE for the entire population afterwards.
The logic used to calculate a weighted average error for each patient is following:
MIN(DOD_DIFF) AS MIN_DIFF
MIN(WGT_DX) AS MIN_WGT,
MIN_DIFF*MIN_WGT AS DIFF_WEIGHT,
SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVERAGE_SCORE PER PATIENT
I really hope this makes sense. Please let me know if not!
I'm a little confused. Is this correct?
MIN(DOD_DIFF) AS MIN_DIFF is a constant within a given ID
MIN(WGT_DX) AS MIN_WGT is also a constant within a given ID
MIN_DIFF*MIN_WGT AS DIFF_WEIGHT is therefore also a constant within a given ID
So wouldn't
SUM(DIFF_WEIGHT)/SUM(MIN_WGT) be same ratio as DIFF_WEIGHT/MIN_WGT?
which by definition would end up being
(MIN_DIFF*MIN_WGT)/MIN_WGT = MIN_DIFF?
Good morning @Cruise Not familiar with the subject, so let's take the piecemeal approach and you will have to lead us
I amended Proc sql section in your code and tested for ID=1(just for one ID)
Substitute group by id,code; for group by id; if you want to group by both making it composite
(where=(id=1))
PROC SQL;
create table ERROR_FIRST_DIAGNOSIS as
select *, MIN(DOD_DIFF) AS MIN_DIFF,
MIN(WGT_DX) AS MIN_WGT,
calculated MIN_DIFF* calculated MIN_WGT AS DIFF_WEIGHT
/*SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVER*/
from HAVE(where=(id=1))
GROUP BY ID;
QUIT;
Can you confirm the below one is what you intended
SAS Output
The SAS System |
SER_DATE_DX | DOD_DIFF | CODE | ID | WGT_DX | MIN_DIFF | MIN_WGT | DIFF_WEIGHT |
---|---|---|---|---|---|---|---|
17006 | 1 | 197 | 1 | 3 | 1 | 1 | 1 |
17028 | 21 | 197 | 1 | 3 | 1 | 1 | 1 |
17032 | 25 | 153 | 1 | 1 | 1 | 1 | 1 |
17031 | 24 | 153 | 1 | 1 | 1 | 1 | 1 |
17028 | 21 | 153 | 1 | 1 | 1 | 1 | 1 |
17006 | 1 | 153 | 1 | 1 | 1 | 1 | 1 |
17013 | 6 | 197 | 1 | 3 | 1 | 1 | 1 |
17013 | 6 | 153 | 1 | 1 | 1 | 1 | 1 |
17006 | 1 | 197 | 1 | 3 | 1 | 1 | 1 |
17028 | 21 | 197 | 1 | 3 | 1 | 1 | 1 |
Next one:
/*WEIGHTED_AVER*/
PROC SQL;
create table ERROR_FIRST_DIAGNOSIS1 as
select *,SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVER
from
(select *, MIN(DOD_DIFF) AS MIN_DIFF,
MIN(WGT_DX) AS MIN_WGT,
calculated MIN_DIFF* calculated MIN_WGT AS DIFF_WEIGHT
from HAVE(where=(id=1))
GROUP BY ID)
group by id;
QUIT;
SAS Output
The SAS System |
SER_DATE_DX | DOD_DIFF | CODE | ID | WGT_DX | MIN_DIFF | MIN_WGT | DIFF_WEIGHT | WEIGHTED_AVER |
---|---|---|---|---|---|---|---|---|
17006 | 1 | 197 | 1 | 3 | 1 | 1 | 1 | 1 |
17028 | 21 | 197 | 1 | 3 | 1 | 1 | 1 | 1 |
17032 | 25 | 153 | 1 | 1 | 1 | 1 | 1 | 1 |
17031 | 24 | 153 | 1 | 1 | 1 | 1 | 1 | 1 |
17028 | 21 | 153 | 1 | 1 | 1 | 1 | 1 | 1 |
17006 | 1 | 153 | 1 | 1 | 1 | 1 | 1 | 1 |
17013 | 6 | 197 | 1 | 3 | 1 | 1 | 1 | 1 |
17013 | 6 | 153 | 1 | 1 | 1 | 1 | 1 | 1 |
17006 | 1 | 197 | 1 | 3 | 1 | 1 | 1 | 1 |
17028 | 21 | 197 | 1 | 3 | 1 | 1 | 1 | 1 |
All 1s being 10/10=1
Here's my opinion:
Computing weighted averages in PROC SUMMARY is an awful lot easier than doing it in PROC SQL. In this case, you would have to rearrange the data so that each score is its own record. But it often happens that the data shown is the result of taking a data set where it is already in the desired format and then for presentation purposes making long rows for each patient/observation. But I admit that I haven't gone through the information provided in the original post in detail to see if PROC SUMMARY is really better than PROC SQL for this specific application.
In addition, this SQL code
SUM(DIFF_WEIGHT)/SUM(MIN_WGT) as WEIGHTED_AVERAGE_SCORE
in certain situations with missing data will not produce the proper weighted average. If the variable in the numerator is missing and the variable in the denominator is not missing, or vice versa, then you get an incorrect weighted average. PROC SUMMARY does not have this deficiency.
Thanks a lot! @novinosrin, @PaigeMiller, @mkeintz
I'm working on it with all your comments back in mind.
From the interactions and the questions posted here, I admit that that I'm getting to know what I want to achieve better. This is a simulation exercise using known and unknown data. In the known data, I have both 'date of visit' and 'date of diagnosis' variables available. In the unknown data, I have only 'date of visit' (ser_date_dx) but no 'date of diagnosis'. therefore, the goal is to estimate the error using the known data and apply the algorithm associated with the least error to the unknown data.
DOD_DIFF in known data is the difference between the date of visit and the date of diagnosis , attached to this forum.
The error that I'm trying to estimate is:
- If pick the row associated with the earliest 'date of visit' (min(ser_date_dx)) and the highest weight diagnosis code (min(wgt_dx)) how much error will be produced to compare that you would have used (DOD_DIFF) which is available only in known data but not available in unknown data. Sorry for repeating seemingly simple concept here. But it helps myself to understand the problem better 🙂 to be honest.
So the code below cover the part to calculate the weighted average using the date of visit (ser_date_dx). But the missing piece now is the error estimate compare to the scenario using DOD_DIFF.
PROC SQL;
create table ERROR_FIRST_DIAGNOSIS as
select *, MIN(SER_DATE_DX) AS MIN_DIFF,
MIN(WGT_DX) AS MIN_WGT,
calculated MIN_DIFF*calculated MIN_WGT AS DIFF_WEIGHT
from HAVE(where=(id=4350))
GROUP BY ID;
QUIT;
Please keep questioning me. This post is only evolving with your critical thinking! I truly greatly appreciate your time!
" But the missing pieace now is the relative error to the scenario using DOD_DIFF. "
Do you have the logic for that?
@novinosrin, would you please rephrase your question? It sounds important aspect that I have to understand.
Honestly, I am the one who is lagging here although it is so very interesting. I was basically trying to see where and how we can progress to the next step in code development. I assumed the logic for calculating relative error as you mentioned seemed the next. I see an explanation here
"The error that I'm trying to estimate is:
- If pick the row associated with the earliest 'date of visit' (min(ser_date_dx)) and the highest weight diagnosis code (min(wgt_dx)) how much error will be produced to compare that you would have used (DOD_DIFF) which is available only in known data but not available in unknown data. Sorry for repeating seemingly simple concept here. But it helps myself to understand the problem better 🙂 to be honest. "
However I haven't comprehended that enough to translate to a code. If Paige, Mark or somebody could push that understanding further, my fingers are waiting
Here's what I understand you to be saying:
@mkeintz, @novinosrin, @PaigeMiller
Mkeintz, Thank you for your amazingly systematic presentation of the problem in clear subsequent logics. You also had coined in a new idea that I have not seen for possibility. That is to use two parallel scenario: absence vs presence of DOD_DIFF using the 'known' data. And also suggesting an alternative way to look at a measurement error. Please, hoping your time allows, review my responses in red in the text below.
IF CODE IN ('153') THEN WGT_DX=1; ELSE
IF CODE IN ('154') THEN WGT_DX=1; ELSE
IF CODE IN ('159') THEN WGT_DX=2; ELSE
IF CODE IN ('197') THEN WGT_DX=3; ELSE
IF CODE IN ('199') THEN WGT_DX=3; ELSE
IF CODE IN ('209') THEN WGT_DX=5; ELSE
IF CODE IN ('211') THEN WGT_DX=6; ELSE
IF CODE IN ('230') THEN WGT_DX=2; ELSE
IF CODE IN ('235') THEN WGT_DX=3; ELSE
IF CODE IN ('239') THEN WGT_DX=3; ELSE
IF CODE IN ('V10') THEN WGT_DX=4; ELSE
IF CODE IN ('V76') THEN WGT_DX=4;
This is a rewarding learning experience. I'm so excited to continuously learning from you guys!
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.