Solved: Calculate error with variable known and unknown scenarios using repeat...

Cruise · Posted 11-15-2018 10:59 AM

I'd like to calculate the error produced due to known and unknown status of a variable called DOD_DIFF when certain conditions looked after. Please see the image and the text underneath for better understanding.

What I wanted to achieve SAS code wise is following:

A. In the absence of DOD_DIFF:

Choose the earliest date of visit
If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
Pick one row per patient when these conditions meet

B. In the presence of DOD_DIFF:

Choose the minimum DOD_DIFF
If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
Pick one row per patient when these conditions meet

C. Calculate measurement error

DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL
DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
ERROR= DIFF_WGT_DX* DOD_DIFF

At the end, each patient has one ERROR value.

Thank you very much for your help in advance. The demo data is attached to this post.

This problem was previously discussed and conceptually solved in the post below thanks to @mkeintz and @novinosrin. And I hops this post will lead to a solution for a SAS programming.

https://communities.sas.com/t5/SAS-Programming/Calculate-weighted-average-across-rows-with-condition...

novinosrin · Posted 11-15-2018 05:37 PM

Take a look at these and test it on the sample you posted for just one id, i did for ID=1

/*WGT_DX_FULL*/

proc sql;
create table WGT_DX_FULL as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select  *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(wgt_dx)=wgt_dx;
quit;


/*WGT_DX_PARTIAL*/

proc sql;
create table WGT_DX_PARTIAL as
select distinct *,min(WGT_DX) as WGT_DX_PARTIAL
from
(select  *, min(SER_DATE_DX) as min_SER_DATE_DX
from have1(where=(id=1))
group by id
having SER_DATE_DX=min_SER_DATE_DX)
group by id, min_SER_DATE_DX
having min(wgt_dx)=wgt_dx;
quit;

Now, this should help you get the hang of where I am at.

WGT_DX_FULL uses DOD_DIFF

WGT_DX_PARTIAL uses SER_DATE_DX

and we take it from here

View solution in original post

ballardw · Posted 11-15-2018 11:34 AM

Since your data picture shows values for DOD_DIF I think it might help to describe what constitutes "DOD_DIF is unknown" or "absence of DOD_DIF"

It may also help to apply an actual SAS date format to the SER_DATE_DX value

Cruise · Posted 11-15-2018 11:45 AM

DOD_DIFF is the number of days between date of diagnosis and the date of visit for all visits of each patients. However, date of diagnosis is not always known. The data I'm using in this exercise has no missing in date of diagnosis and date of visit. Because the observations with missing in date of diagnosis was deleted to create this dataset with complete DOD_DIFF. Currently, in practice, when DOD_DIFF is not known, aka, absent, we go for the row associated with the earliest date of visit (min(SER_DATE_DX)) and the lowest error diagnosis code (min(WGT_DX)). However, I'd like to estimate how much error are we inducing by using this logic. The idea here is to estimate the error due to two different scenarios where DOD_DIFF known/present and DOD_DIFF not known/absent. In other words, DOD_DIFF not known/absent is a hypothetical scenario in this exercise since we have true DOD_DIFF in the data attached to this post.

DOD_DIFF=SER_DATE_DX-DATE_OF_DIAGNOSIS

ballardw · Posted 11-15-2018 12:36 PM

@Cruise wrote:

DOD_DIFF is the number of days between date of diagnosis and the date of visit for all visits of each patients. However, date of diagnosis is not always known. The data I'm using in this exercise has no missing in date of diagnosis and date of visit. Because the observations with missing in date of diagnosis was deleted to create this dataset with complete DOD_DIFF. Currently, in practice, when DOD_DIFF is not known, aka, absent, we go for the row associated with the earliest date of visit (min(SER_DATE_DX)) and the lowest error diagnosis code (min(WGT_DX)). However, I'd like to estimate how much error are we inducing by using this logic. The idea here is to estimate the error due to two different scenarios where DOD_DIFF known/present and DOD_DIFF not known/absent. In other words, DOD_DIFF not known/absent is a hypothetical scenario in this exercise since we have true DOD_DIFF in the data attached to this post.

DOD_DIFF=SER_DATE_DX-DATE_OF_DIAGNOSIS

So "not known" = the variable has a missing value???

Your picture and comment seemed to indicate that the top circled row was on of the records that had "absent" data but showed an actual value for DOD_DIFF. So I asked for clarification as to what would indicate "absent".

"Absent" is not a SAS concept in general, that is a user concept. I have data that has specific codes that indicate specific forms of "not recorded" "out of range" "biologically implausible values" and other indications that the value should not be used depending on the specific data source. I create SAS variables with missing values when appropriate to avoid calculations with the coded values. So you still have not quite defined "absent" in terms of what appears in the data. The SAS term might be "missing value" as SAS has missing as a value. It does not have "not known". When I talk to an SQL programmer I do not use "missing" because generally that is not the correct terminology; NULL however would be.

One reason to be specific is that SAS has functions to test for missing values such a MISSING. So the code might include

if missing (DOD_DIFF) then do ...

which would be different from

if DOD_DIFF = <somecode> then do ..

It is also extremely hard to code with pictures of data.

Cruise · Posted 11-15-2018 12:47 PM

@ballardw

I understand. Sorry for a confusion. Thanks for explaining how missing/unknown/null can be deferentially understood in different programming settings.

In this exercise, we are using data with complete DOD_DIFF. However, for the purpose to estimate the measurement error in a hypothetical situation when DOD_DIFF is not available (A), we assume that we don't have any information for DOD_DIFF. Therefore, 'date of visit' (ser_date_dx) will be used instead DOD_DIFF. The rows will be picked in Scenario A and B (DOD_DIFF is available ) as following:

What I wanted to achieve is following:

A. Without using DOD_DIFF:

Choose the earliest date of visit
If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
Pick one row per patient when these conditions meet

B. Using DOD_DIFF:

Choose the minimum DOD_DIFF
If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
Pick one row per patient when these conditions meet

C. Calculate measurement error

DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL
DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
ERROR= DIFF_WGT_DX* DOD_DIFF

At the end, each patient has one ERROR value.

Does it make clearer? Would you suggest me to delete the image attached if that will cause a lot of confusion in this forum.

Thanks for asking!

novinosrin · Posted 11-15-2018 02:38 PM

Hi @Cruise Ok I have just begin on piecemeal approach starting with

B. Using DOD_DIFF:

Choose the minimum DOD_DIFF
If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
Pick one row per patient when these conditions meet
Test with ID=1 --Initial data examination
```
proc sql;
create table want as
select  *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF
order by id ,DOD_DIFF, wgt_dx;
quit;
proc print noobs;run;
```
SAS Output--> Can you explain what if the WGT_DX=3 was the min here. Is it just select distinct as they are identical?

SER_DATE_DX DOD_DIFF CODE ID WGT_DX min_DOD_DIFF

24JUL2006 1 153 1 1 1

24JUL2006 1 197 1 3 1

24JUL2006 1 197 1 3 1

There apparently is a tie in WGT_DX within the group of min_DOD_DIFF. In this case we are anyway gonna pick WGT_DX of 1.

Cruise · Posted 11-15-2018 02:52 PM

1. Can you explain what if the WGT_DX=3 was the min here. Is it just select distinct as they are identical?

Yes, select distinct as they are identical. Because this is an insurance claim data these static information such as CODE/WGT_DX is repeated every time they enter in other health care related information.

IF CODE IN ('153') THEN WGT_DX=1; ELSE    
IF CODE IN ('154') THEN WGT_DX=1; ELSE   
IF CODE IN ('159') THEN WGT_DX=2; ELSE  
IF CODE IN ('197') THEN WGT_DX=3; ELSE  
IF CODE IN ('199') THEN WGT_DX=3; ELSE  
IF CODE IN ('209') THEN WGT_DX=5; ELSE  
IF CODE IN ('211') THEN WGT_DX=6; ELSE  
IF CODE IN ('230') THEN WGT_DX=2; ELSE  
IF CODE IN ('235') THEN WGT_DX=3; ELSE  
IF CODE IN ('239') THEN WGT_DX=3; ELSE  
IF CODE IN ('V10') THEN WGT_DX=4; ELSE  
IF CODE IN ('V76') THEN WGT_DX=4;

2. There apparently is a tie in WGT_DX within the group of min_DOD_DIFF. In this case we are anyway gonna pick WGT_DX of 1.

WGT_DX is a monotonic variable based on the CODE. DOD_DIFF=date of visit-date of diagnosis. Therefore, the diagnosis can be tied to the date of diagnosis which is reflected in the DOD_DIFF. However, should not be one-to-one tie.

Cruise · Posted 11-15-2018 02:53 PM

Thank you so much for getting back to this forum.

novinosrin · Posted 11-15-2018 04:02 PM

B. Using DOD_DIFF is pretty straight forward

Testing (where=(id=1))

proc sql;
create table B as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select  *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(wgt_dx)=wgt_dx;
quit;

For A

"In this exercise, we are using data with complete DOD_DIFF. However, for the purpose to estimate the measurement error in a hypothetical situation when DOD_DIFF is not available (A), we assume that we don't have any information for DOD_DIFF."

DOD_DIFF would be . (missing) for all records for a patient ID?

Which means min(DOD_DIFF )=. and so WGT_DX_FULL should be assinged 0 or . (missing) ?

So with N function(to check for number of non missing values,i.e if all missing returns 0, If I infuse a logic check like

select distinct *,min(WGT_DX)*n(DOD_DIFF)>0 as WGT_DX_FULL /*notice the highlighted n(DOD_DIFF)>0 results in 1 when there is atleast 1 non missing value and 0 if all are missings*/

So when 0, min(WGT_DX) *0 as WGT_DX_FULL would result in zero

And when 1,min(WGT_DX) *1 =min(WGT_DX) as WGT_DX_FULL

And then to get to this

DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL When WGT_DX_FULL has a non zero value, WGT_DX_PARTIAL will have to be missing or zero and vice versaThis is where it stands as of now

Cruise · Posted 11-15-2018 04:05 PM

Hi, in a hypothetical situation assuming DOD_DIFF is not available, we don't use DOD_DIFF at all. Because we're assuming DOD_DIFF doesn't exist. Instead DOD_DIFF, for scenario B, we use date of visit (ser_date-dx).

A. Without using DOD_DIFF (where we assume DOD_DIFF is not available. But we have 'date of visit' (ser_date_dx) instead.

Choose the earliest date of visit (ser_date_dx)
If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
Pick one row per patient when these conditions meet

novinosrin · Posted 11-15-2018 04:09 PM

Ah, DOD_DIFF doesn't exist makes it holy lot easier. I was assuming absence is missing. See over analyzing it.

Ok at dunkin for a quick coffee, I have my code ready in my mind though. brb at the soonest convenience

Cruise · Posted 11-15-2018 04:10 PM

Oh so sorry that I wasn't consistent in wordings. Ballardw warned me on that. Enjoy your coffee. I wish I bought you one 🙂

novinosrin · Posted 11-15-2018 04:19 PM

Hmm sorry again, there is still some ambiguity imo,

Assuming DOD_DIFF is not available ,WGT_DX_FULL will not be applicable in the subtraction

C. Calculate measurement error

DIFF_WGT_DX= WGT_DX_FULL(is true when DOD_DIFF exist,else missing) - WGT_DX_PARTIAL(is true always because of SER_DATE_DX always available) /*To arrive at this equation is not clear*/
DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
ERROR= DIFF_WGT_DX* DOD_DIFF

Cruise · Posted 11-15-2018 04:42 PM

C. Calculate measurement error

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, and you also have a column WGT_DX_PARTIAL calculated using SER_DATE_DX. At this point, every patients is associated with WGT_DX_FULL and WGT_DX_PARTIAL where we will calculate the difference between: ABS(DIFF(WGT_DX_FULL-WGT_DX_PARTIAL)).

2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)

You selected 'SER_DATE_DX' using DOD_DIFF. You also selected SER_DATE_DX under assumption of DOD_DIFF not available. These two selected SER_DATE_DX may or may not overlap. We're interested in the difference in days between these two dates.

3. ERROR= DIFF_WGT_DX* DIFF_DATE.

I'm curious what you would say regarding zero values generated for DIFF_WGT_DX and DIFF_DATE. The goal here is to quantify the error. However, zero values of both DIFF_WGT_DX and DIFF_DATE or one of these two might cancel to get quantity from ERROR= DIFF_WGT_DX* DIFF_DATE equation.

Therefore, shall we shift the calculated difference to the right by one unit so we don't deal with zero?

novinosrin · Posted 11-15-2018 05:14 PM

C. Calculate measurement error

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, and you also have a column WGT_DX_PARTIAL calculated using SER_DATE_DX. At this point, every patients is associated with WGT_DX_FULL and WGT_DX_PARTIAL where we will calculate the difference between: ABS(DIFF(WGT_DX_FULL-WGT_DX_PARTIAL)).

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, How is this possible if DOD_DIFF does not exist (A situation)

So if DOD_DIFF does not exist , the potential to calculate WGT_DX_FULL doesn;t exist either , isn't that right? so what would you fill in for WGT_DX_FULL in the equation DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL ?

A. Without using DOD_DIFF:

Choose the earliest date of visit
If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
Pick one row per patient when these conditions meet

B. Using DOD_DIFF:

Choose the minimum DOD_DIFF
If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
Pick one row per patient when these conditions meet

Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Re: Calculate error with variable known and unknown scenarios using repeated measurement data

SER_DATE_DX	DOD_DIFF	CODE	ID	WGT_DX	min_DOD_DIFF
24JUL2006	1	153	1	1	1
24JUL2006	1	197	1	3	1
24JUL2006	1	197	1	3	1

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!