Ammonite | Level 13

## Calculate error with variable known and unknown scenarios using repeated measurement data

I'd like to calculate the error produced due to known and unknown status of a variable called DOD_DIFF when certain conditions looked after. Please see the image and the text underneath for better understanding.

What I wanted to achieve SAS code wise is following:

A. In the absence of DOD_DIFF:

1. Choose the earliest date of visit
2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
3. Pick one row per patient when these conditions meet

B. In the presence of DOD_DIFF:

1. Choose the minimum DOD_DIFF
2. If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
3. Pick one row per patient when these conditions meet

C. Calculate measurement error

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL
2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
3. ERROR= DIFF_WGT_DX* DOD_DIFF

At the end, each patient has one ERROR value.

Thank you very much for your help in advance. The demo data is attached to this post.

This problem was previously discussed and conceptually solved in the post below thanks to @mkeintz and @novinosrin. And I hops this post will lead to a solution for a SAS programming.

https://communities.sas.com/t5/SAS-Programming/Calculate-weighted-average-across-rows-with-condition...

1 ACCEPTED SOLUTION

Accepted Solutions
Tourmaline | Level 20

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Take a look at these and test it on the sample you posted for just one id, i did for ID=1

``````/*WGT_DX_FULL*/

proc sql;
create table WGT_DX_FULL as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select  *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(wgt_dx)=wgt_dx;
quit;

/*WGT_DX_PARTIAL*/

proc sql;
create table WGT_DX_PARTIAL as
select distinct *,min(WGT_DX) as WGT_DX_PARTIAL
from
(select  *, min(SER_DATE_DX) as min_SER_DATE_DX
from have1(where=(id=1))
group by id
having SER_DATE_DX=min_SER_DATE_DX)
group by id, min_SER_DATE_DX
having min(wgt_dx)=wgt_dx;
quit;``````

Now, this should help you get the hang of  where I am at.

``WGT_DX_FULL uses DOD_DIFF``
``WGT_DX_PARTIAL uses SER_DATE_DX``

and we take it from here

43 REPLIES 43
Super User

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Since your data picture shows values for DOD_DIF I think it might help to describe what constitutes "DOD_DIF is unknown" or "absence of DOD_DIF"

It may also help to apply an actual SAS date format to the SER_DATE_DX value

Ammonite | Level 13

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

DOD_DIFF is the number of days between date of diagnosis and the date of visit for all visits of each patients. However, date of diagnosis is not always known. The data I'm using in this exercise has no missing in date of diagnosis and date of visit. Because the observations with missing in date of diagnosis was deleted to create this dataset with complete DOD_DIFF. Currently, in practice, when DOD_DIFF is not known, aka, absent, we go for the row associated with the earliest date of visit (min(SER_DATE_DX)) and the lowest error diagnosis code (min(WGT_DX)). However, I'd like to estimate how much error are we inducing by using this logic. The idea here is to estimate the error due to two different scenarios where DOD_DIFF known/present and DOD_DIFF not known/absent. In other words, DOD_DIFF not known/absent is a hypothetical scenario in this exercise since we have true DOD_DIFF in the data attached to this post.

DOD_DIFF=SER_DATE_DX-DATE_OF_DIAGNOSIS

Super User

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

@Cruise wrote:

DOD_DIFF is the number of days between date of diagnosis and the date of visit for all visits of each patients. However, date of diagnosis is not always known. The data I'm using in this exercise has no missing in date of diagnosis and date of visit. Because the observations with missing in date of diagnosis was deleted to create this dataset with complete DOD_DIFF. Currently, in practice, when DOD_DIFF is not known, aka, absent, we go for the row associated with the earliest date of visit (min(SER_DATE_DX)) and the lowest error diagnosis code (min(WGT_DX)). However, I'd like to estimate how much error are we inducing by using this logic. The idea here is to estimate the error due to two different scenarios where DOD_DIFF known/present and DOD_DIFF not known/absent. In other words, DOD_DIFF not known/absent is a hypothetical scenario in this exercise since we have true DOD_DIFF in the data attached to this post.

DOD_DIFF=SER_DATE_DX-DATE_OF_DIAGNOSIS

So "not known" = the variable has a missing value???

Your picture and comment seemed to indicate that the top circled row was on of the records that had "absent" data but showed an actual value for DOD_DIFF. So I asked for clarification as to what would indicate "absent".

"Absent" is not a SAS concept in general, that is a user concept. I have data that has specific codes that indicate specific forms of "not recorded" "out of range" "biologically implausible values" and other indications that the value should not be used depending on the specific data source. I create SAS variables with missing values when appropriate to avoid calculations with the coded values. So you still have not quite defined "absent" in terms of what appears in the data. The SAS term might be "missing value" as SAS has missing as a value. It does not have "not known". When I talk to an SQL programmer I do not use "missing" because generally that is not the correct terminology; NULL however would be.

One reason to be specific is that SAS has functions to test for missing values such a MISSING. So the code might include

`if missing (DOD_DIFF) then do ...`

which would be different from

`if DOD_DIFF = <somecode> then do ..`

It is also extremely hard to code with pictures of data.

Ammonite | Level 13

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

@ballardw

I understand. Sorry for a confusion. Thanks for explaining how missing/unknown/null can be deferentially understood in different programming settings.

In this exercise, we are using data with complete DOD_DIFF. However, for the purpose to estimate the measurement error in a hypothetical situation when DOD_DIFF is not available (A), we assume that we don't have any information for DOD_DIFF. Therefore, 'date of visit' (ser_date_dx) will be used instead DOD_DIFF. The rows will be picked in Scenario A and B (DOD_DIFF is available ) as following:

What I wanted to achieve is following:

A. Without using DOD_DIFF:

1. Choose the earliest date of visit
2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
3. Pick one row per patient when these conditions meet

B. Using DOD_DIFF:

1. Choose the minimum DOD_DIFF
2. If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
3. Pick one row per patient when these conditions meet

C. Calculate measurement error

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL
2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
3. ERROR= DIFF_WGT_DX* DOD_DIFF

At the end, each patient has one ERROR value.

Does it make clearer? Would you suggest me to delete the image attached if that will cause a lot of confusion in this forum.

Tourmaline | Level 20

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Hi @Cruise Ok I have just begin on piecemeal approach starting with

1. B. Using DOD_DIFF:

Choose the minimum DOD_DIFF
If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
Pick one row per patient when these conditions meet

Test with  ID=1 --Initial data examination
2. ``````
proc sql;
create table want as
select  *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF
order by id ,DOD_DIFF, wgt_dx;
quit;
proc print noobs;run;``````
SAS Output--> Can you explain what if the WGT_DX=3 was the min here. Is it just select distinct as they are identical?
SER_DATE_DX DOD_DIFF CODE ID WGT_DX min_DOD_DIFF
24JUL2006 1 153 1 1 1
24JUL2006 1 197 1 3 1
24JUL2006 1 197 1 3 1

There apparently is a tie in WGT_DX within the group of min_DOD_DIFF. In this case we are anyway gonna pick WGT_DX of 1.

Ammonite | Level 13

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

1. Can you explain what if the WGT_DX=3 was the min here. Is it just select distinct as they are identical?

Yes, select distinct as they are identical. Because this is an insurance claim data these static information such as CODE/WGT_DX is       repeated every time they enter in other health care related information.

``````IF CODE IN ('153') THEN WGT_DX=1; ELSE
IF CODE IN ('154') THEN WGT_DX=1; ELSE
IF CODE IN ('159') THEN WGT_DX=2; ELSE
IF CODE IN ('197') THEN WGT_DX=3; ELSE
IF CODE IN ('199') THEN WGT_DX=3; ELSE
IF CODE IN ('209') THEN WGT_DX=5; ELSE
IF CODE IN ('211') THEN WGT_DX=6; ELSE
IF CODE IN ('230') THEN WGT_DX=2; ELSE
IF CODE IN ('235') THEN WGT_DX=3; ELSE
IF CODE IN ('239') THEN WGT_DX=3; ELSE
IF CODE IN ('V10') THEN WGT_DX=4; ELSE
IF CODE IN ('V76') THEN WGT_DX=4;``````

2. There apparently is a tie in WGT_DX within the group of min_DOD_DIFF. In this case we are anyway gonna pick WGT_DX of 1.

WGT_DX is a monotonic variable based on the CODE. DOD_DIFF=date of visit-date of diagnosis. Therefore, the diagnosis can be  tied to the date of diagnosis which is reflected in the DOD_DIFF. However, should not be one-to-one tie.

Ammonite | Level 13

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Thank you so much for getting back to this forum.
Tourmaline | Level 20

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

B. Using DOD_DIFF is pretty straight forward

Testing (where=(id=1))

``````proc sql;
create table B as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select  *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(wgt_dx)=wgt_dx;
quit;``````

For A

"In this exercise, we are using data with complete DOD_DIFF. However, for the purpose to estimate the measurement error in a hypothetical situation when DOD_DIFF is not available (A), we assume that we don't have any information for DOD_DIFF."

DOD_DIFF would be .  (missing) for all records for a patient ID?

Which means min(DOD_DIFF )=.  and so  WGT_DX_FULL should be assinged 0 or . (missing) ?

So with N function(to check for number of non missing values,i.e if all missing returns 0, If I infuse a logic check like

select distinct *,min(WGT_DX)*n(DOD_DIFF)>0 as WGT_DX_FULL  /*notice the highlighted n(DOD_DIFF)>0 results in 1 when there is atleast 1 non missing value and 0 if all are missings*/

So when 0, min(WGT_DX) *0  as WGT_DX_FULL  would result in zero

And when 1,min(WGT_DX) *1 =min(WGT_DX)   as WGT_DX_FULL

And then to get to this

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL  When WGT_DX_FULL  has a non zero value, WGT_DX_PARTIAL  will have to be missing or zero and vice versaThis is where it stands as of now

Ammonite | Level 13

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Hi, in a hypothetical situation assuming DOD_DIFF is not available, we don't use DOD_DIFF at all. Because we're assuming DOD_DIFF doesn't exist. Instead DOD_DIFF, for scenario B, we use date of visit (ser_date-dx).

A. Without using DOD_DIFF (where we assume DOD_DIFF is not available. But we have 'date of visit' (ser_date_dx) instead.

1. Choose the earliest date of visit (ser_date_dx)
2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
3. Pick one row per patient when these conditions meet
Tourmaline | Level 20

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Ah,  DOD_DIFF doesn't exist makes it holy lot easier.  I was assuming absence is missing. See over analyzing it.

Ok at dunkin for a quick coffee, I have my code ready in my mind though. brb at the soonest convenience

Ammonite | Level 13

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Oh so sorry that I wasn't consistent in wordings. Ballardw warned me on that. Enjoy your coffee. I wish I bought you one 🙂

Tourmaline | Level 20

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

Hmm sorry again, there is still some ambiguity imo,

Assuming DOD_DIFF is not available ,WGT_DX_FULL will not be applicable in the subtraction

C. Calculate measurement error

1. DIFF_WGT_DX= WGT_DX_FULL(is true when DOD_DIFF exist,else missing  - WGT_DX_PARTIAL(is true always because of SER_DATE_DX always available)   /*To arrive at this equation is not clear*/
2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
3. ERROR= DIFF_WGT_DX* DOD_DIFF
Ammonite | Level 13

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

C. Calculate measurement error

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, and you also have a column WGT_DX_PARTIAL calculated using SER_DATE_DX. At this point, every patients is associated with  WGT_DX_FULL  and WGT_DX_PARTIAL where we will calculate the difference between: ABS(DIFF(WGT_DX_FULL-WGT_DX_PARTIAL)).

2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)

You selected 'SER_DATE_DX' using DOD_DIFF. You also selected SER_DATE_DX under assumption of DOD_DIFF not available. These two selected SER_DATE_DX may or may not overlap. We're interested in the difference in days between these two dates.

3. ERROR= DIFF_WGT_DX* DIFF_DATE.

I'm curious what you would say regarding zero values generated for DIFF_WGT_DX and DIFF_DATE. The goal here is to quantify the error. However, zero values of both DIFF_WGT_DX and DIFF_DATE or one of these two might cancel to get quantity from ERROR= DIFF_WGT_DX* DIFF_DATE equation.

Therefore, shall we shift the calculated difference to the right by one unit so we don't deal with zero?

Tourmaline | Level 20

## Re: Calculate error with variable known and unknown scenarios using repeated measurement data

C. Calculate measurement error

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, and you also have a column WGT_DX_PARTIAL calculated using SER_DATE_DX. At this point, every patients is associated with  WGT_DX_FULL  and WGT_DX_PARTIAL where we will calculate the difference between: ABS(DIFF(WGT_DX_FULL-WGT_DX_PARTIAL)).

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF,  How is this possible if DOD_DIFF does not exist (A situation)

So if DOD_DIFF  does not exist , the potential to calculate WGT_DX_FULL doesn;t exist either , isn't that right? so what would you fill in for WGT_DX_FULL  in the equation  DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL ?

A. Without using DOD_DIFF:

1. Choose the earliest date of visit
2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
3. Pick one row per patient when these conditions meet

B. Using DOD_DIFF:

1. Choose the minimum DOD_DIFF
2. If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
3. Pick one row per patient when these conditions meet

Discussion stats
• 43 replies
• 1452 views
• 18 likes
• 3 in conversation