BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

I'd like to calculate the error produced due to known and unknown status of a variable called DOD_DIFF when certain conditions looked after. Please see the image and the text underneath for better understanding. 

 

SCENARIO NOV12.png

What I wanted to achieve SAS code wise is following: 

 

A. In the absence of DOD_DIFF:

 

  1. Choose the earliest date of visit
  2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
  3. Pick one row per patient when these conditions meet

B. In the presence of DOD_DIFF:

 

  1. Choose the minimum DOD_DIFF
  2. If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
  3. Pick one row per patient when these conditions meet

 C. Calculate measurement error

 

  1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL
  2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
  3. ERROR= DIFF_WGT_DX* DOD_DIFF 

At the end, each patient has one ERROR value. 

 

Thank you very much for your help in advance. The demo data is attached to this post. 

 

This problem was previously discussed and conceptually solved in the post below thanks to @mkeintz and @novinosrin. And I hops this post will lead to a solution for a SAS programming. 

 

https://communities.sas.com/t5/SAS-Programming/Calculate-weighted-average-across-rows-with-condition...

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Take a look at these and test it on the sample you posted for just one id, i did for ID=1

 

/*WGT_DX_FULL*/

proc sql;
create table WGT_DX_FULL as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select  *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(wgt_dx)=wgt_dx;
quit;


/*WGT_DX_PARTIAL*/ proc sql; create table WGT_DX_PARTIAL as select distinct *,min(WGT_DX) as WGT_DX_PARTIAL from (select *, min(SER_DATE_DX) as min_SER_DATE_DX from have1(where=(id=1)) group by id having SER_DATE_DX=min_SER_DATE_DX) group by id, min_SER_DATE_DX having min(wgt_dx)=wgt_dx; quit;

 

 

Now, this should help you get the hang of  where I am at. 

WGT_DX_FULL uses DOD_DIFF
WGT_DX_PARTIAL uses SER_DATE_DX

 and we take it from here

View solution in original post

43 REPLIES 43
ballardw
Super User

Since your data picture shows values for DOD_DIF I think it might help to describe what constitutes "DOD_DIF is unknown" or "absence of DOD_DIF"

 

 

It may also help to apply an actual SAS date format to the SER_DATE_DX value

Cruise
Ammonite | Level 13

DOD_DIFF is the number of days between date of diagnosis and the date of visit for all visits of each patients. However, date of diagnosis is not always known. The data I'm using in this exercise has no missing in date of diagnosis and date of visit. Because the observations with missing in date of diagnosis was deleted to create this dataset with complete DOD_DIFF. Currently, in practice, when DOD_DIFF is not known, aka, absent, we go for the row associated with the earliest date of visit (min(SER_DATE_DX)) and the lowest error diagnosis code (min(WGT_DX)). However, I'd like to estimate how much error are we inducing by using this logic. The idea here is to estimate the error due to two different scenarios where DOD_DIFF known/present and DOD_DIFF not known/absent. In other words, DOD_DIFF not known/absent is a hypothetical scenario in this exercise since we have true DOD_DIFF in the data attached to this post. 

 

DOD_DIFF=SER_DATE_DX-DATE_OF_DIAGNOSIS 

 

 

ballardw
Super User

@Cruise wrote:

DOD_DIFF is the number of days between date of diagnosis and the date of visit for all visits of each patients. However, date of diagnosis is not always known. The data I'm using in this exercise has no missing in date of diagnosis and date of visit. Because the observations with missing in date of diagnosis was deleted to create this dataset with complete DOD_DIFF. Currently, in practice, when DOD_DIFF is not known, aka, absent, we go for the row associated with the earliest date of visit (min(SER_DATE_DX)) and the lowest error diagnosis code (min(WGT_DX)). However, I'd like to estimate how much error are we inducing by using this logic. The idea here is to estimate the error due to two different scenarios where DOD_DIFF known/present and DOD_DIFF not known/absent. In other words, DOD_DIFF not known/absent is a hypothetical scenario in this exercise since we have true DOD_DIFF in the data attached to this post. 

 

DOD_DIFF=SER_DATE_DX-DATE_OF_DIAGNOSIS 

 

 


So "not known" = the variable has a missing value???

Your picture and comment seemed to indicate that the top circled row was on of the records that had "absent" data but showed an actual value for DOD_DIFF. So I asked for clarification as to what would indicate "absent".

"Absent" is not a SAS concept in general, that is a user concept. I have data that has specific codes that indicate specific forms of "not recorded" "out of range" "biologically implausible values" and other indications that the value should not be used depending on the specific data source. I create SAS variables with missing values when appropriate to avoid calculations with the coded values. So you still have not quite defined "absent" in terms of what appears in the data. The SAS term might be "missing value" as SAS has missing as a value. It does not have "not known". When I talk to an SQL programmer I do not use "missing" because generally that is not the correct terminology; NULL however would be.

 

One reason to be specific is that SAS has functions to test for missing values such a MISSING. So the code might include

if missing (DOD_DIFF) then do ...

which would be different from

 

if DOD_DIFF = <somecode> then do ..

 

It is also extremely hard to code with pictures of data.

Cruise
Ammonite | Level 13

@ballardw

I understand. Sorry for a confusion. Thanks for explaining how missing/unknown/null can be deferentially understood in different programming settings. 

 

In this exercise, we are using data with complete DOD_DIFF. However, for the purpose to estimate the measurement error in a hypothetical situation when DOD_DIFF is not available (A), we assume that we don't have any information for DOD_DIFF. Therefore, 'date of visit' (ser_date_dx) will be used instead DOD_DIFF. The rows will be picked in Scenario A and B (DOD_DIFF is available ) as following:

 

What I wanted to achieve is following: 

 

A. Without using DOD_DIFF:

 

  1. Choose the earliest date of visit
  2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
  3. Pick one row per patient when these conditions meet

B. Using DOD_DIFF:

 

  1. Choose the minimum DOD_DIFF
  2. If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
  3. Pick one row per patient when these conditions meet

 C. Calculate measurement error

 

  1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL
  2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
  3. ERROR= DIFF_WGT_DX* DOD_DIFF 

At the end, each patient has one ERROR value.

   

Does it make clearer? Would you suggest me to delete the image attached if that will cause a lot of confusion in this forum. 

Thanks for asking!

novinosrin
Tourmaline | Level 20

Hi @Cruise Ok I have just begin on piecemeal approach starting with

 

  1. B. Using DOD_DIFF:

     

    Choose the minimum DOD_DIFF
    If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
    Pick one row per patient when these conditions meet

    Test with  ID=1 --Initial data examination
  2. 
    proc sql;
    create table want as
    select  *, min(DOD_DIFF) as min_DOD_DIFF
    from have1(where=(id=1))
    group by id
    having DOD_DIFF=min_DOD_DIFF
    order by id ,DOD_DIFF, wgt_dx;
    quit;
    proc print noobs;run;
    SAS Output--> Can you explain what if the WGT_DX=3 was the min here. Is it just select distinct as they are identical?
    SER_DATE_DX DOD_DIFF CODE ID WGT_DX min_DOD_DIFF
    24JUL2006 1 153 1 1 1
    24JUL2006 1 197 1 3 1
    24JUL2006 1 197 1 3 1

    There apparently is a tie in WGT_DX within the group of min_DOD_DIFF. In this case we are anyway gonna pick WGT_DX of 1.

Cruise
Ammonite | Level 13

1. Can you explain what if the WGT_DX=3 was the min here. Is it just select distinct as they are identical?

    Yes, select distinct as they are identical. Because this is an insurance claim data these static information such as CODE/WGT_DX is       repeated every time they enter in other health care related information. 

 

IF CODE IN ('153') THEN WGT_DX=1; ELSE    
IF CODE IN ('154') THEN WGT_DX=1; ELSE   
IF CODE IN ('159') THEN WGT_DX=2; ELSE  
IF CODE IN ('197') THEN WGT_DX=3; ELSE  
IF CODE IN ('199') THEN WGT_DX=3; ELSE  
IF CODE IN ('209') THEN WGT_DX=5; ELSE  
IF CODE IN ('211') THEN WGT_DX=6; ELSE  
IF CODE IN ('230') THEN WGT_DX=2; ELSE  
IF CODE IN ('235') THEN WGT_DX=3; ELSE  
IF CODE IN ('239') THEN WGT_DX=3; ELSE  
IF CODE IN ('V10') THEN WGT_DX=4; ELSE  
IF CODE IN ('V76') THEN WGT_DX=4;

2. There apparently is a tie in WGT_DX within the group of min_DOD_DIFF. In this case we are anyway gonna pick WGT_DX of 1.

 

WGT_DX is a monotonic variable based on the CODE. DOD_DIFF=date of visit-date of diagnosis. Therefore, the diagnosis can be  tied to the date of diagnosis which is reflected in the DOD_DIFF. However, should not be one-to-one tie.  

Cruise
Ammonite | Level 13
Thank you so much for getting back to this forum.
novinosrin
Tourmaline | Level 20

B. Using DOD_DIFF is pretty straight forward

Testing (where=(id=1))

 

 

proc sql;
create table B as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select  *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(wgt_dx)=wgt_dx;
quit;

For A

 

"In this exercise, we are using data with complete DOD_DIFF. However, for the purpose to estimate the measurement error in a hypothetical situation when DOD_DIFF is not available (A), we assume that we don't have any information for DOD_DIFF."

 

DOD_DIFF would be .  (missing) for all records for a patient ID?

Which means min(DOD_DIFF )=.  and so  WGT_DX_FULL should be assinged 0 or . (missing) ?

 

So with N function(to check for number of non missing values,i.e if all missing returns 0, If I infuse a logic check like

 

select distinct *,min(WGT_DX)*n(DOD_DIFF)>0 as WGT_DX_FULL  /*notice the highlighted n(DOD_DIFF)>0 results in 1 when there is atleast 1 non missing value and 0 if all are missings*/

 

So when 0, min(WGT_DX) *0  as WGT_DX_FULL  would result in zero

And when 1,min(WGT_DX) *1 =min(WGT_DX)   as WGT_DX_FULL  

 

 

And then to get to this

  1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL  When WGT_DX_FULL  has a non zero value, WGT_DX_PARTIAL  will have to be missing or zero and vice versaThis is where it stands as of now

 

 

 

Cruise
Ammonite | Level 13

Hi, in a hypothetical situation assuming DOD_DIFF is not available, we don't use DOD_DIFF at all. Because we're assuming DOD_DIFF doesn't exist. Instead DOD_DIFF, for scenario B, we use date of visit (ser_date-dx).

 

A. Without using DOD_DIFF (where we assume DOD_DIFF is not available. But we have 'date of visit' (ser_date_dx) instead. 

 

  1. Choose the earliest date of visit (ser_date_dx)
  2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
  3. Pick one row per patient when these conditions meet
novinosrin
Tourmaline | Level 20

Ah,  DOD_DIFF doesn't exist makes it holy lot easier.  I was assuming absence is missing. See over analyzing it.

 

Ok at dunkin for a quick coffee, I have my code ready in my mind though. brb at the soonest convenience

Cruise
Ammonite | Level 13

Oh so sorry that I wasn't consistent in wordings. Ballardw warned me on that. Enjoy your coffee. I wish I bought you one 🙂

novinosrin
Tourmaline | Level 20

Hmm sorry again, there is still some ambiguity imo,

Assuming DOD_DIFF is not available ,WGT_DX_FULL will not be applicable in the subtraction

 

 C. Calculate measurement error

 

  1. DIFF_WGT_DX= WGT_DX_FULL(is true when DOD_DIFF exist,else missing  - WGT_DX_PARTIAL(is true always because of SER_DATE_DX always available)   /*To arrive at this equation is not clear*/
  2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
  3. ERROR= DIFF_WGT_DX* DOD_DIFF 
Cruise
Ammonite | Level 13

 C. Calculate measurement error

 

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL 

 

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, and you also have a column WGT_DX_PARTIAL calculated using SER_DATE_DX. At this point, every patients is associated with  WGT_DX_FULL  and WGT_DX_PARTIAL where we will calculate the difference between: ABS(DIFF(WGT_DX_FULL-WGT_DX_PARTIAL)). 

 

2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)

 

You selected 'SER_DATE_DX' using DOD_DIFF. You also selected SER_DATE_DX under assumption of DOD_DIFF not available. These two selected SER_DATE_DX may or may not overlap. We're interested in the difference in days between these two dates.    

 

3. ERROR= DIFF_WGT_DX* DIFF_DATE. 

 

I'm curious what you would say regarding zero values generated for DIFF_WGT_DX and DIFF_DATE. The goal here is to quantify the error. However, zero values of both DIFF_WGT_DX and DIFF_DATE or one of these two might cancel to get quantity from ERROR= DIFF_WGT_DX* DIFF_DATE equation. 

 

Therefore, shall we shift the calculated difference to the right by one unit so we don't deal with zero?  

 

novinosrin
Tourmaline | Level 20

 C. Calculate measurement error

 

1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL 

 

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, and you also have a column WGT_DX_PARTIAL calculated using SER_DATE_DX. At this point, every patients is associated with  WGT_DX_FULL  and WGT_DX_PARTIAL where we will calculate the difference between: ABS(DIFF(WGT_DX_FULL-WGT_DX_PARTIAL)). 

 

Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF,  How is this possible if DOD_DIFF does not exist (A situation) 

So if DOD_DIFF  does not exist , the potential to calculate WGT_DX_FULL doesn;t exist either , isn't that right? so what would you fill in for WGT_DX_FULL  in the equation  DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL ?

 

A. Without using DOD_DIFF:

 

  1. Choose the earliest date of visit
  2. If there are multiple records for that data, choose the lowest WGT_DX, call is WGT_DX_PARTIAL
  3. Pick one row per patient when these conditions meet

B. Using DOD_DIFF:

 

  1. Choose the minimum DOD_DIFF
  2. If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
  3. Pick one row per patient when these conditions meet

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 43 replies
  • 1436 views
  • 18 likes
  • 3 in conversation