I'd like to calculate the error produced due to known and unknown status of a variable called DOD_DIFF when certain conditions looked after. Please see the image and the text underneath for better understanding.
What I wanted to achieve SAS code wise is following:
A. In the absence of DOD_DIFF:
B. In the presence of DOD_DIFF:
C. Calculate measurement error
At the end, each patient has one ERROR value.
Thank you very much for your help in advance. The demo data is attached to this post.
This problem was previously discussed and conceptually solved in the post below thanks to @mkeintz and @novinosrin. And I hops this post will lead to a solution for a SAS programming.
Take a look at these and test it on the sample you posted for just one id, i did for ID=1
/*WGT_DX_FULL*/
proc sql;
create table WGT_DX_FULL as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(wgt_dx)=wgt_dx;
quit;
/*WGT_DX_PARTIAL*/
proc sql;
create table WGT_DX_PARTIAL as
select distinct *,min(WGT_DX) as WGT_DX_PARTIAL
from
(select *, min(SER_DATE_DX) as min_SER_DATE_DX
from have1(where=(id=1))
group by id
having SER_DATE_DX=min_SER_DATE_DX)
group by id, min_SER_DATE_DX
having min(wgt_dx)=wgt_dx;
quit;
Now, this should help you get the hang of where I am at.
WGT_DX_FULL uses DOD_DIFF
WGT_DX_PARTIAL uses SER_DATE_DX
and we take it from here
Since your data picture shows values for DOD_DIF I think it might help to describe what constitutes "DOD_DIF is unknown" or "absence of DOD_DIF"
It may also help to apply an actual SAS date format to the SER_DATE_DX value
DOD_DIFF is the number of days between date of diagnosis and the date of visit for all visits of each patients. However, date of diagnosis is not always known. The data I'm using in this exercise has no missing in date of diagnosis and date of visit. Because the observations with missing in date of diagnosis was deleted to create this dataset with complete DOD_DIFF. Currently, in practice, when DOD_DIFF is not known, aka, absent, we go for the row associated with the earliest date of visit (min(SER_DATE_DX)) and the lowest error diagnosis code (min(WGT_DX)). However, I'd like to estimate how much error are we inducing by using this logic. The idea here is to estimate the error due to two different scenarios where DOD_DIFF known/present and DOD_DIFF not known/absent. In other words, DOD_DIFF not known/absent is a hypothetical scenario in this exercise since we have true DOD_DIFF in the data attached to this post.
DOD_DIFF=SER_DATE_DX-DATE_OF_DIAGNOSIS
@Cruise wrote:
DOD_DIFF is the number of days between date of diagnosis and the date of visit for all visits of each patients. However, date of diagnosis is not always known. The data I'm using in this exercise has no missing in date of diagnosis and date of visit. Because the observations with missing in date of diagnosis was deleted to create this dataset with complete DOD_DIFF. Currently, in practice, when DOD_DIFF is not known, aka, absent, we go for the row associated with the earliest date of visit (min(SER_DATE_DX)) and the lowest error diagnosis code (min(WGT_DX)). However, I'd like to estimate how much error are we inducing by using this logic. The idea here is to estimate the error due to two different scenarios where DOD_DIFF known/present and DOD_DIFF not known/absent. In other words, DOD_DIFF not known/absent is a hypothetical scenario in this exercise since we have true DOD_DIFF in the data attached to this post.
DOD_DIFF=SER_DATE_DX-DATE_OF_DIAGNOSIS
So "not known" = the variable has a missing value???
Your picture and comment seemed to indicate that the top circled row was on of the records that had "absent" data but showed an actual value for DOD_DIFF. So I asked for clarification as to what would indicate "absent".
"Absent" is not a SAS concept in general, that is a user concept. I have data that has specific codes that indicate specific forms of "not recorded" "out of range" "biologically implausible values" and other indications that the value should not be used depending on the specific data source. I create SAS variables with missing values when appropriate to avoid calculations with the coded values. So you still have not quite defined "absent" in terms of what appears in the data. The SAS term might be "missing value" as SAS has missing as a value. It does not have "not known". When I talk to an SQL programmer I do not use "missing" because generally that is not the correct terminology; NULL however would be.
One reason to be specific is that SAS has functions to test for missing values such a MISSING. So the code might include
if missing (DOD_DIFF) then do ...
which would be different from
if DOD_DIFF = <somecode> then do ..
It is also extremely hard to code with pictures of data.
I understand. Sorry for a confusion. Thanks for explaining how missing/unknown/null can be deferentially understood in different programming settings.
In this exercise, we are using data with complete DOD_DIFF. However, for the purpose to estimate the measurement error in a hypothetical situation when DOD_DIFF is not available (A), we assume that we don't have any information for DOD_DIFF. Therefore, 'date of visit' (ser_date_dx) will be used instead DOD_DIFF. The rows will be picked in Scenario A and B (DOD_DIFF is available ) as following:
What I wanted to achieve is following:
A. Without using DOD_DIFF:
B. Using DOD_DIFF:
C. Calculate measurement error
At the end, each patient has one ERROR value.
Does it make clearer? Would you suggest me to delete the image attached if that will cause a lot of confusion in this forum.
Thanks for asking!
Hi @Cruise Ok I have just begin on piecemeal approach starting with
B. Using DOD_DIFF:
Choose the minimum DOD_DIFF
If there are multiple records for that value of DOD_DIFF, choose the lowestWGT_DX, call it WGT_DX_FULL
Pick one row per patient when these conditions meet
proc sql;
create table want as
select *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF
order by id ,DOD_DIFF, wgt_dx;
quit;
proc print noobs;run;
SAS Output--> Can you explain what if the WGT_DX=3 was the min here. Is it just select distinct as they are identical?
SER_DATE_DX | DOD_DIFF | CODE | ID | WGT_DX | min_DOD_DIFF |
---|---|---|---|---|---|
24JUL2006 | 1 | 153 | 1 | 1 | 1 |
24JUL2006 | 1 | 197 | 1 | 3 | 1 |
24JUL2006 | 1 | 197 | 1 | 3 | 1 |
There apparently is a tie in WGT_DX within the group of min_DOD_DIFF. In this case we are anyway gonna pick WGT_DX of 1.
1. Can you explain what if the WGT_DX=3 was the min here. Is it just select distinct as they are identical?
Yes, select distinct as they are identical. Because this is an insurance claim data these static information such as CODE/WGT_DX is repeated every time they enter in other health care related information.
IF CODE IN ('153') THEN WGT_DX=1; ELSE
IF CODE IN ('154') THEN WGT_DX=1; ELSE
IF CODE IN ('159') THEN WGT_DX=2; ELSE
IF CODE IN ('197') THEN WGT_DX=3; ELSE
IF CODE IN ('199') THEN WGT_DX=3; ELSE
IF CODE IN ('209') THEN WGT_DX=5; ELSE
IF CODE IN ('211') THEN WGT_DX=6; ELSE
IF CODE IN ('230') THEN WGT_DX=2; ELSE
IF CODE IN ('235') THEN WGT_DX=3; ELSE
IF CODE IN ('239') THEN WGT_DX=3; ELSE
IF CODE IN ('V10') THEN WGT_DX=4; ELSE
IF CODE IN ('V76') THEN WGT_DX=4;
2. There apparently is a tie in WGT_DX within the group of min_DOD_DIFF. In this case we are anyway gonna pick WGT_DX of 1.
WGT_DX is a monotonic variable based on the CODE. DOD_DIFF=date of visit-date of diagnosis. Therefore, the diagnosis can be tied to the date of diagnosis which is reflected in the DOD_DIFF. However, should not be one-to-one tie.
B. Using DOD_DIFF is pretty straight forward
Testing (where=(id=1))
proc sql;
create table B as
select distinct *,min(WGT_DX) as WGT_DX_FULL
from
(select *, min(DOD_DIFF) as min_DOD_DIFF
from have1(where=(id=1))
group by id
having DOD_DIFF=min_DOD_DIFF)
group by id, min_DOD_DIFF
having min(wgt_dx)=wgt_dx;
quit;
For A
"In this exercise, we are using data with complete DOD_DIFF. However, for the purpose to estimate the measurement error in a hypothetical situation when DOD_DIFF is not available (A), we assume that we don't have any information for DOD_DIFF."
DOD_DIFF would be . (missing) for all records for a patient ID?
Which means min(DOD_DIFF )=. and so WGT_DX_FULL should be assinged 0 or . (missing) ?
So with N function(to check for number of non missing values,i.e if all missing returns 0, If I infuse a logic check like
select distinct *,min(WGT_DX)*n(DOD_DIFF)>0 as WGT_DX_FULL /*notice the highlighted n(DOD_DIFF)>0 results in 1 when there is atleast 1 non missing value and 0 if all are missings*/
So when 0, min(WGT_DX) *0 as WGT_DX_FULL would result in zero
And when 1,min(WGT_DX) *1 =min(WGT_DX) as WGT_DX_FULL
And then to get to this
Hi, in a hypothetical situation assuming DOD_DIFF is not available, we don't use DOD_DIFF at all. Because we're assuming DOD_DIFF doesn't exist. Instead DOD_DIFF, for scenario B, we use date of visit (ser_date-dx).
A. Without using DOD_DIFF (where we assume DOD_DIFF is not available. But we have 'date of visit' (ser_date_dx) instead.
Ah, DOD_DIFF doesn't exist makes it holy lot easier. I was assuming absence is missing. See over analyzing it.
Ok at dunkin for a quick coffee, I have my code ready in my mind though. brb at the soonest convenience
Oh so sorry that I wasn't consistent in wordings. Ballardw warned me on that. Enjoy your coffee. I wish I bought you one 🙂
Hmm sorry again, there is still some ambiguity imo,
Assuming DOD_DIFF is not available ,WGT_DX_FULL will not be applicable in the subtraction
C. Calculate measurement error
C. Calculate measurement error
1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL
Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, and you also have a column WGT_DX_PARTIAL calculated using SER_DATE_DX. At this point, every patients is associated with WGT_DX_FULL and WGT_DX_PARTIAL where we will calculate the difference between: ABS(DIFF(WGT_DX_FULL-WGT_DX_PARTIAL)).
2. DIFF_DATE=ABS(DIF('date of visit' selected in the absence of DOD_DIFF, 'date of visit' selected in the presence of DOD_DIFF)
You selected 'SER_DATE_DX' using DOD_DIFF. You also selected SER_DATE_DX under assumption of DOD_DIFF not available. These two selected SER_DATE_DX may or may not overlap. We're interested in the difference in days between these two dates.
3. ERROR= DIFF_WGT_DX* DIFF_DATE.
I'm curious what you would say regarding zero values generated for DIFF_WGT_DX and DIFF_DATE. The goal here is to quantify the error. However, zero values of both DIFF_WGT_DX and DIFF_DATE or one of these two might cancel to get quantity from ERROR= DIFF_WGT_DX* DIFF_DATE equation.
Therefore, shall we shift the calculated difference to the right by one unit so we don't deal with zero?
C. Calculate measurement error
1. DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL
Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, and you also have a column WGT_DX_PARTIAL calculated using SER_DATE_DX. At this point, every patients is associated with WGT_DX_FULL and WGT_DX_PARTIAL where we will calculate the difference between: ABS(DIFF(WGT_DX_FULL-WGT_DX_PARTIAL)).
Yes, you have a column for WGT_DX_FULL which was calculated using DOD_DIFF, How is this possible if DOD_DIFF does not exist (A situation)
So if DOD_DIFF does not exist , the potential to calculate WGT_DX_FULL doesn;t exist either , isn't that right? so what would you fill in for WGT_DX_FULL in the equation DIFF_WGT_DX= WGT_DX_FULL - WGT_DX_PARTIAL ?
A. Without using DOD_DIFF:
B. Using DOD_DIFF:
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.