Diagnostic tests to Identify the type of distribution and calculate intercept in SAS

Sujithpeta · Posted 10-22-2018 02:33 PM

Hello,

I've done propensity score matching of cases and controls using variables such as disease comorbidities. My goal now is to calculate the incremental cost (the difference of the average costs) between cases and controls. When I look at the cost data, neither group follows a normal distribution (both are right-skewed). I know there are tests to check for normality; I would like a diagnostic test to identify the type of distribution the data follow (log-normal, gamma, etc.) and then use the appropriate distribution to calculate the average.

I would really appreciate it if someone could help me with the process: how to run the diagnostics, transform to the appropriate distribution, and get the incremental average cost difference.

I've attached the dataset with the following variables:

Pair_ID: Matched Pair
VLU: '1' for Case, '0' for Control
Post_Cost: Cost data
Proc_score: Propensity score

I'm using SAS 9.3, so I would appreciate guidance that works in that version.

Thanks
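
A minimal sketch of the kind of distribution diagnostic asked about here, assuming the data have been read into a dataset named LONG (a hypothetical name at this point in the thread) with the variables listed above, and that every Post_Cost value is strictly positive. PROC UNIVARIATE can overlay fitted log-normal and gamma curves on a histogram and print goodness-of-fit tests for each fit. With several thousand observations these tests tend to reject almost any candidate, so the plots may be more informative than the p-values.

```sas
/* Sketch only: LONG, Post_Cost and VLU follow the variable list above. */
proc univariate data=long;
    class VLU;                 /* separate fits for cases and controls */
    var Post_Cost;
    /* Fit log-normal and gamma curves (threshold 0 by default);
       each fit comes with goodness-of-fit statistics */
    histogram Post_Cost / lognormal gamma;
    /* Q-Q plot against a log-normal with estimated shape parameter */
    qqplot Post_Cost / lognormal(sigma=est);
run;
```

Points lying close to a straight line in the Q-Q plot would support the log-normal candidate; systematic curvature suggests trying another family.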

PGStats · Posted 10-22-2018 03:52 PM

With paired cases you should be looking at the distribution of the paired difference between costs. The overall distribution of each group doesn't really matter.

Note: many forum members (including myself) will not download Excel files. We prefer text (e.g. csv) file formats.

PG

Sujithpeta · Posted 10-22-2018 04:03 PM

Thanks for the reply.

You mean I first have to take the difference within each pair and then look at the distribution of the paired differences? Could you please let me know the code for the diagnostic test and how to calculate the average according to the distribution? I have attached the text file. Thanks a ton in advance.

PGStats · Posted 10-22-2018 04:38 PM

This is what I would do:

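/* Read the tab-delimited file; the first row holds the column names */
data long;
    infile "&sasforum.\datasets\long.txt" firstobs=2 dlm='09'x;
    input Pair_ID VLU Post_Cost Proc_score;
run;

/* One row per pair: VLU_0 = control cost, VLU_1 = case cost.
   BY processing assumes the data are already sorted by Pair_ID. */
proc transpose data=long out=temp prefix=VLU_;
    by pair_id;
    id VLU;
    var post_cost;
run;

/* Paired difference: case minus control */
data have;
    set temp;
    VLUdiff = VLU_1 - VLU_0;
run;

/* Normality tests, location tests, 5% Winsorized mean, and histogram */
proc univariate data=have normal winsorized=0.05;
    var VLUdiff;
    histogram;
run;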

                             Tests for Location: Mu0=0

                  Test           -Statistic-    -----p Value------

                  Student's t    t  20.97301    Pr > |t|    <.0001
                  Sign           M       857    Pr >= |M|   <.0001
                  Signed Rank    S   2372105    Pr >= |S|   <.0001

                                  Winsorized Means

    Percent       Number                 Std Error
 Winsorized   Winsorized   Winsorized   Winsorized      95% Confidence
    in Tail      in Tail         Mean         Mean          Limits               DF

       5.01          231     11294.45     433.7075   10444.15   12144.75       4149

All three location tests reject a zero difference, and the estimates are positive, so VLU_1 > VLU_0 (cases cost more than controls). The difference has very high kurtosis (heavy tails), hence the Winsorized estimates as an extra precaution. Given the large sample, you can assume that the mean difference estimate is approximately normally distributed and that the confidence limits are pretty good.

PG

Sujithpeta · Posted 10-31-2018 08:23 PM

Hello @PGStats

I don't know if you understood the question; I apologize for not being clear.

As you mentioned, the cost is very skewed, and the average cost would be heavily influenced by the extreme values.

As an example, if the data follow a log-normal distribution, we log-transform the data and calculate the mean (by back-transforming and adjusting for the smearing effect), and we get a mean that is less influenced by possible outliers, right?

I tried this in R with the help of a friend and we got an average of around 6-7K. I'm new to SAS and don't know how to program a Box-Cox test to find the distribution, or how to write the code for the regression estimate in SAS.

I would appreciate it if you could direct me to any resource for this.

Thanks
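
For the Box-Cox question above, one possible sketch in SAS uses PROC TRANSREG, which profiles the Box-Cox log likelihood over a grid of lambda values. The dataset name LONG and the grid from -2 to 2 are assumptions for illustration, and the response must be strictly positive. A selected lambda near 0 would point to a log transformation.

```sas
/* Sketch only: the lambda grid is an arbitrary illustrative choice. */
proc transreg data=long;
    /* CONVENIENT selects a round lambda (such as 0, the log transform)
       when it falls inside the confidence interval for lambda */
    model boxcox(Post_Cost / lambda=-2 to 2 by 0.25 convenient) = class(VLU);
run;
```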

PGStats · Posted 11-01-2018 01:29 AM

I should have looked at this first:

There is very little correlation between pairs, so there is almost nothing to gain from pairing the costs.
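
The check behind this statement is not shown in the thread. One way to look at it, assuming the one-row-per-pair dataset HAVE built in the earlier post (with columns VLU_0 and VLU_1), would be:

```sas
/* Sketch only: correlation between control and case costs within pairs */
proc corr data=have pearson spearman;
    var VLU_0 VLU_1;
run;
```

Spearman is included because the Pearson coefficient can be dominated by a few extreme costs.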

I tried comparing smearing-corrected back-transformed means, as you suggest, with Winsorized means (another way to protect against outliers). Here is how I did it:

/* Read the tab-delimited file; the first row holds the column names */
data long;
    infile "&sasforum.\datasets\long.txt" firstobs=2 dlm='09'x;
    input Pair_ID VLU Post_Cost Proc_score;
run;

/* Means of log-transformed data.
   The output dataset gets Pred and Resid, both on the log scale. */
proc glimmix data=long;
    class VLU;
    model post_cost = VLU / dist=lognormal;
    output out=pred predicted residual;
run;

/* Winsorized means (5% in each tail), by group */
proc univariate data=long winsorized=0.05;
    class VLU;
    var post_cost;
    ods output WinsorizedMeans=Winsor;
run;

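proc sql;
/* Apply smearing correction: back-transformed mean times
   the group mean of the exponentiated residuals */
create table means as
select 
    VLU,
    mean(exp(Pred)) as mean,
    mean(exp(resid)) as smearingCorrection,
    mean(exp(Pred)) * mean(exp(resid)) as correctedMean
from pred
group by VLU;

/* Side-by-side comparison of the two estimates for each group.
   VLU is character in the ODS table, hence the INPUT() conversion. */
select 
    a.*, 
    b.mean as WinsorizedMean 
from 
    means as a inner join
    Winsor as b on a.VLU=input(b.VLU, best.);

/* With two groups, range() is the difference between the group means */
select 
    range(b.mean) as WinsorizedMeanDiff,
    range(a.correctedMean) as correctedMeanDiff
from 
    means as a inner join
    Winsor as b on a.VLU=input(b.VLU, best.);
quit;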

                                     smearing     corrected  Winsorized
                  VLU      mean    Correction          Mean        Mean
             ----------------------------------------------------------
                    0  4839.834      2.575668       12465.8    10779.05
                    1  11345.13      2.154876      24447.35    22155.50

                               Winsorized     corrected
                                 MeanDiff      MeanDiff
                             --------------------------
                                 11376.46      11981.54

Both ways, I get cost difference estimates much greater than your estimate. It looks as if your smearing correction was much smaller.

PG

Sujithpeta · Posted 11-01-2018 04:15 PM

Hey @PGStats

Thanks for taking a shot at this. I don't know why the average comes out the same or even higher. I found this presentation:

https://www.hsrd.research.va.gov/for_researchers/cyber_seminars/archives/1258-notes.pdf

In this presentation (slide 11 and later), the author explains that when you use a GLM with a distribution and a link function, you don't have to adjust for the smearing effect. What do you think about it?

Thanks
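
A hedged sketch of the GLM approach raised in this post, assuming the LONG dataset from the earlier replies and a gamma distribution with a log link (the distribution choice is an assumption, not something the thread settles). A log link models the log of the mean cost rather than the mean of the log cost, so exponentiating the fitted value estimates the mean directly and no smearing factor is applied. The log-normal fit earlier in the thread models log(cost) itself, which is why it needs the correction.

```sas
/* Sketch only: gamma/log-link GLM; requires strictly positive costs */
proc genmod data=long;
    class VLU;
    model Post_Cost = VLU / dist=gamma link=log;
    /* ILINK reports the group means on the original cost scale */
    lsmeans VLU / ilink cl;
run;
```

The incremental cost is then the difference between the two back-transformed group means. With VLU as the only predictor, the fitted means should essentially reproduce the raw group averages, so this model by itself would not be expected to pull the estimate down.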
