Sujithpeta
Quartz | Level 8

Hello,

 

I've done propensity score matching of cases and controls on variables such as disease comorbidities. My goal now is to calculate the incremental cost (the difference in average cost) between cases and controls. The cost data for both groups are right-skewed and do not follow a normal distribution. I know there are tests for normality, but I would like to run diagnostics to identify which distribution the data actually follow (log-normal, gamma, etc.) and then use the appropriate distribution to calculate the averages.

I would really appreciate it if someone could walk me through how to run the diagnostics, apply the appropriate transformation, and get the incremental average cost difference.

 

I've attached the dataset, which has the following variables:

  1. Paid_ID: Matched Pair
  2. VLU: '1' for Case, '0' for Control
  3. Post_Cost: Cost data
  4. Proc_score: Propensity score

 

I'm using SAS 9.3, so I would appreciate guidance that works in that version.

 

Thanks

6 REPLIES
PGStats
Opal | Level 21

With paired cases you should be looking at the distribution of the paired difference between costs. The overall distribution of each group doesn't really matter.

 

Note: many forum members (including myself) will not download Excel files. We prefer text (e.g. csv) file formats.

PG
Sujithpeta
Quartz | Level 8

Thanks for the reply.

 

Do you mean that I first have to take the difference within each pair and then look at the distribution of those paired differences? Could you please share the code for the diagnostic test and show how to calculate the average according to the distribution? I have attached the text file. Thanks a ton in advance.

PGStats
Opal | Level 21

This is what I would do:

 

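/* Read the tab-delimited file, skipping the header row */
data long;
    infile "&sasforum.\datasets\long.txt" firstobs=2 dlm='09'x;
    input Pair_ID VLU Post_Cost Proc_score;
run;

/* BY-group processing below requires the data sorted by pair */
proc sort data=long;
    by Pair_ID;
run;

/* One row per pair, with columns VLU_0 (control) and VLU_1 (case) */
proc transpose data=long out=temp prefix=VLU_;
    by Pair_ID;
    id VLU;
    var Post_Cost;
run;

/* Within-pair cost difference: case minus control */
data have;
    set temp;
    VLUdiff = VLU_1 - VLU_0;
run;

/* Normality tests, location tests and Winsorized mean of the differences */
proc univariate data=have normal winsorized=0.05;
    var VLUdiff;
    histogram;
run;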

                             Tests for Location: Mu0=0

                  Test           -Statistic-    -----p Value------

                  Student's t    t  20.97301    Pr > |t|    <.0001
                  Sign           M       857    Pr >= |M|   <.0001
                  Signed Rank    S   2372105    Pr >= |S|   <.0001

                                  Winsorized Means

    Percent       Number                 Std Error
 Winsorized   Winsorized   Winsorized   Winsorized      95% Confidence
    in Tail      in Tail         Mean         Mean          Limits               DF

       5.01          231     11294.45     433.7075   10444.15   12144.75       4149

Every test rejects a zero location for the difference, with VLU_1 > VLU_0: cases cost more than their matched controls. The difference has very high kurtosis (heavy tails), hence the Winsorized estimates as an extra precaution. Given the large sample, you can assume that the mean difference estimate is approximately normally distributed and that the confidence limits are pretty good.
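
As a cross-check on the Student's t result, here is a minimal sketch using PROC TTEST. It assumes the data set have built above (one row per pair, columns VLU_1 and VLU_0). The PAIRED statement analyzes the within-pair difference and reports the plain (non-Winsorized) mean difference with its confidence limits, which can be compared against the Winsorized interval:

```sas
/* Paired t-test on the matched costs.
   Assumes HAVE from the data step above, with VLU_1 (case) and VLU_0 (control). */
proc ttest data=have alpha=0.05;
    paired VLU_1*VLU_0;   /* analyzes VLU_1 - VLU_0 within each pair */
run;
```

If the two intervals differ a lot, that points to the extreme pairs driving the raw mean.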

 

PG
Sujithpeta
Quartz | Level 8

Hello @PGStats 

 

I'm not sure my question came across; apologies for not being clear.

As you mentioned, the cost is very skewed, and the average cost would be heavily influenced by the extreme values.

As an example, if the data follow a log-normal distribution, we log-transform the data and calculate the mean on that scale, then back-transform it and adjust for the smearing effect. That should give a mean that is less influenced by possible outliers, right?

I tried this in R with the help of a friend and we got an average of around 6-7K. I'm new to SAS and don't know how to program a Box-Cox test to find the distribution, or how to write the code for the regression estimate in SAS.

I would appreciate it if you could point me to any resources for this.

 

Thanks
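
For the diagnostics asked about here, a minimal sketch (not taken from the replies in this thread). It assumes the data set long with variables VLU and Post_Cost as read in the earlier reply, and it assumes every cost is strictly positive, since both the Box-Cox transformation and the log-normal and gamma fits require that. The lambda grid is an arbitrary illustrative choice:

```sas
/* Box-Cox diagnostic: a selected lambda near 0 supports a log transform.
   Assumes LONG with VLU and Post_Cost, and Post_Cost > 0 throughout. */
proc transreg data=long;
    model boxcox(post_cost / lambda=-1 to 1 by 0.1 convenient)
          = class(VLU);
run;

/* Fit log-normal and gamma curves to each group's costs;
   PROC UNIVARIATE prints goodness-of-fit tests for each fitted distribution. */
proc univariate data=long;
    class VLU;
    var post_cost;
    histogram post_cost / lognormal gamma;
run;
```

If some costs are zero, these steps need a decision about how to handle them first (for example a two-part model), which is outside this sketch.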

PGStats
Opal | Level 21

I should have looked at this first:

 

[Image: SGPlot1.png]

There is very little correlation between pairs, so there is almost nothing to gain from pairing the costs.
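
One way to put a number on this, sketched under the assumption that the wide data set have from the earlier PROC TRANSPOSE step (one row per pair, columns VLU_0 and VLU_1) is still available:

```sas
/* Within-pair correlation of control and case costs.
   Spearman is included because the costs are heavily skewed. */
proc corr data=have pearson spearman;
    var VLU_0 VLU_1;
run;

/* Scatter plot of case cost against control cost, one point per pair */
proc sgplot data=have;
    scatter x=VLU_0 y=VLU_1;
run;
```

Coefficients close to zero would support treating the two groups as independent samples.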

 

I tried comparing smearing-corrected back-transformed means, as you suggest, with Winsorized means (another way to protect against outliers). Here is how I did it:

 

/* Read the tab-delimited file, skipping the header row */
data long;
    infile "&sasforum.\datasets\long.txt" firstobs=2 dlm='09'x;
    input Pair_ID VLU Post_Cost Proc_score;
run;

/* Means of log-transformed data: Pred and Resid are on the log scale */
proc glimmix data=long;
    class VLU;
    model post_cost = VLU / dist=lognormal;
    output out=pred predicted residual;
run;

/* Winsorized means by group */
proc univariate data=long winsorized=0.05;
    class VLU;
    var post_cost;
    ods output WinsorizedMeans=Winsor;
run;

proc sql;
/* Apply smearing correction: back-transformed mean times mean of exp(residual) */
create table means as
select 
    VLU,
    mean(exp(Pred)) as mean,
    mean(exp(resid)) as smearingCorrection,
    mean(exp(Pred)) * mean(exp(resid)) as correctedMean
from pred
group by VLU;

/* Corrected and Winsorized means side by side, per group */
select 
    a.*, 
    b.mean as WinsorizedMean 
from 
    means as a inner join
    Winsor as b on a.VLU=input(b.VLU, best.);

/* With two groups, the range of the means is the difference between them */
select 
    range(b.mean) as WinsorizedMeanDiff,
    range(a.correctedMean) as correctedMeanDiff
from 
    means as a inner join
    Winsor as b on a.VLU=input(b.VLU, best.);
quit;

                                     smearing     corrected  Winsorized
                  VLU      mean    Correction          Mean        Mean
             ----------------------------------------------------------
                    0  4839.834      2.575668       12465.8    10779.05
                    1  11345.13      2.154876      24447.35    22155.50

                               Winsorized     corrected
                                 MeanDiff      MeanDiff
                             --------------------------
                                 11376.46      11981.54

Both ways, I get cost difference estimates (about 11.4K Winsorized and 12.0K smearing-corrected) much greater than your 6-7K estimate. It looks as if your smearing correction was much smaller.

 

PG
Sujithpeta
Quartz | Level 8

Hey @PGStats 

 

Thanks for taking a shot at this. I don't know why the average comes out the same or even higher. I found this presentation:

https://www.hsrd.research.va.gov/for_researchers/cyber_seminars/archives/1258-notes.pdf

In this presentation, the author explains (slide 11 and later) that when you use a GLM with dist and link functions, you don't have to adjust for the smearing effect. What do you think about it?

 

Thanks
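
A minimal sketch of the kind of model the presentation describes, assuming the data set long from the replies above and strictly positive costs. The gamma distribution here is one common choice for skewed cost data, not something this thread has established for this data set:

```sas
/* Gamma GLM with a log link: the mean cost itself is modeled, so the
   back-transformed predictions need no smearing factor.
   Assumes LONG with VLU and Post_Cost, and Post_Cost > 0 throughout. */
proc glimmix data=long;
    class VLU;
    model post_cost = VLU / dist=gamma link=log solution;
    lsmeans VLU / ilink cl;   /* ILINK adds the group means on the cost scale */
run;
```

The incremental cost is then the difference between the two inverse-linked group means. With VLU as the only predictor, those fitted means coincide with the raw arithmetic group means, so the model mainly adds value once other covariates enter.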
