- Diagnostic tests to Identify the type of distribution and calculate in...


Posted 10-22-2018 02:33 PM

Hello,

I've done propensity score matching of cases and controls using variables such as disease comorbidities. My goal now is to calculate the incremental cost (the difference in average cost) between cases and controls. The cost data for both groups are right-skewed rather than normally distributed. I know there are tests for normality, but I would like a diagnostic test that identifies which type of distribution the data follow (log-normal, gamma, etc.), so that I can use the appropriate distribution to calculate the average.

I would really appreciate it if someone could help me with the process: how to run the diagnostics, transform to the appropriate distribution, and get the incremental average cost difference.

I've attached the dataset with the following variables:

- Pair_ID: Matched Pair
- VLU: '1' for Case, '0' for Control
- Post_Cost: Cost data
- Proc_score: Propensity score

I'm using SAS 9.3, so I would appreciate guidance that works in that version.
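One way to approach the diagnostic part of this question is to overlay candidate right-skewed distributions on the cost histogram for each group. The following is only a minimal sketch: it assumes the attached data have been read into a data set called `long` (the name is an assumption), that `Post_Cost` is strictly positive, and that log-normal and gamma are the candidates of interest.

```sas
/* Sketch only: LONG is an assumed data set name; Post_Cost must be > 0 */
/* for the log-normal and gamma fits with the default threshold of 0.   */
proc univariate data=long;
    class VLU;                               /* separate analysis for case (1) and control (0) */
    var Post_Cost;
    histogram Post_Cost / lognormal gamma;   /* fit and overlay both candidate densities */
run;
```

For each fitted distribution, PROC UNIVARIATE prints the parameter estimates and, where available, goodness-of-fit statistics. With several thousand observations, formal tests tend to reject almost any candidate, so the overlaid curves are usually more informative than the p-values.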

Thanks

6 REPLIES


With paired cases you should be looking at the distribution of the paired difference between costs. The overall distribution of each group doesn't really matter.

Note: many forum members (including myself) will not download Excel files. We prefer text (e.g. csv) file formats.

PG


Thanks for the reply.

You mean I have to first take the difference within each pair and then look at the distribution of the paired differences? Could you please let me know the code for the diagnostic test and how to calculate the average according to the distribution? I have attached the text file. Thanks a ton in advance.


This is what I would do:

```
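/* Read the tab-delimited file, skipping the header row */
data long;
    infile "&sasforum.\datasets\long.txt" firstobs=2 dlm='09'x;
    input Pair_ID VLU Post_Cost Proc_score;
run;

/* PROC TRANSPOSE with BY needs sorted input; sort in case the file is not */
proc sort data=long;
    by Pair_ID;
run;

/* One row per pair: VLU_1 = case cost, VLU_0 = control cost */
proc transpose data=long out=temp prefix=VLU_;
    by Pair_ID;
    id VLU;
    var Post_Cost;
run;

/* Within-pair cost difference (case minus control) */
data have;
    set temp;
    VLUdiff = VLU_1 - VLU_0;
run;

/* Normality tests, 5% Winsorized mean, and histogram of the differences */
proc univariate data=have normal winsorized=0.05;
    var VLUdiff;
    histogram;
run;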
```

Tests for Location: Mu0=0

| Test | Statistic | Value | p Value |
|---|---|---|---|
| Student's t | t | 20.97301 | Pr > \|t\| <.0001 |
| Sign | M | 857 | Pr >= \|M\| <.0001 |
| Signed Rank | S | 2372105 | Pr >= \|S\| <.0001 |

Winsorized Means

| Percent Winsorized in Tail | Number Winsorized in Tail | Winsorized Mean | Std Error Winsorized Mean | 95% Confidence Limits | DF |
|---|---|---|---|---|---|
| 5.01 | 231 | 11294.45 | 433.7075 | 10444.15 to 12144.75 | 4149 |

Every test shows that VLU_1 > VLU_0. The difference has very high kurtosis (heavy tails), hence the Winsorized estimates as an extra precaution. Given the large sample, you can assume that the mean difference estimate is approximately normally distributed and that the confidence limits are pretty good.

PG


Hello @PGStats

I don't know if you understood the question; I apologize for not being clear.

As you mentioned, the cost is very skewed, and the average cost would be heavily influenced by the extreme values.

As an example, if the data follow a log-normal distribution, we transform the data and calculate the mean (by retransforming and adjusting for the smearing effect), which gives a mean that is less influenced by possible outliers, right?
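For reference, the retransformation with a smearing adjustment mentioned here is usually written as follows. This is a sketch of the standard (Duan) smearing estimator under the assumption of a linear model on the log scale; the symbols $x_i$, $\beta$ and $\varepsilon_i$ are notation introduced for illustration and do not appear in the thread.

```latex
% Model fitted on the log scale
\log y_i = x_i^{\top}\beta + \varepsilon_i , \qquad i = 1,\dots,n

% Smearing estimate of the mean on the original (cost) scale:
% back-transformed prediction times the average exponentiated residual
\widehat{E}[\,y \mid x\,] \;=\; \exp\!\left(x^{\top}\hat{\beta}\right)\cdot
\frac{1}{n}\sum_{i=1}^{n}\exp\!\left(\hat{\varepsilon}_i\right),
\qquad \hat{\varepsilon}_i = \log y_i - x_i^{\top}\hat{\beta}
```

The second factor is the smearing correction. It is at least 1 whenever the residuals average to zero, so the corrected mean is never smaller than the naive back-transformed value.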

I tried this in R with the help of a friend and we got an average of around 6-7K. I'm new to SAS and don't know how to program a Box-Cox test to find the distribution, or how to write the code for the regression estimate in SAS.
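For the Box-Cox part, one possibility in SAS is PROC TRANSREG, which profiles the likelihood over a grid of power parameters. The following is a minimal sketch, assuming a data set `long` with strictly positive `Post_Cost` and the group indicator `VLU`; the lambda grid is an arbitrary choice for illustration.

```sas
/* Sketch only: Box-Cox profile for Post_Cost with VLU as the sole predictor. */
/* Box-Cox requires Post_Cost > 0; the lambda grid below is an assumption.    */
proc transreg data=long;
    model boxcox(Post_Cost / lambda=-2 to 2 by 0.25) = class(VLU);
run;
```

A selected lambda near 0 would point toward a log transformation, while a value near 1 would suggest leaving the costs untransformed.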

I would appreciate it if you could direct me to any resource for this.

Thanks


I should have looked at this first:

There is very little correlation between pairs, so there is almost nothing to gain from pairing the costs.
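A quick way to check the within-pair correlation could look like the sketch below. It assumes the one-row-per-pair data set `have` (with columns `VLU_1` and `VLU_0`) built in the earlier reply; Spearman is requested alongside Pearson only because the costs are heavily skewed.

```sas
/* Sketch only: correlation between case and control costs within matched pairs. */
/* Assumes HAVE holds one row per pair with VLU_1 (case) and VLU_0 (control).    */
proc corr data=have pearson spearman;
    var VLU_1 VLU_0;
run;
```

Coefficients close to zero would support treating the two groups as independent samples.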

I tried comparing smearing-corrected back-transformed means, as you suggest, with Winsorized means (another way to protect against outliers). Here is how I did it:

```
data long;
    infile "&sasforum.\datasets\long.txt" firstobs=2 dlm='09'x;
    input Pair_ID VLU Post_Cost Proc_score;
run;

/* Means of log-transformed data; Pred and Resid are on the log scale */
proc glimmix data=long;
    class VLU;
    model post_cost = VLU / dist=lognormal;
    output out=pred predicted residual;
run;

/* Winsorized means */
proc univariate data=long winsorized=0.05;
    class VLU;
    var post_cost;
    ods output WinsorizedMeans=Winsor;
run;

proc sql;
/* Apply smearing correction, separately within each group */
create table means as
select
    VLU,
    mean(exp(Pred)) as mean,
    mean(exp(resid)) as smearingCorrection,
    mean(exp(Pred)) * mean(exp(resid)) as correctedMean
from pred
group by VLU;

/* Corrected and Winsorized means side by side */
select
    a.*,
    b.mean as WinsorizedMean
from
    means as a inner join
    Winsor as b on a.VLU=input(b.VLU, best.);

/* With two groups, the range equals the difference between group means */
select
    range(b.mean) as WinsorizedMeanDiff,
    range(a.correctedMean) as correctedMeanDiff
from
    means as a inner join
    Winsor as b on a.VLU=input(b.VLU, best.);
quit;
```

| VLU | mean | smearingCorrection | correctedMean | WinsorizedMean |
|---|---|---|---|---|
| 0 | 4839.834 | 2.575668 | 12465.8 | 10779.05 |
| 1 | 11345.13 | 2.154876 | 24447.35 | 22155.50 |

| WinsorizedMeanDiff | correctedMeanDiff |
|---|---|
| 11376.46 | 11981.54 |

Both ways, I get cost difference estimates much greater than your estimate. It looks as if your smearing correction was much smaller.

PG


Hey @PGStats

Thanks for taking a shot at this. I don't know why the average comes out the same or even higher. I found this presentation:

https://www.hsrd.research.va.gov/for_researchers/cyber_seminars/archives/1258-notes.pdf

In this presentation, the author explains (slide 11 and later) that when you use a GLM with dist and link functions, you don't have to adjust for the smearing effect. What do you think about it?
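As a rough illustration of the kind of model being described, a gamma GLM with a log link models the mean cost directly, so the fitted group means are already on the cost scale and no retransformation (and therefore no smearing factor) is involved. The sketch below assumes the `long` data set from the replies above, strictly positive `Post_Cost`, and that ignoring the matched pairs is acceptable. The gamma family is an assumption here, not something established for these data.

```sas
/* Sketch only: gamma GLM with log link for cost by group.                */
/* Assumes Post_Cost > 0; observations with nonpositive cost are not used. */
proc genmod data=long;
    class VLU;
    model Post_Cost = VLU / dist=gamma link=log;
    lsmeans VLU / ilink cl;   /* ILINK reports group means on the cost scale */
run;
```

With the ILINK option, the LSMEANS table includes each group's estimated mean on the original cost scale; the incremental cost would then be the difference between the two group means.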

Thanks
