BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
PLarsen
Calcite | Level 5

Hello-

Longtime SAS programmer, first time posting to this site.

Here is what I know....

*I know that some data that I am interested in studying is shaped like a normal distribution with a slight skew to the right or, in some cases, closely approximated by a lognormal distribution.

*I do not have access to the raw data, but I have seen a histogram which led me to the above conclusion about the shape and bounds.

*The data is lower-bounded by zero and upper-bounded at 100.  The mean is 29, but I can only infer the standard deviation by visual inspection of the graph.  I am using the "range rule" to estimate the standard deviation--range divided by four is an approximate of the standard deviation.

*I would like to repeatedly draw from a distribution that lies somewhere between a right-skewed normal and a lognormal with mean=29.

*I have read up about the randgen call and rand function, but the scale and shape parameters for the beta and lognormal functions confuse me.

QUESTION: Can you send a code example so that I can play around with alternative distributions that are lower bounded by zero and upper-bounded at 100--with mean at 29.

Thank you so much for the help.  I have never had to fit a probability distribution based on a graphic of a histogram--with no underlying data.

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Rick,

Both distribution could accepted , since we don't know what data distribution exactly is .

But I am rooting for Normal, since Normal distribution represent the error item in most Stat Model .

About your question " how to choose the width of the noise ".

I know how to settle it . I will make the width of bin equal 6*sigma which could guarantee 99.8% data could fall into a bin,

or 5*sigma 99.3% . That doesn't matter. The following code is fixed in order to making all data fall into the bin .

Code: Program1.sas

data have;
infile cards truncover expandtabs;
input a b p : percent8.;
n+1;
x=mean(a,b);
pp+p;
cards;
5 10 6%
10 15 9%
15 20 11%
20 25 17%
25 30 7%
30 35 13%
35 40 8%
40 45 8%
45 50 6%
50 55 4%
55 60 4%
60 65 4%
65 70 3%
;
run;
proc sql;
select p into : list separated by ',' from have;
quit;
data want;

declare hash ha(dataset:'have');
ha.definekey('n');
ha.definedata('x');
ha.definedone();
std=5/6; /*5 is the width of bin*/
call streaminit(1234);
do i=1 to 10000;
n=rand('table',&list);
ha.find();
x=x+rand('normal',0,std);
output;
end;
drop i;
run;
proc univariate data=want;
var x;
histogram x/kernel;
run;
 

Xia Keshan

View solution in original post

10 REPLIES 10
ets_kps
SAS Employee

While not exactly an answer to your question, there are some really neat tools to turn graphs back into data sets.  Here are a few of the tools.

http://getdata-graph-digitizer.com/

xyExtract Graph Digitizer 5.1 - Free download

DigitizeIt - Graph Digitizer Software. Digitize graphs, charts and math data

And to get you started in SAS. http://support.sas.com/resources/papers/proceedings15/SAS1387-2015.pdf

Ken

Astounding
PROC Star

I'm not sure if this is statistically valid or not, but here's an approach.

Generate a random sample from a mean of 29 and small standard deviation.  Keep the records that fall between 0 and 29.

Generate a second random sample from a mean of 29 and a larger standard deviation (to mimic the skewness in the fat tail).  Keep the records that fall between 29 and 100.

Combine the two, should approximate the distribution you are looking for.

One issue:  you might want to take a larger number of records from the second sample than the first, possibly in proportion to the area under the curve below 29 vs. area under the curve above 29.

Good luck.

Rick_SAS
SAS Super FREQ

The hard part about these situations is deciding whether to model the distribution (as you and Astounding propose), or whether to simulate from the empirical distribution function.  I usually model when I have some domain expertise and can make informed assumptions about the distribution.  If I know nothing about the data or the proess that generated them, I would be inclined to simulate from the empirical CDF, based on the published quantiles.

If you have the histogram, then you have many quantiles of the distribution, which will be more powerful than just the mean, standard deviation, and skew.  You can use the ideas in the article "Approximating a distribution from published quantiles" to carry out the simulation. (Search for "download" if you want the SAS/IML program that analyzed the data.)

To convert the histogram to quantiles, you just take the cumulative sum of the counts or percentages for each bin. For example, if your histogram has three bins:

BIN     COUNT     CUSUM  CUM_PROPORTION

[0,1)     2               2               2/6

[1,2)     3               5               5/6

[2,3]     1               6               6/6

Then the corresponding empirical CDF is a picewise linear function passing through

(0,0),  (1, 2/6),  (3, 5/6),  (4,1).

After you have the CDF, you can use inverse CDF sampling to obtain simulated data.

PLarsen
Calcite | Level 5

All-

Thanks for all of the insights to my problem.

Rick-

I like your approach, because it is both effective and elegant.  I read your articles, but am having a difficult time figuring out how to simulate the data given my unique CDF (see below for the actual data).  I should also mention that I do not have PROC IML installed, so I will probably have to use the inverse CDF algorithm using the RAND() function.

*****Can you adapt your code to accommodate the following CDF?***** 

Many thanks for the help here.  For your part, I am happy to cite your article/technique in my manuscript.

BinCu_SumCum_Proportion
[0,5]5000%
[5,10]175006%
[10,15]4200015%
[15,20]7400026%
[20,25]12350043%
[25,30]14550050%
[30,35]18300063%
[35,40]20550071%
[40,45]22750079%
[45,50]24650085%
[50,55]25850089%
[55,60]26860093%
[60,65]28010097%
[65,70]289100100%
[70,75]289200100%
[75,80]289200100%
[80,85]289300100%
[85,90]289400100%
[90,95]289500100%
[95,100]289500100%
Rick_SAS
SAS Super FREQ

Use PROC FCMP to define the inverse CDF by using a lot if IF-THEN/ELSE statements.  The input is a number in [0,1]. The output is a number in the domain of the CDF, which is [0,100] for these data.

For example, if the input value is t=0.75, then the function would

1) Figure out that x is in the bin of cumulative proportions [0.71,  0.79].

2) Compute where x is in the interval by using p(x) = (x-0.71)/(0.79-0.71). For x=0.75, you get p(0.75)=0.5.

3) Evaluate a linear function to get the y value. The interval that corresponds to [0.71, 0.79] is [40,45].

The value that is 0.5 along that interval is y = 40 + (45-40)*0.5 = 42.5.

Thus 42.5 would be the output for an input of 0.75.   When you pass in random uniform variates, you will get out random y's that are distributed according to your empirical distribution function.

Ksharp
Super User

How about this one ?

Code: Program1.sas

data have;
infile cards truncover expandtabs;
input a b p : percent8.;
n+1;
x=mean(a,b);
pp+p;
cards;
5 10 6%
10 15 9%
15 20 11%
20 25 17%
25 30 7%
30 35 13%
35 40 8%
40 45 8%
45 50 6%
50 55 4%
55 60 4%
60 65 4%
65 70 3%
;
run;
proc sql;
select p into : list separated by ',' from have;
quit;
data want;

declare hash ha(dataset:'have');
ha.definekey('n');
ha.definedata('x');
ha.definedone();

call streaminit(1234);
do i=1 to 10000;
n=rand('table',&list);
ha.find();
x=x+rand('normal');
output;
end;
drop i;
run;
proc univariate data=want;
var x;
histogram x/kernel;
run;
 
Rick_SAS
SAS Super FREQ

This approach is known as the "smooth bootstrap." You can read about it and how to choose the width of the noise in Chapter 15.8 of Simulting Data with SAS.   For this application, I'd recommend that adding random uniform noise with width = bin width, rather than normal noise.

Ksharp
Super User

Rick,

Both distribution could accepted , since we don't know what data distribution exactly is .

But I am rooting for Normal, since Normal distribution represent the error item in most Stat Model .

About your question " how to choose the width of the noise ".

I know how to settle it . I will make the width of bin equal 6*sigma which could guarantee 99.8% data could fall into a bin,

or 5*sigma 99.3% . That doesn't matter. The following code is fixed in order to making all data fall into the bin .

Code: Program1.sas

data have;
infile cards truncover expandtabs;
input a b p : percent8.;
n+1;
x=mean(a,b);
pp+p;
cards;
5 10 6%
10 15 9%
15 20 11%
20 25 17%
25 30 7%
30 35 13%
35 40 8%
40 45 8%
45 50 6%
50 55 4%
55 60 4%
60 65 4%
65 70 3%
;
run;
proc sql;
select p into : list separated by ',' from have;
quit;
data want;

declare hash ha(dataset:'have');
ha.definekey('n');
ha.definedata('x');
ha.definedone();
std=5/6; /*5 is the width of bin*/
call streaminit(1234);
do i=1 to 10000;
n=rand('table',&list);
ha.find();
x=x+rand('normal',0,std);
output;
end;
drop i;
run;
proc univariate data=want;
var x;
histogram x/kernel;
run;
 

Xia Keshan

Ksharp
Super User

Rick,

Hope one day ,I will buy your these two books , which I definitely should read and digest .

But I am now focus on other things.

Anyway, I have learned tons of Mathematics knowledge from your Blog . Thanks.

Xia Keshan

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 4543 views
  • 7 likes
  • 5 in conversation