turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Code examples for creating random generated values

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-20-2015 12:28 PM

Hello-

Longtime SAS programmer, first time posting to this site.

Here is what I know....

*I know that some data that I am interested in studying is shaped like a normal distribution with a slight skew to the right or, in some cases, closely approximated by a lognormal distribution.

*I do not have access to the raw data, but I have seen a histogram which led me to the above conclusion about the shape and bounds.

*The data is lower-bounded by zero and upper-bounded at 100. The mean is 29, but I can only infer the standard deviation by visual inspection of the graph. I am using the "range rule" to estimate the standard deviation--range divided by four is an approximate of the standard deviation.

*I would like to repeatedly draw from a distribution that lies somewhere between a right-skewed normal and a lognormal with mean=29.

*I have read up about the randgen call and rand function, but the scale and shape parameters for the beta and lognormal functions confuse me.

QUESTION: Can you send a code example so that I can play around with alternative distributions that are lower bounded by zero and upper-bounded at 100--with mean at 29.

Thank you so much for the help. I have never had to fit a probability distribution based on a graphic of a histogram--with no underlying data.

Accepted Solutions

Solution

07-23-2015
07:51 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-23-2015 07:51 AM

Rick,

Both distribution could accepted , since we don't know what data distribution exactly is .

But I am rooting for Normal, since Normal distribution represent the error item in most Stat Model .

About your question " how to choose the width of the noise ".

I know how to settle it . I will make the width of bin equal 6*sigma which could guarantee 99.8% data could fall into a bin,

or 5*sigma 99.3% . That doesn't matter. The following code is fixed in order to making all data fall into the bin .

data have;

infile cards truncover expandtabs;

input a b p : percent8.;

n+1;

x=mean(a,b);

pp+p;

cards;

5 10 6%

10 15 9%

15 20 11%

20 25 17%

25 30 7%

30 35 13%

35 40 8%

40 45 8%

45 50 6%

50 55 4%

55 60 4%

60 65 4%

65 70 3%

;

run;

proc sql;

select p into : list separated by ',' from have;

quit;

data want;

declare hash ha(dataset:'have');

ha.definekey('n');

ha.definedata('x');

ha.definedone();

std=5/6; /*5 is the width of bin*/

call streaminit(1234);

do i=1 to 10000;

n=rand('table',&list);

ha.find();

x=x+rand('normal',0,std);

output;

end;

drop i;

run;

proc univariate data=want;

var x;

histogram x/kernel;

run;

Xia Keshan

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-20-2015 01:04 PM

While not exactly an answer to your question, there are some really neat tools to turn graphs back into data sets. Here are a few of the tools.

http://getdata-graph-digitizer.com/

xyExtract Graph Digitizer 5.1 - Free download

DigitizeIt - Graph Digitizer Software. Digitize graphs, charts and math data

And to get you started in SAS. http://support.sas.com/resources/papers/proceedings15/SAS1387-2015.pdf

Ken

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-20-2015 06:42 PM

I'm not sure if this is statistically valid or not, but here's an approach.

Generate a random sample from a mean of 29 and small standard deviation. Keep the records that fall between 0 and 29.

Generate a second random sample from a mean of 29 and a larger standard deviation (to mimic the skewness in the fat tail). Keep the records that fall between 29 and 100.

Combine the two, should approximate the distribution you are looking for.

One issue: you might want to take a larger number of records from the second sample than the first, possibly in proportion to the area under the curve below 29 vs. area under the curve above 29.

Good luck.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-21-2015 06:36 AM

The hard part about these situations is deciding whether to model the distribution (as you and Astounding propose), or whether to simulate from the empirical distribution function. I usually model when I have some domain expertise and can make informed assumptions about the distribution. If I know nothing about the data or the proess that generated them, I would be inclined to simulate from the empirical CDF, based on the published quantiles.

If you have the histogram, then you have many quantiles of the distribution, which will be more powerful than just the mean, standard deviation, and skew. You can use the ideas in the article "Approximating a distribution from published quantiles" to carry out the simulation. (Search for "download" if you want the SAS/IML program that analyzed the data.)

To convert the histogram to quantiles, you just take the cumulative sum of the counts or percentages for each bin. For example, if your histogram has three bins:

BIN COUNT CUSUM CUM_PROPORTION

[0,1) 2 2 2/6

[1,2) 3 5 5/6

[2,3] 1 6 6/6

Then the corresponding empirical CDF is a picewise linear function passing through

(0,0), (1, 2/6), (3, 5/6), (4,1).

After you have the CDF, you can use inverse CDF sampling to obtain simulated data.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-21-2015 11:00 AM

All-

Thanks for all of the insights to my problem.

Rick-

I like your approach, because it is both effective and elegant. I read your articles, but am having a difficult time figuring out how to simulate the data given my unique CDF (see below for the actual data). I should also mention that I do not have PROC IML installed, so I will probably have to use the inverse CDF algorithm using the RAND() function.

*****Can you adapt your code to accommodate the following CDF?*****

Many thanks for the help here. For your part, I am happy to cite your article/technique in my manuscript.

Bin | Cu_Sum | Cum_Proportion |

[0,5] | 500 | 0% |

[5,10] | 17500 | 6% |

[10,15] | 42000 | 15% |

[15,20] | 74000 | 26% |

[20,25] | 123500 | 43% |

[25,30] | 145500 | 50% |

[30,35] | 183000 | 63% |

[35,40] | 205500 | 71% |

[40,45] | 227500 | 79% |

[45,50] | 246500 | 85% |

[50,55] | 258500 | 89% |

[55,60] | 268600 | 93% |

[60,65] | 280100 | 97% |

[65,70] | 289100 | 100% |

[70,75] | 289200 | 100% |

[75,80] | 289200 | 100% |

[80,85] | 289300 | 100% |

[85,90] | 289400 | 100% |

[90,95] | 289500 | 100% |

[95,100] | 289500 | 100% |

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-21-2015 11:44 AM

Use PROC FCMP to define the inverse CDF by using a lot if IF-THEN/ELSE statements. The input is a number in [0,1]. The output is a number in the domain of the CDF, which is [0,100] for these data.

For example, if the input value is t=0.75, then the function would

1) Figure out that x is in the bin of cumulative proportions [0.71, 0.79].

2) Compute where x is in the interval by using p(x) = (x-0.71)/(0.79-0.71). For x=0.75, you get p(0.75)=0.5.

3) Evaluate a linear function to get the y value. The interval that corresponds to [0.71, 0.79] is [40,45].

The value that is 0.5 along that interval is y = 40 + (45-40)*0.5 = 42.5.

Thus 42.5 would be the output for an input of 0.75. When you pass in random uniform variates, you will get out random y's that are distributed according to your empirical distribution function.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-22-2015 10:22 AM

How about this one ?

data have;

infile cards truncover expandtabs;

input a b p : percent8.;

n+1;

x=mean(a,b);

pp+p;

cards;

5 10 6%

10 15 9%

15 20 11%

20 25 17%

25 30 7%

30 35 13%

35 40 8%

40 45 8%

45 50 6%

50 55 4%

55 60 4%

60 65 4%

65 70 3%

;

run;

proc sql;

select p into : list separated by ',' from have;

quit;

data want;

declare hash ha(dataset:'have');

ha.definekey('n');

ha.definedata('x');

ha.definedone();

call streaminit(1234);

do i=1 to 10000;

n=rand('table',&list);

ha.find();

x=x+rand('normal');

output;

end;

drop i;

run;

proc univariate data=want;

var x;

histogram x/kernel;

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-22-2015 11:30 AM

This approach is known as the "smooth bootstrap." You can read about it and how to choose the width of the noise in Chapter 15.8 of *Simulting Data with SAS.* For this application, I'd recommend that adding random uniform noise with width = bin width, rather than normal noise.

Solution

07-23-2015
07:51 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-23-2015 07:51 AM

Rick,

Both distribution could accepted , since we don't know what data distribution exactly is .

But I am rooting for Normal, since Normal distribution represent the error item in most Stat Model .

About your question " how to choose the width of the noise ".

I know how to settle it . I will make the width of bin equal 6*sigma which could guarantee 99.8% data could fall into a bin,

or 5*sigma 99.3% . That doesn't matter. The following code is fixed in order to making all data fall into the bin .

data have;

infile cards truncover expandtabs;

input a b p : percent8.;

n+1;

x=mean(a,b);

pp+p;

cards;

5 10 6%

10 15 9%

15 20 11%

20 25 17%

25 30 7%

30 35 13%

35 40 8%

40 45 8%

45 50 6%

50 55 4%

55 60 4%

60 65 4%

65 70 3%

;

run;

proc sql;

select p into : list separated by ',' from have;

quit;

data want;

declare hash ha(dataset:'have');

ha.definekey('n');

ha.definedata('x');

ha.definedone();

std=5/6; /*5 is the width of bin*/

call streaminit(1234);

do i=1 to 10000;

n=rand('table',&list);

ha.find();

x=x+rand('normal',0,std);

output;

end;

drop i;

run;

proc univariate data=want;

var x;

histogram x/kernel;

run;

Xia Keshan

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-23-2015 07:56 AM

Rick,

Hope one day ,I will buy your these two books , which I definitely should read and digest .

But I am now focus on other things.

Anyway, I have learned tons of Mathematics knowledge from your Blog . Thanks.

Xia Keshan

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

07-22-2015 10:23 AM