- Normal Weight distribution problem in simulated da...

12-11-2013 01:14 PM

I created a simulated dataset; however, when I look at the distribution for weight, it includes negative values. If I put in a minimum value, it skews the distribution. Any ideas on what to change in the code?

/*if age ge 20 then do;

if gender="M" then weight_lbs=round(rand('NORMAL',189.8,59.1),1);

else if gender="F" then weight_lbs=round(rand('NORMAL',162.9,65.6),1);

if gender="M" then height_inches=round(rand('NORMAL',69.2,6.6),1);

else if gender="F" then height_inches=round(rand('NORMAL',63.8,6.6),1);

end;*/

With a minimum, it skews the distribution. See below.

if age ge 20 then do;

if gender="M" then do;

weight_lbs=max(100,round(rand('NORMAL',189.8+5*(height_inches-69.2),59.1),1));

end;

else if gender="F" then do;

weight_lbs=max(90,round(rand('NORMAL',162.9+5*(height_inches-63.8),65.6),1));

end;

end;
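To see why the floor skews the result: max() does not discard low draws, it moves every draw below the bound onto the bound itself, so a spike appears at 100 (or 90). A minimal sketch of how much probability lands on that spike, assuming the plain means and standard deviations from the first code block and ignoring the height adjustment (the dataset name floor_mass is made up for illustration):

```sas
/* Share of draws that max() would pile up at the floor.           */
/* Assumes the unadjusted normal parameters quoted in the thread.  */
data floor_mass;
   p_m = cdf('NORMAL', 100, 189.8, 59.1);  /* males: P(draw < 100)  */
   p_f = cdf('NORMAL',  90, 162.9, 65.6);  /* females: P(draw < 90) */
   put p_m= percent8.1 p_f= percent8.1;    /* written to the log    */
run;
```

With these parameters the shares come out at roughly 6% for males and 13% for females, which is large enough to show up as a visible bar at the minimum in a histogram.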

Posted in reply to slivingston

12-11-2013 02:41 PM

Well, the problem lies in assuming that the data has a normal distribution. We know, a priori, that there is a cutoff on the low end at zero, so the real underlying data is skewed, and most likely follows a log normal distribution or something similar. When you plug in values from historical controls with a specified mean and standard deviation, you have to expect that you could get negative values (and here I am not at all surprised, as the mean for female weight is only about 2.5 standard deviations above zero, and for males about 3 standard deviations above zero).

I see two choices. You have already coded the first--you live with the skew generated by truncating. The other involves simulating a log normal distribution with specified parameters, which is a bit more difficult than a single function call. See Rick Wicklin's *Simulating Data with SAS*, page 111 for an example.

Something like:

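sigma2_m=log(1+(59.1/189.8)**2);

lnweight_m=rand('NORMAL', log(189.8)-sigma2_m/2, sqrt(sigma2_m));

weight_lbs_m=round(exp(lnweight_m),1);

where the parameters passed to the rand function are the log-scale mean and standard deviation that reproduce the mean (189.8) and standard deviation (59.1) you currently use. Simply taking the natural logs of those two values (5.24597 and 4.0792) would not work: a log-scale standard deviation of about 4 produces absurdly heavy tails. This might be more useful. But be warned, the resulting data will be skewed. No getting around that.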

Steve Denham
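For reference, the standard lognormal moment relations show how a mean and standard deviation on the pounds scale map to the two parameters on the log scale. The symbols m, s, mu and sigma are notation introduced here for illustration; they do not appear in the thread.

```latex
\begin{aligned}
\ln W &\sim N(\mu, \sigma^2), \\
E[W] &= e^{\mu + \sigma^2/2} = m, \qquad
\operatorname{Var}[W] = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2} = s^2, \\
\sigma^2 &= \ln\!\left(1 + \frac{s^2}{m^2}\right), \qquad
\mu = \ln m - \frac{\sigma^2}{2}.
\end{aligned}
```

For the male figures quoted in the thread (m = 189.8, s = 59.1) this gives sigma squared of about 0.0925, sigma of about 0.304 and mu of about 5.200. The log-scale mean is therefore slightly below ln(189.8) = 5.246, and the log-scale standard deviation is far smaller than ln(59.1) = 4.079.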

Posted in reply to slivingston

12-11-2013 03:33 PM

You're also likely to have an issue with simulated data containing people that are too tall, or at least more in the over 7 feet range than you expect.

Posted in reply to slivingston

12-13-2013 02:30 PM

As others have said, the real issue is "what is the model?" Once you decide on a model, the simulation simulates from that model. If you get nonsensical data, you have to revise the model.

If you want data that looks normally distributed but is truncated outside of some interval [a,b], you can use the truncated normal distribution. See "Implement the truncated normal distribution in SAS" on The DO Loop blog.
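One common way to draw from a truncated normal is the inverse-CDF method: draw a uniform value between the CDF at a and the CDF at b, then map it back through the normal quantile function. The sketch below is an illustration under stated assumptions, not necessarily the approach the blog post takes. The upper bound of 400, the seed and the dataset name trunc_weight are all made up; the mean, standard deviation and lower bound are the male values from the thread.

```sas
/* Truncated normal on [a, b] via the inverse-CDF method. */
data trunc_weight;
   call streaminit(12345);        /* arbitrary seed, for reproducibility    */
   mu = 189.8; sigma = 59.1;      /* male mean and SD quoted in the thread  */
   a  = 100;                      /* lower bound used in the thread         */
   b  = 400;                      /* upper bound: assumption, illustration  */
   pa = cdf('NORMAL', a, mu, sigma);
   pb = cdf('NORMAL', b, mu, sigma);
   do i = 1 to 10000;
      u = pa + (pb - pa)*rand('UNIFORM');   /* uniform on (pa, pb) */
      weight_lbs = round(quantile('NORMAL', u, mu, sigma), 1);
      output;
   end;
   keep i weight_lbs;
run;
```

Unlike max(), this leaves no spike at the bound: values outside [a,b] are never generated, and the density inside the interval keeps its normal shape. The mean of the truncated sample will still differ from 189.8, because the cut removes more of the lower tail than the upper one.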

Posted in reply to slivingston

12-17-2013 02:57 PM

Rick, I reviewed your documentation and still have some confusion. I have pasted my code below. I am trying to create a model based on the literature regarding the US population's mean height and weight. However, even after my coding changes, my weight distribution is still one-sided.

libname save "libname location here";

%let numsims=10000;

data sim1;

input x;

datalines;

.

;

run;

data sim1;

set sim1;

do i=1 to &numsims; *change this to desired number of simulations (e.g. 10,000);

output;

end;

drop x;

run;

data sim1;

set sim1;

id=i;

race_ethnicity_rand=rand('UNIFORM');

if race_ethnicity_rand le 0.8 then race_ethnicity="A";

else if race_ethnicity_rand le 5.3 then race_ethnicity="B";

else if race_ethnicity_rand le 17.6 then race_ethnicity="C";

else if race_ethnicity_rand le 82.7 then race_ethnicity="D";

else if race_ethnicity_rand le 98.5 then race_ethnicity="E";

else race_ethnicity="X";

systolic_rand=rand('UNIFORM');

if systolic_rand le .286 then systolic=round(140+39*rand('UNIFORM'));

else systolic=round(90+39*rand('UNIFORM'));

gender_rand=rand('BERNOULLI',0.5);

if gender_rand=0 then gender="M";

else gender="F"; **need to check the gender distribution;

age10_rand=rand('UNIFORM'); **check distribution of age ranges below;

if age10_rand le .05 then age10=0; *5%;

else if age10_rand le .15 then age10=1; *10%;

else if age10_rand le .35 then age10=2; *20%;

else if age10_rand le .55 then age10=3; *20%;

else if age10_rand le .74 then age10=4; *19%;

else if age10_rand le .84 then age10=5; *10%;

else if age10_rand le .93 then age10=6; *9%;

else if age10_rand le .97 then age10=7; *4%;

else age10=8; *3%;

age=10*age10 + floor(10*rand('UNIFORM'));

if age ge 20 then do;

*if gender="M" then weight_lbs=round(rand('NORMAL',189.8,59.1),1);

*else if gender="F" then weight_lbs=round(rand('NORMAL',162.9,65.6),1);

if gender="M" then height_inches=round(rand('NORMAL',69.2,6.6),1);

else if gender="F" then height_inches=round(rand('NORMAL',63.8,6.6),1);

end;

if age ge 20 then do;

if gender="M" then do;

weight_lbs=max(100,round(rand('NORMAL',189.8+5*(height_inches-69.2),59.1),1));

end;

else if gender="F" then do;

weight_lbs=max(90,round(rand('NORMAL',162.9+5*(height_inches-63.8),65.6),1));

end;

end;

proc print data=sim1;

run;

Posted in reply to slivingston

12-17-2013 03:11 PM

That is because the true distribution is "one-sided." Weight and height are not normally distributed, so assuming that they are is the source of the problem. There are biological reasons for this.

Also, check your coding for generation of the race/ethnicity variable. rand('UNIFORM') returns a value between 0 and 1, but your cutoffs (0.8, 5.3, 17.6, ...) are on a percent scale, so you may want to replace the code with:

race_ethnicity_rand=100*rand('UNIFORM');
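An alternative that avoids the if/else ladder is the 'TABLE' distribution of the rand function, which returns a category index with the probabilities you list. The sketch below is a hedged illustration: the probabilities are the differences between the cutoffs in the posted code (0.8%, 4.5%, 12.3%, 65.1%, 15.8%, with the remaining 1.5% going to "X"), and the dataset name race_demo and the seed are made up.

```sas
/* Categorical draw with rand('TABLE', ...).                         */
/* Five probabilities are listed; index 6 is returned with the       */
/* remaining probability (1.5%), which maps to category "X".         */
data race_demo;
   call streaminit(20131217);     /* arbitrary seed */
   array codes[6] $1 _temporary_ ('A' 'B' 'C' 'D' 'E' 'X');
   do id = 1 to 10000;
      k = rand('TABLE', 0.008, 0.045, 0.123, 0.651, 0.158);
      race_ethnicity = codes[k];
      output;
   end;
   drop k;
run;

proc freq data=race_demo;
   tables race_ethnicity;         /* check the simulated shares */
run;
```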

Also, at some point, you will probably want to install some sort of seed control, or else you will get different values every time you run.
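A minimal sketch of seed control for the rand function, assuming you are happy with any fixed positive integer as the seed (the value and the dataset name sim_seeded here are arbitrary):

```sas
/* call streaminit fixes the random-number stream for rand().       */
/* Place it before the first rand() call in the data step.          */
data sim_seeded;
   call streaminit(20131217);     /* any fixed positive integer */
   do id = 1 to 5;
      u = rand('UNIFORM');
      output;
   end;
run;
```

Rerunning this step with the same seed reproduces the same five values; changing the seed gives a different but equally repeatable stream.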

Steve Denham

Posted in reply to slivingston

12-17-2013 03:15 PM

So if it's not "normal," how would you code for height/weight?

This might be the seed of my confusion.

Posted in reply to slivingston

12-17-2013 04:19 PM

If I had a large enough data set of actual values I'd be very tempted to select from that using Proc Surveyselect. Check the CDC website for NHIS or BRFSS datasets.
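A rough sketch of that resampling idea follows. Since no NHIS or BRFSS extract is at hand here, it uses sashelp.heart (the Framingham sample that ships with SAS) purely as a stand-in for a real dataset; the output name, sample size and seed are assumptions for illustration.

```sas
/* Resample real height/weight records with replacement.            */
/* sashelp.heart is only a stand-in for an NHIS or BRFSS extract.   */
proc surveyselect
      data=sashelp.heart(keep=Sex Height Weight
                         where=(not missing(Height) and not missing(Weight)))
      out=work.sim_sample
      method=urs        /* unrestricted random sampling (with replacement) */
      sampsize=10000
      outhits           /* one output row per selection                    */
      seed=12345;       /* arbitrary seed                                  */
run;
```

Because whole records are drawn, each simulated person keeps a real height and weight pair, so the correlation between the two and the skew in weight come along without having to model them.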

Posted in reply to slivingston

12-18-2013 01:21 PM

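I would code these as lognormally distributed, with the parameters on the log scale chosen to match the mean and standard deviation you already have.

Something like:

sigma2_m=log(1+(59.1/189.8)**2);

lnweight_m=rand('NORMAL', log(189.8)-sigma2_m/2, sqrt(sigma2_m));

weight_lbs_m=round(exp(lnweight_m),1);

where the parameters passed to the rand function are the log-scale mean and standard deviation that reproduce the mean (189.8) and standard deviation (59.1) you currently use. Simply taking the natural logs of those two values (5.24597 and 4.0792) would not work: a log-scale standard deviation of about 4 produces absurdly heavy tails. This might be more useful. But be warned, the resulting data will be skewed. No getting around that.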

And I mean, there is no getting around the fact that the real data are not normally distributed. If they were, there would be a nonzero probability of people with negative heights or weights.

Steve Denham