## Normal Weight distribution problem in simulated dataset

Occasional Contributor
Posts: 9

I created a simulated dataset; however, when I look at the distribution for weight, it includes negative values. If I impose a minimum value, it skews the distribution. Any ideas on what to change in the code?

/*'if age ge 20 then do;

if gender="M" then weight_lbs=round(rand('NORMAL',189.8,59.1),1);

else if gender="F" then weight_lbs=round(rand('NORMAL',162.9,65.6),1);

if gender="M" then height_inches=round(rand('NORMAL',69.2,6.6),1);

else if gender="F" then height_inches=round(rand('NORMAL',63.8,6.6),1);

end;*/

With a minimum, it skews the distribution. See below.

if age ge 20 then do;

if gender="M" then do;

weight_lbs=max(100,round(rand('NORMAL',189.8+5*(height_inches-69.2),59.1),1));

end;

else if gender="F" then do;

weight_lbs=max(90,round(rand('NORMAL',162.9+5*(height_inches-63.8),65.6),1));

end;

end;

Posts: 2,655

## Re: Normal Weight distribution problem in simulated dataset

Well, the problem lies in assuming that the data have a normal distribution. We know, a priori, that there is a cutoff on the low end at zero, so the real underlying data are skewed and most likely follow a lognormal distribution or something similar. When you plug in values from historical controls with a specified mean and standard deviation, you have to expect that you could get negative values. Here I am not at all surprised, as the mean for female weight is only about 2.5 standard deviations above zero (162.9/65.6), and for males about 3.2 standard deviations above zero (189.8/59.1).
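To put a number on how often the normal model goes negative, one option is to evaluate the normal CDF at zero. This is a minimal sketch using the means and standard deviations quoted in the thread; the variable names `p_neg_f` and `p_neg_m` are made up for illustration.

```sas
/* Share of simulated adult weights expected to fall below zero
   under the normal models used in this thread. */
data _null_;
   p_neg_f = cdf('NORMAL', 0, 162.9, 65.6);   /* female: zero is about 2.5 SD below the mean */
   p_neg_m = cdf('NORMAL', 0, 189.8, 59.1);   /* male:   zero is about 3.2 SD below the mean */
   put p_neg_f= percent8.2 p_neg_m= percent8.2;   /* written to the SAS log */
run;
```

With 10,000 simulated records, even a fraction of a percent translates into a noticeable handful of negative weights.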

I see two choices.  You have already coded the first--you live with the skew generated by truncating.  The other involves simulating a log normal distribution with specified parameters, which is a bit more difficult than a single function call.  See Rick Wicklin's Simulating Data with SAS, page 111 for an example.

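Something like:

sigma2=log(1+(59.1/189.8)**2);

mu=log(189.8)-sigma2/2;

lnweight_m=rand('NORMAL', mu, sqrt(sigma2));

weight_lbs_m=round(exp(lnweight_m),1);

where mu and sigma2 are the mean and variance on the log scale, chosen so that the lognormal weights keep the mean (189.8) and standard deviation (59.1) you currently use. Passing log(189.8) and log(59.1) directly would not work, because log(59.1), about 4.08, is far too large a standard deviation on the log scale. This might be more useful. But be warned, the resulting data will be skewed. No getting around that.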
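Matching a target mean and standard deviation with a lognormal takes one extra step, because the log of the standard deviation is not the standard deviation of the logs. As a sketch, write m and s for the target mean and standard deviation of weight, and μ and σ for the parameters of the underlying normal (these symbols are notation introduced here, not from the thread). The standard lognormal moment identities then give:

```latex
\begin{aligned}
\mathbb{E}[X] &= m = e^{\mu + \sigma^2/2}, \qquad
\operatorname{Var}(X) = s^2 = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2} \\
\sigma^2 &= \ln\!\left(1 + \frac{s^2}{m^2}\right), \qquad
\mu = \ln m - \frac{\sigma^2}{2}
\end{aligned}
```

For the male weight values (m = 189.8, s = 59.1) this works out to roughly σ ≈ 0.304 and μ ≈ 5.200.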

Steve Denham

Super User
Posts: 13,583

## Re: Normal Weight distribution problem in simulated dataset

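You're also likely to have an issue with simulated data containing people who are too tall, or at least more in the over-7-feet range than you expect. With a mean of 69.2 inches and a standard deviation of 6.6, 84 inches is only about 2.2 standard deviations above the male mean.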

SAS Super FREQ
Posts: 4,247

## Re: Normal Weight distribution problem in simulated dataset

As others have said, the real issue is "what is the model?" Once you decide on a model, the simulation simulates from that model. If you get nonsensical data, you have to revise the model.

If you want data that looks normally distributed but is truncated outside of some interval [a,b], you can use the truncated normal distribution. See "Implement the truncated normal distribution in SAS" on The DO Loop blog.
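One common way to draw from a truncated normal is the inverse-CDF method: draw a uniform value between the CDF at a and the CDF at b, then map it back through the normal quantile function. The sketch below is not taken from the blog post; the bounds `a` and `b`, the seed, and the dataset name `trunc_demo` are illustrative assumptions.

```sas
/* Sketch: truncated normal on [a, b] by the inverse-CDF method.
   The bounds a and b are illustrative assumptions, not literature values. */
data trunc_demo;
   call streaminit(12345);            /* arbitrary seed for reproducibility */
   mu = 162.9;  sigma = 65.6;         /* female weight parameters from the thread */
   a = 90;      b = 400;              /* hypothetical plausible range, in lbs */
   Fa = cdf('NORMAL', a, mu, sigma);  /* probability mass below the lower bound */
   Fb = cdf('NORMAL', b, mu, sigma);  /* probability mass below the upper bound */
   do i = 1 to 10000;
      u = rand('UNIFORM');
      /* rescale u into (Fa, Fb), then invert the normal CDF */
      weight_lbs = round(quantile('NORMAL', Fa + u*(Fb - Fa), mu, sigma), 1);
      output;
   end;
   keep i weight_lbs;
run;
```

Unlike `max(90, ...)`, this does not pile up a spike of observations at the lower bound. The result is still not symmetric, though, because the lower cutoff sits much closer to the mean than the upper one.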

Occasional Contributor
Posts: 9

## Re: Normal Weight distribution problem in simulated dataset

Rick, I reviewed your documentation and still have some confusion. I have pasted my code below. I am trying to create a model based on the literature values for the US population's mean height and weight. However, even with this code, my weight distribution is still one-sided.

libname save "libname location here";

%let numsims=10000;

data sim1;

input x;

datalines;

.

;

run;

data sim1;

set sim1;

do i=1 to &numsims; *change this to desired number of simulations (e.g. 10,000);

output;

end;

drop x;

run;

data sim1;

set sim1;

id=i;

race_ethnicity_rand=rand('UNIFORM');

if race_ethnicity_rand le 0.8 then race_ethnicity="A";

else if race_ethnicity_rand le 5.3 then race_ethnicity="B";

else if race_ethnicity_rand le 17.6 then race_ethnicity="C";

else if race_ethnicity_rand le 82.7 then race_ethnicity="D";

else if race_ethnicity_rand le 98.5 then race_ethnicity="E";

else race_ethnicity="X";

systolic_rand=rand('UNIFORM');

if systolic_rand le .286 then systolic=round(140+39*rand('UNIFORM'));

else systolic=round(90+39*rand('UNIFORM'));

gender_rand=rand('BERNOULLI',0.5);

if gender_rand=0 then gender="M";

else gender="F"; **need to check the gender distribution;

age10_rand=rand('UNIFORM'); **check distribution of age ranges below;

if age10_rand le .05 then age10=0; *5%;

else if age10_rand le .15 then age10=1; *10%;

else if age10_rand le .35 then age10=2; *20%;

else if age10_rand le .55 then age10=3; *20%;

else if age10_rand le .74 then age10=4; *19%;

else if age10_rand le .84 then age10=5; *10%;

else if age10_rand le .93 then age10=6; *9%;

else if age10_rand le .97 then age10=7; *4%;

else age10=8; *3%;

age=10*age10 + floor(10*rand('UNIFORM'));

if age ge 20 then do;

*if gender="M" then weight_lbs=round(rand('NORMAL',189.8,59.1),1);

*else if gender="F" then weight_lbs=round(rand('NORMAL',162.9,65.6),1);

if gender="M" then height_inches=round(rand('NORMAL',69.2,6.6),1);

else if gender="F" then height_inches=round(rand('NORMAL',63.8,6.6),1);

end;

if age ge 20 then do;

if gender="M" then do;

weight_lbs=max(100,round(rand('NORMAL',189.8+5*(height_inches-69.2),59.1),1));

end;

else if gender="F" then do;

weight_lbs=max(90,round(rand('NORMAL',162.9+5*(height_inches-63.8),65.6),1));

end;

end;

proc print data=sim1;

run;

Posts: 2,655

## Re: Normal Weight distribution problem in simulated dataset

That is because the true distribution is "one-sided": it is bounded below at zero and skewed to the right. Weight and height are not normally distributed, so assuming that they are is what causes the trouble. There are biological reasons for this.

Also, check your coding for generation of the race/ethnicity variable. rand('UNIFORM') returns a value between 0 and 1, while your cutoffs are on a 0 to 100 scale, so you may want to replace the code with:

race_ethnicity_rand=100*rand('UNIFORM');
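An alternative to a chain of cutoffs is the 'TABLE' distribution of the RAND function, which returns a category index directly from a list of probabilities. This is a sketch only: the probabilities are the differences between the cumulative cutoffs in the posted code (0.8, 5.3, 17.6, 82.7, 98.5) expressed as proportions, and the dataset name `race_demo` and the seed are assumptions.

```sas
/* Sketch: draw the race/ethnicity category with RAND('TABLE').
   Five probabilities are listed; the sixth category ("X") receives
   the remaining probability of 0.015. */
data race_demo;
   call streaminit(12345);                 /* arbitrary seed */
   length race_ethnicity $1;
   do id = 1 to 10000;
      k = rand('TABLE', 0.008, 0.045, 0.123, 0.651, 0.158);  /* returns 1..6 */
      race_ethnicity = substr('ABCDEX', k, 1);               /* map index to code */
      output;
   end;
   drop k;
run;

/* Check the simulated proportions against the targets */
proc freq data=race_demo;
   tables race_ethnicity;
run;
```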

Also, at some point, you will probably want to add some sort of seed control (for example, CALL STREAMINIT), or else you will get different values every time you run.
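For the RAND function, the seed is set with CALL STREAMINIT, which has to run before the first RAND call in the DATA step. A minimal sketch, with an arbitrary seed value and a made-up dataset name:

```sas
/* Sketch: reproducible random numbers with CALL STREAMINIT. */
data seed_demo;
   call streaminit(20240101);          /* arbitrary seed; any positive integer works */
   do i = 1 to 5;
      height_inches = rand('NORMAL', 69.2, 6.6);   /* male height parameters from the thread */
      output;
   end;
run;
```

Rerunning this step with the same seed should reproduce the same five values; changing the seed gives a different stream.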

Steve Denham

Occasional Contributor
Posts: 9

## Re: Normal Weight distribution problem in simulated dataset

So if it's not "normal," how would you code for height/weight?

This might be the seed of my confusion.

Super User
Posts: 13,583

## Re: Normal Weight distribution problem in simulated dataset

If I had a large enough data set of actual values I'd be very tempted to select from that using Proc Surveyselect. Check the CDC website for NHIS or BRFSS datasets.
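Resampling from real records might look something like the sketch below. The input dataset `work.real_adults` is hypothetical (for example, an extract of adult height and weight records prepared beforehand), and the seed and sample size are placeholders.

```sas
/* Sketch: resample real records instead of simulating from a formula.
   work.real_adults is a hypothetical dataset of actual height/weight values. */
proc surveyselect data=work.real_adults
                  out=sim_from_real
                  method=urs          /* unrestricted random sampling, with replacement */
                  sampsize=10000
                  outhits             /* one output row per selection */
                  seed=12345;         /* arbitrary seed */
run;
```

Because whole records are drawn together, the observed relationship between height and weight carries over without having to model it.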

Posts: 2,655

## Re: Normal Weight distribution problem in simulated dataset

I would code these as lognormally distributed, with the log-scale parameters chosen so that the simulated values keep the mean and standard deviation you currently use. As I said above:

Something like:

sigma2=log(1+(59.1/189.8)**2);

mu=log(189.8)-sigma2/2;

lnweight_m=rand('NORMAL', mu, sqrt(sigma2));

weight_lbs_m=round(exp(lnweight_m),1);

where mu and sigma2 are the mean and variance on the log scale. Passing log(189.8) and log(59.1) directly would not work, because log(59.1), about 4.08, is far too large a standard deviation on the log scale. This might be more useful. But be warned, the resulting data will be skewed. No getting around that.

And I mean, there is no getting around the fact that the real data are not normally distributed. If they were, there would be a nonzero probability of people with negative heights or weights.
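One way to check a lognormal setup is to simulate a large sample and compare its summary statistics with the targets. The sketch below assumes the moment-matching parameterization (log-scale variance log(1 + (s/m)**2) and log-scale mean log(m) minus half that variance); the dataset name and seed are placeholders.

```sas
/* Sketch: simulate lognormal male weights and check mean, SD, and skewness. */
data lognorm_check;
   call streaminit(12345);               /* arbitrary seed */
   m = 189.8;  s = 59.1;                 /* male weight mean and SD from the thread */
   sigma2 = log(1 + (s/m)**2);           /* variance on the log scale */
   mu     = log(m) - sigma2/2;           /* mean on the log scale */
   do i = 1 to 100000;
      weight_lbs = exp(rand('NORMAL', mu, sqrt(sigma2)));
      output;
   end;
   keep weight_lbs;
run;

proc means data=lognorm_check n mean std min max skewness;
   var weight_lbs;
run;
```

The mean and standard deviation should land close to 189.8 and 59.1, the minimum stays above zero by construction, and the skewness comes out positive, which is the skew referred to above.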

Steve Denham
