slivingston
Calcite | Level 5

I created a simulated dataset; however, when I look at the distribution for weight, it includes negative values. If I set a minimum value, it skews the distribution. Any ideas on what to change in the code?

/* if age ge 20 then do;
    if gender="M" then weight_lbs=round(rand('NORMAL',189.8,59.1),1);
    else if gender="F" then weight_lbs=round(rand('NORMAL',162.9,65.6),1);
    if gender="M" then height_inches=round(rand('NORMAL',69.2,6.6),1);
    else if gender="F" then height_inches=round(rand('NORMAL',63.8,6.6),1);
end; */

With a minimum, it skews the distribution. See below.

if age ge 20 then do;
    if gender="M" then do;
        weight_lbs=max(100,round(rand('NORMAL',189.8+5*(height_inches-69.2),59.1),1));
    end;
    else if gender="F" then do;
        weight_lbs=max(90,round(rand('NORMAL',162.9+5*(height_inches-63.8),65.6),1));
    end;
end;

SteveDenham
Jade | Level 19

Well, the problem lies in assuming that the data has a normal distribution.  We know, a priori, that there is a cutoff on the low end at zero, so the real underlying data is skewed, and most likely follows a log normal distribution or something similar.  When you plug in values from historical controls with a specified mean and standard deviation, you have to expect that you could get negative values (and here I am not at all surprised, as the mean for female weight is only about 2.5 standard deviations above zero, and for males about 3 standard deviations above zero).
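To make the "standard deviations above zero" point concrete, the probability of a negative draw can be computed from the normal CDF. This is a minimal sketch using the means and standard deviations from the original post; the variable names are only illustrative.

```sas
/* Sketch: how likely is a negative weight under the normal models in the question? */
data _null_;
    z_f = 162.9 / 65.6;                        /* female mean, in SD units above zero */
    z_m = 189.8 / 59.1;                        /* male mean, in SD units above zero   */
    p_neg_f = cdf('NORMAL', 0, 162.9, 65.6);   /* P(weight < 0), female parameters    */
    p_neg_m = cdf('NORMAL', 0, 189.8, 59.1);   /* P(weight < 0), male parameters      */
    put z_f= z_m= p_neg_f= p_neg_m=;           /* results are written to the SAS log  */
run;
```

Multiplying each probability by the number of simulated adults gives a rough count of negative weights to expect.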

I see two choices.  You have already coded the first--you live with the skew generated by truncating.  The other involves simulating a log normal distribution with specified parameters, which is a bit more difficult than a single function call.  See Rick Wicklin's Simulating Data with SAS, page 111 for an example.

Something like:

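lnweight_m=rand('NORMAL', 5.1997, 0.3042);
weight_lbs_m=round(exp(lnweight_m),1);

where the parameters passed to the rand function are the log-scale mean and standard deviation that correspond to the values you currently use (mean 189.8, SD 59.1 for males).  Note that these are not simply the natural logs of those two values: using ln(59.1) = 4.0792 as the log-scale standard deviation would give an absurdly wide spread.  This might be more useful.  But be warned, the resulting data will be skewed.  No getting around that.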
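As a rough sketch of where log-scale parameters come from, assume $\ln W \sim N(\mu, \sigma^2)$ and match the standard lognormal moments to a target mean $m$ and standard deviation $s$ on the original scale (the symbols $W$, $m$, $s$ are introduced here only for illustration):

```latex
\begin{aligned}
E[W] &= e^{\mu + \sigma^2/2} = m, \qquad
\operatorname{Var}[W] = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2} = s^2 \\[4pt]
\Rightarrow\quad \sigma^2 &= \ln\!\left(1 + \frac{s^2}{m^2}\right), \qquad
\mu = \ln m - \frac{\sigma^2}{2} \\[4pt]
m = 189.8,\ s = 59.1:\quad
\sigma^2 &\approx 0.0925, \quad \sigma \approx 0.304, \quad \mu \approx 5.246 - 0.046 = 5.200
\end{aligned}
```

So the log-scale standard deviation depends on the ratio $s/m$, not on $\ln s$.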

Steve Denham


ballardw
Super User

You're also likely to have an issue with simulated data containing people that are too tall, or at least more in the over 7 feet range than you expect.
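To put a rough number on this, the upper-tail probability under the height model in the original post can be checked directly. A minimal sketch, assuming 84 inches (7 feet) as the cutoff of interest:

```sas
/* Sketch: share of simulated adults taller than 7 feet under the posted height model */
data _null_;
    cutoff = 84;                                      /* 7 feet in inches (assumed cutoff) */
    p_tall_m = 1 - cdf('NORMAL', cutoff, 69.2, 6.6);  /* male: mean 69.2, SD 6.6           */
    p_tall_f = 1 - cdf('NORMAL', cutoff, 63.8, 6.6);  /* female: mean 63.8, SD 6.6         */
    put p_tall_m= p_tall_f=;                          /* results are written to the SAS log */
run;
```

If these shares look too large, the standard deviation of 6.6 inches is the first parameter to revisit.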

Rick_SAS
SAS Super FREQ

As others have said, the real issue is "what is the model."  Once you decide on a model, then the simulation simulates from that model.  If you get nonsensical data, you have to revise the model.

If you want data that looks normally distributed but is truncated outside of some interval [a,b], you can use the truncated normal distribution; see "Implement the truncated normal distribution in SAS" on The DO Loop blog.
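One common way to sample a truncated normal is the inverse-CDF method: draw a uniform value between F(a) and F(b), then map it back through the normal quantile function. The sketch below is not taken from the linked article; it uses the female weight parameters from the question, the lower bound of 90 from the poster's max() call, and an upper bound of 400 that is purely an assumption.

```sas
/* Sketch: truncated normal on [a, b] via the inverse-CDF method */
data trunc_weight;
    call streaminit(12345);              /* arbitrary seed for reproducibility */
    mu = 162.9;  sigma = 65.6;           /* female weight parameters from the question */
    a = 90;      b = 400;                /* b = 400 is an assumed upper bound  */
    Fa = cdf('NORMAL', a, mu, sigma);    /* normal CDF at the lower bound      */
    Fb = cdf('NORMAL', b, mu, sigma);    /* normal CDF at the upper bound      */
    do i = 1 to 10000;
        u = Fa + (Fb - Fa)*rand('UNIFORM');              /* uniform on (Fa, Fb) */
        weight_lbs = quantile('NORMAL', u, mu, sigma);   /* always within [a, b] */
        output;
    end;
    keep i weight_lbs;
run;

proc means data=trunc_weight n mean std min max;
    var weight_lbs;
run;
```

Unlike max(90, ...), this does not pile up observations at exactly 90; the density is rescaled over the whole interval instead. The sample mean will still sit above 162.9 because more mass is cut from the lower tail than from the upper one.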

slivingston
Calcite | Level 5

Rick, I reviewed your documentation and am still somewhat confused. I have pasted my code below. I am trying to create a model based on literature values for the US population's mean height and weight. However, even with this code, my weight distribution is still one-sided.

libname save "libname location here";

%let numsims=10000;

data sim1;
    input x;
    datalines;
.
;
run;

data sim1;
    set sim1;
    do i=1 to &numsims; *change this to desired number of simulations (e.g. 10,000);
        output;
    end;
    drop x;
run;

data sim1;
    set sim1;
    id=i;

    race_ethnicity_rand=rand('UNIFORM');
    if race_ethnicity_rand le 0.8 then race_ethnicity="A";
    else if race_ethnicity_rand le 5.3 then race_ethnicity="B";
    else if race_ethnicity_rand le 17.6 then race_ethnicity="C";
    else if race_ethnicity_rand le 82.7 then race_ethnicity="D";
    else if race_ethnicity_rand le 98.5 then race_ethnicity="E";
    else race_ethnicity="X";

    systolic_rand=rand('UNIFORM');
    if systolic_rand le .286 then systolic=round(140+39*rand('UNIFORM'));
    else systolic=round(90+39*rand('UNIFORM'));

    gender_rand=rand('BERNOULLI',0.5);
    if gender_rand=0 then gender="M";
    else gender="F"; **need to check the gender distribution;

    age10_rand=rand('UNIFORM'); **check distribution of age ranges below;
    if age10_rand le .05 then age10=0; *5%;
    else if age10_rand le .15 then age10=1; *10%;
    else if age10_rand le .35 then age10=2; *20%;
    else if age10_rand le .55 then age10=3; *20%;
    else if age10_rand le .74 then age10=4; *19%;
    else if age10_rand le .84 then age10=5; *10%;
    else if age10_rand le .93 then age10=6; *9%;
    else if age10_rand le .97 then age10=7; *4%;
    else age10=8; *3%;

    age=10*age10 + floor(10*rand('UNIFORM'));

    if age ge 20 then do;
        *if gender="M" then weight_lbs=round(rand('NORMAL',189.8,59.1),1);
        *else if gender="F" then weight_lbs=round(rand('NORMAL',162.9,65.6),1);
        if gender="M" then height_inches=round(rand('NORMAL',69.2,6.6),1);
        else if gender="F" then height_inches=round(rand('NORMAL',63.8,6.6),1);
    end;

    if age ge 20 then do;
        if gender="M" then do;
            weight_lbs=max(100,round(rand('NORMAL',189.8+5*(height_inches-69.2),59.1),1));
        end;
        else if gender="F" then do;
            weight_lbs=max(90,round(rand('NORMAL',162.9+5*(height_inches-63.8),65.6),1));
        end;
    end;

proc print data=sim1;
run;

SteveDenham
Jade | Level 19

That is because the true distribution is "one-sided."  Weight and height are not normally distributed, so assuming that they are is what causes the trouble.  There are biological reasons for this.

Also, check your coding for generation of the race/ethnicity variable.  rand('UNIFORM') returns a value between 0 and 1, so cutoffs such as 5.3 or 17.6 can never work as intended.  You may want to replace the code with:

race_ethnicity_rand=100*rand('UNIFORM');

Also, at some point, you will probably want to add some sort of seed control, or else you will get different values every time you run.
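Both suggestions can be sketched together. The example below assumes CALL STREAMINIT as the seed mechanism (the thread does not name one), uses an arbitrary seed value, and treats the cutoffs in the question as cumulative percentages; the data set name race_check is hypothetical.

```sas
/* Sketch: seeded, percentage-scaled uniform draw for the race/ethnicity codes */
data race_check;
    call streaminit(20240101);                    /* arbitrary fixed seed: same draws every run */
    length race_ethnicity $1;
    do id = 1 to 10000;
        race_ethnicity_rand = 100*rand('UNIFORM');    /* now on a 0-100 scale */
        if race_ethnicity_rand le 0.8 then race_ethnicity="A";
        else if race_ethnicity_rand le 5.3 then race_ethnicity="B";
        else if race_ethnicity_rand le 17.6 then race_ethnicity="C";
        else if race_ethnicity_rand le 82.7 then race_ethnicity="D";
        else if race_ethnicity_rand le 98.5 then race_ethnicity="E";
        else race_ethnicity="X";
        output;
    end;
run;

/* Compare the simulated shares with the intended percentages */
proc freq data=race_check;
    tables race_ethnicity;
run;
```

Without the factor of 100, nearly every record would fall into the first two categories, which the PROC FREQ table makes easy to spot.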

Steve Denham

slivingston
Calcite | Level 5

So if it's not "normal," how would you code height/weight?

This might be the seed of my confusion.

ballardw
Super User

If I had a large enough data set of actual values I'd be very tempted to select from that using Proc Surveyselect. Check the CDC website for NHIS or BRFSS datasets.
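A minimal sketch of that resampling idea follows. The data set real_adults is a hypothetical stand-in with made-up values so the step runs on its own; in practice it would be replaced by records from an actual survey file.

```sas
/* Toy stand-in for real survey records; values are invented for illustration only */
data real_adults;
    input gender $ height_inches weight_lbs;
    datalines;
M 70 185
F 64 150
M 68 210
F 62 135
F 66 170
M 72 195
;
run;

/* Draw a simulated sample by resampling whole records with replacement */
proc surveyselect data=real_adults out=sim_sample
                  method=urs     /* unrestricted random sampling (with replacement) */
                  n=20           /* number of simulated records to draw             */
                  seed=12345     /* arbitrary seed for reproducibility              */
                  outhits;       /* write one output row per selection              */
run;
```

Because whole records are drawn, each simulated person keeps a height and weight that actually occurred together, so no negative or implausible combinations can appear.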

SteveDenham
Jade | Level 19

I would code these as lognormally distributed, with the parameters chosen on the log scale to match the mean and standard deviation.  As I said above:

Something like:

lnweight_m=rand('NORMAL', 5.1997, 0.3042);
weight_lbs_m=round(exp(lnweight_m),1);

where the parameters passed to the rand function are the log-scale mean and standard deviation that correspond to the values you currently use (mean 189.8, SD 59.1 for males), not simply the natural logs of those two values.  This might be more useful.  But be warned, the resulting data will be skewed.  No getting around that.

And I mean, there is no getting around the fact that the real data are not normally distributed.  If they were, there would be a nonzero probability of people with negative heights or weights.
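A hedged sketch of a lognormal simulation that keeps the published mean and standard deviation: the log-scale parameters are computed inside the step by moment matching rather than typed in by hand. The male weight values come from the question; the data set name, seed, and sample size are assumptions.

```sas
/* Sketch: lognormal weights whose mean and SD match the target values */
data sim_lognormal;
    call streaminit(12345);                 /* arbitrary seed for reproducibility */
    m = 189.8;  s = 59.1;                   /* target mean and SD (male weight)   */
    sigma = sqrt(log(1 + (s/m)**2));        /* log-scale standard deviation       */
    mu    = log(m) - sigma**2/2;            /* log-scale mean                     */
    do i = 1 to 10000;
        weight_lbs = round(exp(rand('NORMAL', mu, sigma)), 1);   /* always positive */
        output;
    end;
    keep i weight_lbs;
run;

/* The sample mean and SD should land near 189.8 and 59.1; skewness is positive */
proc means data=sim_lognormal n mean std skewness min max;
    var weight_lbs;
run;
```

The same pattern applies to the female weights and to heights by swapping in the corresponding mean and standard deviation.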

Steve Denham
