Solved: Issue with random number generation and how to create random numbers f...

mgrasmussen · Posted 09-23-2021 06:34 AM

Dear SAS experts

I am a relatively new user, but have used Stata for some years.

I would like to be able to generate random numbers uniformily distributed. I want to do this seperately for certain subgroups in my dataset, although it might actually suffice to do it for the whole dataset.

As an example:

data example;
input name $ value;
datalines;
Maria 2
Maria 3
Maria 6
John 2
John 6
Peter 2
;
run;

I would like to randomly select a 'Maria' observation and a 'John' observation (only 1 Peter, no selection needed) such that I am left with only one of the values of 'value'. To the first issue: I thought it would be easy to generate random data for the whole dataset:

proc sql noprint;
select count(*) into :N from example;
quit;

data example (drop=i);
set example;
do i=1 to &N;
uni_ran=rand("Uniform");
output;
end;
run;

But the amount of observations in the dataset goes from 6 to 36 - why?

The second issue that I would like to be able to generate random numbers seperately for 'Maria' and 'John'. In general, how would one go about doing this? I have tried by including a -by- but I cannot get it to work. I imagine I have to somehow confine the do loop to the span of observations in each category, e.g. 3 for Maria and 2 for John, perhaps using some functions which identifies the amount of observations in a category of a certain variable. In the actual dataset I am using there are in many cases >3 observations within each category.

In general I have an interest in finding out how to perform different operations for each subgroup of a dataset without having to qualify the subgroup using fx -if- and -where', i.e. just "do this for each level of var". If there are 1000 levels of categorical variable this would require many lines of code using -if-.

Does someone have a suggestion on how I can resolve the issues mentioned above?

Thank you

PaigeMiller · Posted 09-23-2021 07:24 AM

@mgrasmussen wrote:

Dear SAS experts

I am a relatively new user, but have used Stata for some years.

I would like to be able to generate random numbers uniformily distributed. I want to do this seperately for certain subgroups in my dataset, although it might actually suffice to do it for the whole dataset.

As an example:

data example;
input name $ value;
datalines;
Maria 2
Maria 3
Maria 6
John 2
John 6
Peter 2
;
run;

I would like to randomly select a 'Maria' observation and a 'John' observation (only 1 Peter, no selection needed) such that I am left with only one of the values of 'value'. To the first issue: I thought it would be easy to generate random data for the whole dataset:

proc sql noprint;
select count(*) into :N from example;
quit;

data example (drop=i);
set example;
do i=1 to &N;
uni_ran=rand("Uniform");
output;
end;
run;

But the amount of observations in the dataset goes from 6 to 36 - why?

&n = 6 and you have a loop that goes 1 to &n and each time outputs the record.

The second issue that I would like to be able to generate random numbers seperately for 'Maria' and 'John'. In general, how would one go about doing this? I have tried by including a -by- but I cannot get it to work. I imagine I have to somehow confine the do loop to the span of observations in each category, e.g. 3 for Maria and 2 for John, perhaps using some functions which identifies the amount of observations in a category of a certain variable. In the actual dataset I am using there are in many cases >3 observations within each category.

In general I have an interest in finding out how to perform different operations for each subgroup of a dataset without having to qualify the subgroup using fx -if- and -where', i.e. just "do this for each level of var". If there are 1000 levels of categorical variable this would require many lines of code using -if-.

You don't need separate generation of random variables for Maria and John.

data have1;
    set have;
    uni_ran=rand("Uniform");
run;
/* Find the max random number for each name */
proc sort data=have1;
    by name uni_ran;
run;
data want;
    set have1;
    by name uni_ran;
    if last.name;
run;

If you want to exclude names with only a single record like Peter, then instead of the line if last.name; you can use

if last.name and not first.name;

--
Paige Miller

View solution in original post

PaigeMiller · Posted 09-23-2021 07:24 AM

@mgrasmussen wrote:

Dear SAS experts

I am a relatively new user, but have used Stata for some years.

I would like to be able to generate random numbers uniformily distributed. I want to do this seperately for certain subgroups in my dataset, although it might actually suffice to do it for the whole dataset.

As an example:

data example;
input name $ value;
datalines;
Maria 2
Maria 3
Maria 6
John 2
John 6
Peter 2
;
run;

I would like to randomly select a 'Maria' observation and a 'John' observation (only 1 Peter, no selection needed) such that I am left with only one of the values of 'value'. To the first issue: I thought it would be easy to generate random data for the whole dataset:

proc sql noprint;
select count(*) into :N from example;
quit;

data example (drop=i);
set example;
do i=1 to &N;
uni_ran=rand("Uniform");
output;
end;
run;

But the amount of observations in the dataset goes from 6 to 36 - why?

&n = 6 and you have a loop that goes 1 to &n and each time outputs the record.

The second issue that I would like to be able to generate random numbers seperately for 'Maria' and 'John'. In general, how would one go about doing this? I have tried by including a -by- but I cannot get it to work. I imagine I have to somehow confine the do loop to the span of observations in each category, e.g. 3 for Maria and 2 for John, perhaps using some functions which identifies the amount of observations in a category of a certain variable. In the actual dataset I am using there are in many cases >3 observations within each category.

In general I have an interest in finding out how to perform different operations for each subgroup of a dataset without having to qualify the subgroup using fx -if- and -where', i.e. just "do this for each level of var". If there are 1000 levels of categorical variable this would require many lines of code using -if-.

You don't need separate generation of random variables for Maria and John.

data have1;
    set have;
    uni_ran=rand("Uniform");
run;
/* Find the max random number for each name */
proc sort data=have1;
    by name uni_ran;
run;
data want;
    set have1;
    by name uni_ran;
    if last.name;
run;

If you want to exclude names with only a single record like Peter, then instead of the line if last.name; you can use

if last.name and not first.name;

--
Paige Miller

mgrasmussen · Posted 09-23-2021 07:40 AM

Thank you. Appreciate it.

Issue with random number generation and how to create random numbers for subgroups in data

Re: Issue with random number generation and how to create random numbers for subgroups in data

Re: Issue with random number generation and how to create random numbers for subgroups in data

Re: Issue with random number generation and how to create random numbers for subgroups in data

Issue with random number generation and how to create random numbers for subgroups in data

Re: Issue with random number generation and how to create random numbers for subgroups in data

Re: Issue with random number generation and how to create random numbers for subgroups in data

Re: Issue with random number generation and how to create random numbers for subgroups in data

Registration is open

SAS Training: Just a Click Away