Team,
I have a dataset of 5000 rows. I want to create a new column (say x) that generates random integers so that the sum of those integers (in the 5000 rows) comes to 25000. How do I achieve this in SAS?
Regards,
Suhrit
I don't see integers in the Excel link you provide.
This works:
data HAVE;
do VAL=1 to 5000;
output;
end;
run;
data WANT;
set HAVE end=LASTOBS;
array R [5000] _temporary_;
if _N_=1 then do;
do I=1 to 5000;
R[I]=int(rand('UNIFORM',10,0));
end;
DIFF=25000-sum(of R[*]);
do I=1 to DIFF;
R[int(rand('UNIFORM',5000,1))]+(DIFF>0);
end;
VET+sum(of R[*]);
end;
RAN=R[_N_];
run;
proc print data=WANT(firstobs=4990 keep=VAL RAN VET) noobs;
run;
VAL | VET | RAN |
---|---|---|
4990 | 25000 | 5 |
4991 | 25000 | 10 |
4992 | 25000 | 8 |
4993 | 25000 | 6 |
4994 | 25000 | 9 |
4995 | 25000 | 4 |
4996 | 25000 | 2 |
4997 | 25000 | 8 |
4998 | 25000 | 9 |
4999 | 25000 | 1 |
5000 | 25000 | 1 |
@Chugvenk wrote:
Team,
I have a dataset of 5000 rows. I want to create a new column (say x) that generates random integers so that the sum of those integers (in the 5000 rows) comes to 25000. How do I achieve this in SAS?
Regards,
Suhrit
Do they have to be positive integers?
What test for randomness do they have to pass?
And the purpose of this exercise is what?
Yes, they need to be positive integers.
Purpose: For creating a mockup for reporting, until we actually get the data from the warehouse.
Test for Randomness: Honestly, I'm not sure what this means. It should just be 5000 random numbers summing to 25000. I can do this in excel, I'm new to SAS, so I am hoping there are functions for this.
To create random numbers or unique id's (i.e. a sequence number) is no problem. It's the "summing up to 25000" requirement where one would need to come up with some custom logic and then implement the code for it..
Not sure how you would implement such a "summing up to 25000" easily in Excel. Can you explain?
Patrick,
It can be done in excel, I have done it before. I need to do this in SAS.
And yes, we would need a custom logic, I was asking help with that since I am new to SAS.
Regards,
Suhrit
I don't see integers in the Excel link you provide.
This works:
data HAVE;
do VAL=1 to 5000;
output;
end;
run;
data WANT;
set HAVE end=LASTOBS;
array R [5000] _temporary_;
if _N_=1 then do;
do I=1 to 5000;
R[I]=int(rand('UNIFORM',10,0));
end;
DIFF=25000-sum(of R[*]);
do I=1 to DIFF;
R[int(rand('UNIFORM',5000,1))]+(DIFF>0);
end;
VET+sum(of R[*]);
end;
RAN=R[_N_];
run;
proc print data=WANT(firstobs=4990 keep=VAL RAN VET) noobs;
run;
VAL | VET | RAN |
---|---|---|
4990 | 25000 | 5 |
4991 | 25000 | 10 |
4992 | 25000 | 8 |
4993 | 25000 | 6 |
4994 | 25000 | 9 |
4995 | 25000 | 4 |
4996 | 25000 | 2 |
4997 | 25000 | 8 |
4998 | 25000 | 9 |
4999 | 25000 | 1 |
5000 | 25000 | 1 |
@ChrisNZ Thank you so much. This absolutely works!
I had no idea this code would be so complex. You are awesome! Thanks again!
@Chugvenk wrote:
Yes, they need to be positive integers.
Purpose: For creating a mockup for reporting, until we actually get the data from the warehouse.
Test for Randomness: Honestly, I'm not sure what this means. It should just be 5000 random numbers summing to 25000. I can do this in excel, I'm new to SAS, so I am hoping there are functions for this.
There are statistical tests that provide information as to how likely your data is to be "random" just like a Ttest can be used to see if the means of two sets are the same (or not).
50 random integers summing up to 250:
%let sum=250;
%let nb=50;
data want;
call streaminit(976785676);
array r {0:&nb} _temporary_;
r{0} = 0;
do i = 1 to &nb;
r{i} = r{i-1} + rand("uniform");
end;
do i = 1 to &nb;
r{i} = round(r{i}*&sum/r{&nb});
number = r{i} - r{i-1};
output;
end;
keep number;
run;
The solution I proposed above DID work, I tested it. If it didn't work for you, I'd like to know how.
the dataset you created was a bunch of numbers that went uptil 250 (250 was the last number), which was not quite what I was looking for. I was looking for a column of random numbers and the total of that should have been 250. Again, maybe I didn't explain it correctly, so my bad. Thanks for getting back to me! Have a good day!
My own testing goes like this:
NOTE: AUTOEXEC processing completed. 1 %let sum=250; 2 %let nb=50; 3 4 data want; 5 call streaminit(976785676); 6 array r {0:&nb} _temporary_; 7 r{0} = 0; 8 do i = 1 to &nb; 9 r{i} = r{i-1} + rand("uniform"); 10 end; 11 do i = 1 to &nb; 12 r{i} = round(r{i}*&sum/r{&nb}); 13 number = r{i} - r{i-1}; 14 output; 15 end; 16 keep number; 17 run; NOTE: The data set WORK.WANT has 50 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.90 seconds cpu time 0.15 seconds 18 19 data _null_; 20 do j = 1 to 5; 21 do i = 1 to 10; 22 set want; 23 put number 2. @; 24 end; 25 put; 26 end; 27 run; 7 4 5 5 3 7 5 7 2 5 6 2 3 4 8 5 4 0 7 4 8 2 5 8 7 3 7 6 8 5 4 6 2 8 1 4 7 2 4 3 4 5 9 9 4 2 7 6 3 8 NOTE: There were 50 observations read from the data set WORK.WANT. NOTE: DATA statement used (Total process time): real time 0.03 seconds cpu time 0.03 seconds
with the sum of these 50 numbers being 250.
Hi @Chugvenk,
You asked for random numbers, but your specifications (5000 positive integers with sum 25000) do not uniquely determine a distribution. That's why different solutions have been presented. Below is an approach that allows you to specify a suitable distribution. This is likely overkill for your purpose, but later readers of this thread may have other ideas what to do with those random integers.
Obviously, a "suitable" distribution must have an expected value (close to) 25000/5000=5. So, the uniform distribution on {1, 2, ..., 9} is a good candidate. If you want this (and you have SAS 9.4M5 or later), specify
%let dist=rand('integer', 9);
But maybe you would like to see at least a few two-digit integers. Then, how about a binomial distribution with n=16, p=1/4 and adding 1 to the random variates to avoid zeros and to adjust the expected value? Check out the results of the code below.
%let sum=25000;
%let nb=5000;
%let dist=rand('binom', 0.25, 16)+1; /* suitable distribution */
%let max=17; /* an upper bound for the above distribution */
%let seed=271828182;
%let maxiter=10000;
/* Create one-row dataset with frequencies n1, ..., n&max for integers 1, ..., &max */
data frq(keep=n:);
call streaminit(&seed);
array tn[&max] _temporary_;
array n[&max];
d=1e15;
do _n_=1 to &maxiter until(s=&sum);
call missing(of tn[*]);
do j=1 to &nb;
tn[&dist]+1;
end;
ts=0;
do k=1 to &max;
tn[k]+0;
ts+k*tn[k];
end;
td=abs(&sum-ts);
if td<d then do;
d=td;
s=ts;
do k=1 to &max;
n[k]=tn[k];
end;
end;
end;
put "The optimum sum (found in <=&maxiter iterations) is " s +(-1) '.';
if s ne &sum then put 'Sorry!';
run;
/* Create sorted dataset (view) with &nb integers X */
data rndv(keep=x) / view=rndv;
set frq;
array n[&max];
do x=1 to &max;
do i=1 to n[x];
output;
end;
end;
run;
/* Sort integers randomly */
proc surveyselect data=rndv rate=1 seed=&seed out=rnd outrandom noprint;
run;
/* Final checks -- optional */
proc freq data=rnd;
tables x;
run;
proc means data=rnd sum;
var x;
run;
/* Combine HAVE dataset with random integers */
data want;
merge have rnd;
run;
@FreelanceReinh Thanks so much. I tested this and it works perfectly! Awesome!
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.