Hi,
i have a following sample data set with missing values, for which I want to set certain value by percentage randomly. 50% of the records belong to 'IT' Dept, and 30% belong to 'Business' and remaining 20% belong 'HR' dept. How do i achieve this. Output may vary but overall percentage should match.
ID | Department |
1 | |
2 | |
3 | |
4 | |
5 | |
6 | |
7 | |
8 | |
9 | |
10 |
Want:
ID | Department |
1 | IT |
2 | Business |
3 | IT |
4 | HR |
5 | IT |
6 | HR |
7 | IT |
8 | IT |
9 | Business |
10 | Business |
Hi @sas33,
Here's one of many possible solutions using only Base SAS:
data have;
do ID=1 to 10;
output;
end;
length Department $8;
Department=' ';
run;
data want;
array _d[10] $8 _temporary_ (5*'IT' 3*'Business' 2*'HR');
if _n_=1 then do;
_iorc_=31415927; /* random seed */
call ranperm(_iorc_, of _d[*]);
end;
set have;
Department=_d[_n_];
run;
This code could be adapted to a HAVE dataset with variable Department being only partially missing.
If SAS/STAT is available, there are more options, in particular PROC PLAN or PROC SURVEYSELECT. The latter is demonstrated below with a numeric variable Department whose formatted values are "IT", etc.
proc format;
value dept
1='IT'
2='Business'
3='HR'
;
run;
proc surveyselect data=have(keep=id)
groups=(5 3 2)
seed=27182818 out=want;
run;
proc datasets lib=work nolist;
modify want;
rename groupid=Department;
format Department dept.;
quit;
The GROUPS= option requires SAS 9.4. I would prefer this solution if the group sizes were much larger, e.g., 500, 300, 200.
Hi @sas33,
Here's one of many possible solutions using only Base SAS:
data have;
do ID=1 to 10;
output;
end;
length Department $8;
Department=' ';
run;
data want;
array _d[10] $8 _temporary_ (5*'IT' 3*'Business' 2*'HR');
if _n_=1 then do;
_iorc_=31415927; /* random seed */
call ranperm(_iorc_, of _d[*]);
end;
set have;
Department=_d[_n_];
run;
This code could be adapted to a HAVE dataset with variable Department being only partially missing.
If SAS/STAT is available, there are more options, in particular PROC PLAN or PROC SURVEYSELECT. The latter is demonstrated below with a numeric variable Department whose formatted values are "IT", etc.
proc format;
value dept
1='IT'
2='Business'
3='HR'
;
run;
proc surveyselect data=have(keep=id)
groups=(5 3 2)
seed=27182818 out=want;
run;
proc datasets lib=work nolist;
modify want;
rename groupid=Department;
format Department dept.;
quit;
The GROUPS= option requires SAS 9.4. I would prefer this solution if the group sizes were much larger, e.g., 500, 300, 200.
The function x=rand('table',...) does exactly what you want.
One way is to store your department values and the corresponding percentages in a file and then use that file to drive the assignment of the departments randomly:
data have ;
do ID = 1 to 17 ;
output ;
end ;
run ;
data department_percent ;
input department :$8. percent ;
cards ;
IT 50
Business 30
HR 20
;
run ;
data want (drop = r percent) ;
dcl hash h (ordered:"a") ;
h.definekey ("r") ;
h.definedata ("department") ;
h.definedone () ;
if 0 then set have nobs = n ;
do until (z) ;
set department_percent end = z ;
do _n_ = 1 to ceil (divide (n * percent, 100)) ;
r = ranuni (7) ;
h.add() ;
end ;
end ;
dcl hiter hi ("h") ;
do while (hi.next() = 0) ;
set have ;
output ;
end ;
stop ;
run ;
Kind regards
Paul D.
data have;
length department $8;
do i=1 to 10;
flag=rand('TABLE',0.5,0.3,0.2);
if flag=1 then department="IT";
else if flag=2 then department ="Business";
else department="HR";
output;
end;
drop i;
run;
proc freq data=have order=freq;
table department;
run;
@ed_sas_member wrote:
data have; length department $8; do i=1 to 10; flag=rand('TABLE',0.5,0.3,0.2); if flag=1 then department="IT"; else if flag=2 then department ="Business"; else department="HR"; output; end; drop i; run;
I would use custom formats here instead of DEPARTMENT="Business", etc.
As Paige said , use rand('table',........);
data want;
array x{3} $ 10 ('IT' 'Business' 'HR');
call streaminit(1234);
do id=1 to 10;
Department=x{rand('table',0.5,0.3)};
output;
end;
drop x1-x3;
run;
Just to avoid the wrong impression that all solutions presented so far could be used exchangeably:
You need to decide whether you want to randomly assign 'IT', 'Business' and 'HR' either
A) in fixed proportions (50%:30%:20%)
or
B) in random proportions with expectations 50%, 30% and 20%, respectively (multinomial distribution with parameters n, 0.5, 0.3, 0.2)
to the n IDs with missing Department.
Note that applying B to your example (i.e. n=10) means that there's a probability of 91% (!) that at least two of the three percentages will not match their expectations. In particular, it would be more likely (p=0.136) that at least one of the three departments does not occur at all than to achieve the intended percentages 50%, 30% and 20% (p=0.085).
The rand('table',...) approach is ideal for specification B. (Of course, you could try and repeat this type of random assignment again and again until you're satisfied with the observed percentages.)
Thank you everyone.
While most of the solutions seems working as expected, i am leaning towards rand('table',...) function.
i am also looking to see other columns in the result data set along with ID and Department. looks like output statement only puts whatever is in the do loop. what change should i make
@sas33 wrote:
While most of the solutions seems working as expected ...
Well, three solutions work according to specification A and two according to specification B (see definitions in my previous post).
... i am leaning towards rand('table',...) function.
That is, specification B, right?
i am also looking to see other columns in the result data set along with ID and Department. looks like output statement only puts whatever is in the do loop. what change should i make
If those "other columns" already exist in dataset HAVE and you use a DATA step with a SET HAVE statement to create dataset WANT, then all columns will be present in dataset WANT by default. Moreover, you won't necessarily need a DO loop nor an OUTPUT statement.
Here is the code i am trying to use
data want;
set have;
array x{3} $ 10 ('IT' 'Business' 'HR');
call streaminit(1234);
Dept=x{rand('table',0.6,0.35)};
output;
drop x1-x3;
run;
ods graphics on;
proc freq data=want;
table dept / plots=FreqPlot(scale=percent);
run;
Dept | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
Business | 343 | 33.83 | 343 | 33.83 |
HR | 41 | 4.04 | 384 | 37.87 |
IT | 630 | 62.13 | 1014 | 100.00 |
i think i wanted to see exact percentages. For example in above case i want to see the distribution as 60%, 35% and 5%, however freq table showing little variance for each dept.
Exact percentages cannot be done completely at random, and if the number of total observations is not a multiple of 20, you can't get your exact percentages at all.
@sas33: Your code is correct for specification B.
@sas33 wrote:
i think i wanted to see exact percentages.
This sounds more like specification A, hence @hashman's or either of my two solutions.
@sas33 wrote:
For example in above case i want to see the distribution as 60%, 35% and 5%, however freq table showing little variance for each dept.
That's what I meant by "random proportions." With n=1014 the closest you can get is, of course, 608x 'IT' (59.96%), 355x 'Business' (35.01%) and 51x 'HR' (5.03%). The probability of exactly attaining these values with the rand('table',...) method (i.e. specification B) is as tiny as 0.0015*, so the observed "little variance" is anything but surprising.
(With @hashman's solution you might occasionally get 609x 'IT' due the CEIL function and hence one 'Business' or 'HR' less, depending on the random seed, but his example random seed 7 happens to yield the "optimum" values 608, 355 and 51 for n=1014.)
* calculated as follows:
p=exp(lfact(1014)-lfact(608)-lfact(355)-lfact(51)+608*log(.6)+355*log(.35)+51*log(.05));
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.