🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Calcite | Level 5

## Distributing missing records randomly by Percentage

Hi,

i have a following sample data set with missing values, for which I want to set certain value by percentage randomly. 50% of the records belong to 'IT' Dept, and 30% belong to 'Business' and remaining 20% belong 'HR' dept. How do i achieve this. Output may vary but overall percentage should match.

 ID Department 1 2 3 4 5 6 7 8 9 10

Want:

 ID Department 1 IT 2 Business 3 IT 4 HR 5 IT 6 HR 7 IT 8 IT 9 Business 10 Business

1 ACCEPTED SOLUTION

Accepted Solutions
Jade | Level 19

## Re: Distributing missing records randomly by Percentage

Hi @sas33,

Here's one of many possible solutions using only Base SAS:

data have;
do ID=1 to 10;
output;
end;
length Department \$8;
Department=' ';
run;

data want;
array _d[10] \$8 _temporary_ (5*'IT' 3*'Business' 2*'HR');
if _n_=1 then do;
_iorc_=31415927; /* random seed */
call ranperm(_iorc_, of _d[*]);
end;
set have;
Department=_d[_n_];
run;

This code could be adapted to a HAVE dataset with variable Department being only partially missing.

If SAS/STAT is available, there are more options, in particular PROC PLAN or PROC SURVEYSELECT. The latter is demonstrated below with a numeric variable Department whose formatted values are "IT", etc.

proc format;
value dept
1='IT'
3='HR'
;
run;

proc surveyselect data=have(keep=id)
groups=(5 3 2)
seed=27182818 out=want;
run;

proc datasets lib=work nolist;
modify want;
rename groupid=Department;
format Department dept.;
quit;

The GROUPS= option requires SAS 9.4. I would prefer this solution if the group sizes were much larger, e.g., 500, 300, 200.

12 REPLIES 12
Jade | Level 19

## Re: Distributing missing records randomly by Percentage

Hi @sas33,

Here's one of many possible solutions using only Base SAS:

data have;
do ID=1 to 10;
output;
end;
length Department \$8;
Department=' ';
run;

data want;
array _d[10] \$8 _temporary_ (5*'IT' 3*'Business' 2*'HR');
if _n_=1 then do;
_iorc_=31415927; /* random seed */
call ranperm(_iorc_, of _d[*]);
end;
set have;
Department=_d[_n_];
run;

This code could be adapted to a HAVE dataset with variable Department being only partially missing.

If SAS/STAT is available, there are more options, in particular PROC PLAN or PROC SURVEYSELECT. The latter is demonstrated below with a numeric variable Department whose formatted values are "IT", etc.

proc format;
value dept
1='IT'
3='HR'
;
run;

proc surveyselect data=have(keep=id)
groups=(5 3 2)
seed=27182818 out=want;
run;

proc datasets lib=work nolist;
modify want;
rename groupid=Department;
format Department dept.;
quit;

The GROUPS= option requires SAS 9.4. I would prefer this solution if the group sizes were much larger, e.g., 500, 300, 200.

Diamond | Level 26

## Re: Distributing missing records randomly by Percentage

The function x=rand('table',...) does exactly what you want.

https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=p0fpeei0opypg8n1b06qe4r040lv.htm...

--
Paige Miller
Ammonite | Level 13

## Re: Distributing missing records randomly by Percentage

One way is to store your department values and the corresponding percentages in a file and then use that file to drive the assignment of the departments randomly:

data have ;
do ID = 1 to 17 ;
output ;
end ;
run ;

data department_percent ;
input department :\$8. percent ;
cards ;
IT       50
HR       20
;
run ;

data want (drop = r percent) ;
dcl hash h (ordered:"a") ;
h.definekey ("r") ;
h.definedata ("department") ;
h.definedone () ;
if 0 then set have nobs = n ;
do until (z) ;
set department_percent end = z ;
do _n_ = 1 to ceil (divide (n * percent, 100)) ;
r = ranuni (7) ;
end ;
end ;
dcl hiter hi ("h") ;
do while (hi.next() = 0) ;
set have ;
output ;
end ;
stop ;
run ;

Kind regards

Paul D.

Meteorite | Level 14

## Re: Distributing missing records randomly by Percentage

data have;
length department \$8;
do i=1 to 10;
flag=rand('TABLE',0.5,0.3,0.2);
if flag=1 then department="IT";
else if flag=2 then department ="Business";
else department="HR";
output;
end;
drop i;
run;

proc freq data=have order=freq;
table department;
run;
Diamond | Level 26

## Re: Distributing missing records randomly by Percentage

@ed_sas_member wrote:
data have;
length department \$8;
do i=1 to 10;
flag=rand('TABLE',0.5,0.3,0.2);
if flag=1 then department="IT";
else if flag=2 then department ="Business";
else department="HR";
output;
end;
drop i;
run;

I would use custom formats here instead of DEPARTMENT="Business", etc.

--
Paige Miller
Super User

## Re: Distributing missing records randomly by Percentage

As Paige said , use rand('table',........);

data want;
array x{3} \$ 10 ('IT' 'Business' 'HR');
call streaminit(1234);
do id=1 to 10;
Department=x{rand('table',0.5,0.3)};
output;
end;
drop x1-x3;
run;
Jade | Level 19

## Re: Distributing missing records randomly by Percentage

Just to avoid the wrong impression that all solutions presented so far could be used exchangeably:

You need to decide whether you want to randomly assign 'IT', 'Business' and 'HR' either

A) in fixed proportions (50%:30%:20%)

or

B) in random proportions with expectations 50%, 30% and 20%, respectively (multinomial distribution with parameters n, 0.5, 0.3, 0.2)

to the n IDs with missing Department.

Note that applying B to your example (i.e. n=10) means that there's a probability of 91% (!) that at least two of the three percentages will not match their expectations. In particular, it would be more likely (p=0.136) that at least one of the three departments does not occur at all than to achieve the intended percentages 50%, 30% and 20% (p=0.085).

The rand('table',...) approach is ideal for specification B. (Of course, you could try and repeat this type of random assignment again and again until you're satisfied with the observed percentages.)

Calcite | Level 5

## Re: Distributing missing records randomly by Percentage

Thank you everyone.

While most of the solutions seems working as expected, i am leaning towards rand('table',...) function.

i am also looking to see other columns in the result data set along with ID and Department. looks like output statement only puts whatever is in the do loop.  what change should i make

Jade | Level 19

## Re: Distributing missing records randomly by Percentage

@sas33 wrote:

While most of the solutions seems working as expected ...

Well, three solutions work according to specification A and two according to specification B (see definitions in my previous post).

... i am leaning towards rand('table',...) function.

That is, specification B, right?

i am also looking to see other columns in the result data set along with ID and Department. looks like output statement only puts whatever is in the do loop.  what change should i make

If those "other columns" already exist in dataset HAVE and you use a DATA step with a SET HAVE statement to create dataset WANT, then all columns will be present in dataset WANT by default. Moreover, you won't necessarily need a DO loop nor an OUTPUT statement.

Calcite | Level 5

## Re: Distributing missing records randomly by Percentage

Here is the code i am trying to use

data want;
set have;
array x{3} \$ 10 ('IT' 'Business' 'HR');
call streaminit(1234);
Dept=x{rand('table',0.6,0.35)};
output;
drop x1-x3;
run;

ods graphics on;
proc freq data=want;
table dept / plots=FreqPlot(scale=percent);
run;

 Dept Frequency Percent CumulativeFrequency CumulativePercent Business 343 33.83 343 33.83 HR 41 4.04 384 37.87 IT 630 62.13 1014 100.00

i think i wanted to see exact percentages. For example in above case i want to see the distribution as 60%, 35% and 5%, however freq table showing little variance for each dept.

Diamond | Level 26

## Re: Distributing missing records randomly by Percentage

Exact percentages cannot be done completely at random, and if the number of total observations is not a multiple of 20, you can't get your exact percentages at all.

--
Paige Miller
Jade | Level 19

## Re: Distributing missing records randomly by Percentage

@sas33: Your code is correct for specification B.

@sas33 wrote:

i think i wanted to see exact percentages.

This sounds more like specification A, hence @hashman's or either of my two solutions.

@sas33 wrote:

For example in above case i want to see the distribution as 60%, 35% and 5%, however freq table showing little variance for each dept.

That's what I meant by "random proportions." With n=1014 the closest you can get is, of course, 608x 'IT' (59.96%), 355x 'Business' (35.01%) and 51x 'HR' (5.03%). The probability of exactly attaining these values with the rand('table',...) method (i.e. specification B) is as tiny as 0.0015*, so the observed "little variance" is anything but surprising.

(With @hashman's solution you might occasionally get 609x 'IT' due the CEIL function and hence one 'Business' or 'HR' less, depending on the random seed, but his example random seed 7 happens to yield the "optimum" values 608, 355 and 51 for n=1014.)

* calculated as follows:

p=exp(lfact(1014)-lfact(608)-lfact(355)-lfact(51)+608*log(.6)+355*log(.35)+51*log(.05));
Discussion stats
• 12 replies
• 1418 views
• 3 likes
• 6 in conversation