Solved: Re: Distributing missing records randomly by Percentage

sas33 · Posted 11-17-2019 02:12 PM

Hi,

i have a following sample data set with missing values, for which I want to set certain value by percentage randomly. 50% of the records belong to 'IT' Dept, and 30% belong to 'Business' and remaining 20% belong 'HR' dept. How do i achieve this. Output may vary but overall percentage should match.

ID	Department
1
2
3
4
5
6
7
8
9
10

Want:

ID	Department
1	IT
2	Business
3	IT
4	HR
5	IT
6	HR
7	IT
8	IT
9	Business
10	Business

FreelanceReinh · Posted 11-17-2019 05:57 PM

Hi @sas33,

Here's one of many possible solutions using only Base SAS:

data have;
do ID=1 to 10;
  output;
end;
length Department $8;
Department=' ';
run;

data want;
array _d[10] $8 _temporary_ (5*'IT' 3*'Business' 2*'HR');
if _n_=1 then do;
  _iorc_=31415927; /* random seed */
  call ranperm(_iorc_, of _d[*]);
end;
set have;
Department=_d[_n_];
run;

This code could be adapted to a HAVE dataset with variable Department being only partially missing.

If SAS/STAT is available, there are more options, in particular PROC PLAN or PROC SURVEYSELECT. The latter is demonstrated below with a numeric variable Department whose formatted values are "IT", etc.

proc format;
value dept
1='IT'
2='Business'
3='HR'
;
run;

proc surveyselect data=have(keep=id)
groups=(5 3 2)
seed=27182818 out=want;
run;

proc datasets lib=work nolist;
modify want;
rename groupid=Department;
format Department dept.;
quit;

The GROUPS= option requires SAS 9.4. I would prefer this solution if the group sizes were much larger, e.g., 500, 300, 200.

View solution in original post

FreelanceReinh · Posted 11-17-2019 05:57 PM

Hi @sas33,

Here's one of many possible solutions using only Base SAS:

data have;
do ID=1 to 10;
  output;
end;
length Department $8;
Department=' ';
run;

data want;
array _d[10] $8 _temporary_ (5*'IT' 3*'Business' 2*'HR');
if _n_=1 then do;
  _iorc_=31415927; /* random seed */
  call ranperm(_iorc_, of _d[*]);
end;
set have;
Department=_d[_n_];
run;

This code could be adapted to a HAVE dataset with variable Department being only partially missing.

If SAS/STAT is available, there are more options, in particular PROC PLAN or PROC SURVEYSELECT. The latter is demonstrated below with a numeric variable Department whose formatted values are "IT", etc.

proc format;
value dept
1='IT'
2='Business'
3='HR'
;
run;

proc surveyselect data=have(keep=id)
groups=(5 3 2)
seed=27182818 out=want;
run;

proc datasets lib=work nolist;
modify want;
rename groupid=Department;
format Department dept.;
quit;

The GROUPS= option requires SAS 9.4. I would prefer this solution if the group sizes were much larger, e.g., 500, 300, 200.

PaigeMiller · Posted 11-17-2019 06:19 PM

The function x=rand('table',...) does exactly what you want.

https://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=p0fpeei0opypg8n1b06qe4r040lv.htm...

--
Paige Miller

hashman · Posted 11-18-2019 12:46 AM

@sas33:

One way is to store your department values and the corresponding percentages in a file and then use that file to drive the assignment of the departments randomly:

data have ;                                                                                                                             
  do ID = 1 to 17 ;                                                                                                                     
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;                                                                                                                                   
                                                                                                                                        
data department_percent ;                                                                                                               
  input department :$8. percent ;                                                                                                       
  cards ;                                                                                                                               
IT       50                                                                                                                             
Business 30                                                                                                                             
HR       20                                                                                                                             
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = r percent) ;                                                                                                          
  dcl hash h (ordered:"a") ;                                                                                                            
  h.definekey ("r") ;                                                                                                                   
  h.definedata ("department") ;                                                                                                         
  h.definedone () ;                                                                                                                     
  if 0 then set have nobs = n ;                                                                                                         
  do until (z) ;                                                                                                                        
    set department_percent end = z ;                                                                                                    
    do _n_ = 1 to ceil (divide (n * percent, 100)) ;                                                                                    
      r = ranuni (7) ;                                                                                                                  
      h.add() ;                                                                                                                         
    end ;                                                                                                                               
  end ;                                                                                                                                 
  dcl hiter hi ("h") ;                                                                                                                  
  do while (hi.next() = 0) ;                                                                                                            
    set have ;                                                                                                                          
    output ;                                                                                                                            
  end ;                                                                                                                                 
  stop ;                                                                                                                                
run ;

Kind regards

Paul D.

ed_sas_member · Posted 11-18-2019 03:10 AM

data have;
	length department $8;
	do i=1 to 10;
		flag=rand('TABLE',0.5,0.3,0.2);
		if flag=1 then department="IT";
		else if flag=2 then department ="Business";
		else department="HR";
		output;
	end;
	drop i;
run;

proc freq data=have order=freq;
	table department;
run;

PaigeMiller · Posted 11-18-2019 06:46 AM

@ed_sas_member wrote:

data have;
	length department $8;
	do i=1 to 10;
		flag=rand('TABLE',0.5,0.3,0.2);
		if flag=1 then department="IT";
		else if flag=2 then department ="Business";
		else department="HR";
		output;
	end;
	drop i;
run;

I would use custom formats here instead of DEPARTMENT="Business", etc.

--
Paige Miller

Ksharp · Posted 11-18-2019 07:53 AM

As Paige said , use rand('table',........);

data want;
 array x{3} $ 10 ('IT' 'Business' 'HR');
 call streaminit(1234);
 do id=1 to 10;
  Department=x{rand('table',0.5,0.3)};
  output;
 end;
 drop x1-x3;
run;

FreelanceReinh · Posted 11-18-2019 08:59 AM

@sas33:

Just to avoid the wrong impression that all solutions presented so far could be used exchangeably:

You need to decide whether you want to randomly assign 'IT', 'Business' and 'HR' either

A) in fixed proportions (50%:30%:20%)

or

B) in random proportions with expectations 50%, 30% and 20%, respectively (multinomial distribution with parameters n, 0.5, 0.3, 0.2)

to the n IDs with missing Department.

Note that applying B to your example (i.e. n=10) means that there's a probability of 91% (!) that at least two of the three percentages will not match their expectations. In particular, it would be more likely (p=0.136) that at least one of the three departments does not occur at all than to achieve the intended percentages 50%, 30% and 20% (p=0.085).

The rand('table',...) approach is ideal for specification B. (Of course, you could try and repeat this type of random assignment again and again until you're satisfied with the observed percentages.)

sas33 · Posted 11-18-2019 09:47 AM

Thank you everyone.

While most of the solutions seems working as expected, i am leaning towards rand('table',...) function.

i am also looking to see other columns in the result data set along with ID and Department. looks like output statement only puts whatever is in the do loop. what change should i make

FreelanceReinh · Posted 11-18-2019 10:23 AM

@sas33 wrote:

While most of the solutions seems working as expected ...

Well, three solutions work according to specification A and two according to specification B (see definitions in my previous post).

... i am leaning towards rand('table',...) function.

That is, specification B, right?

i am also looking to see other columns in the result data set along with ID and Department. looks like output statement only puts whatever is in the do loop. what change should i make

If those "other columns" already exist in dataset HAVE and you use a DATA step with a SET HAVE statement to create dataset WANT, then all columns will be present in dataset WANT by default. Moreover, you won't necessarily need a DO loop nor an OUTPUT statement.

sas33 · Posted 11-18-2019 12:23 PM

Here is the code i am trying to use

data want;
set have;
array x{3} $ 10 ('IT' 'Business' 'HR');
call streaminit(1234);
Dept=x{rand('table',0.6,0.35)};
output;
drop x1-x3;
run;

ods graphics on;
proc freq data=want;
table dept / plots=FreqPlot(scale=percent);
run;

Dept	Frequency	Percent	Cumulative Frequency	Cumulative Percent
Business	343	33.83	343	33.83
HR	41	4.04	384	37.87
IT	630	62.13	1014	100.00

i think i wanted to see exact percentages. For example in above case i want to see the distribution as 60%, 35% and 5%, however freq table showing little variance for each dept.

PaigeMiller · Posted 11-18-2019 12:49 PM

Exact percentages cannot be done completely at random, and if the number of total observations is not a multiple of 20, you can't get your exact percentages at all.

--
Paige Miller

FreelanceReinh · Posted 11-18-2019 01:35 PM

@sas33: Your code is correct for specification B.

@sas33 wrote:

i think i wanted to see exact percentages.

This sounds more like specification A, hence @hashman's or either of my two solutions.

@sas33 wrote:

For example in above case i want to see the distribution as 60%, 35% and 5%, however freq table showing little variance for each dept.

That's what I meant by "random proportions." With n=1014 the closest you can get is, of course, 608x 'IT' (59.96%), 355x 'Business' (35.01%) and 51x 'HR' (5.03%). The probability of exactly attaining these values with the rand('table',...) method (i.e. specification B) is as tiny as 0.0015*, so the observed "little variance" is anything but surprising.

(With @hashman's solution you might occasionally get 609x 'IT' due the CEIL function and hence one 'Business' or 'HR' less, depending on the random seed, but his example random seed 7 happens to yield the "optimum" values 608, 355 and 51 for n=1014.)

* calculated as follows:

p=exp(lfact(1014)-lfact(608)-lfact(355)-lfact(51)+608*log(.6)+355*log(.35)+51*log(.05));

Classroom Training Available!