BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sas33
Calcite | Level 5

Hi, 

 

i have a following sample data set with missing values, for which I want to set certain value by percentage randomly. 50% of the records belong to 'IT' Dept, and 30% belong to 'Business' and remaining 20% belong 'HR' dept. How do i achieve this. Output may vary but overall percentage should match.

 

IDDepartment
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 

 

Want:

 

IDDepartment
1IT
2Business
3IT
4HR
5IT
6HR
7IT
8IT
9Business
10Business

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @sas33,

 

Here's one of many possible solutions using only Base SAS:

data have;
do ID=1 to 10;
  output;
end;
length Department $8;
Department=' ';
run;

data want;
array _d[10] $8 _temporary_ (5*'IT' 3*'Business' 2*'HR');
if _n_=1 then do;
  _iorc_=31415927; /* random seed */
  call ranperm(_iorc_, of _d[*]);
end;
set have;
Department=_d[_n_];
run;

This code could be adapted to a HAVE dataset with variable Department being only partially missing.

 

If SAS/STAT is available, there are more options, in particular PROC PLAN or PROC SURVEYSELECT. The latter is demonstrated below with a numeric variable Department whose formatted values are "IT", etc.

proc format;
value dept
1='IT'
2='Business'
3='HR'
;
run;

proc surveyselect data=have(keep=id)
groups=(5 3 2)
seed=27182818 out=want;
run;

proc datasets lib=work nolist;
modify want;
rename groupid=Department;
format Department dept.;
quit;

The GROUPS= option requires SAS 9.4. I would prefer this solution if the group sizes were much larger, e.g., 500, 300, 200.

 

 

View solution in original post

12 REPLIES 12
FreelanceReinh
Jade | Level 19

Hi @sas33,

 

Here's one of many possible solutions using only Base SAS:

data have;
do ID=1 to 10;
  output;
end;
length Department $8;
Department=' ';
run;

data want;
array _d[10] $8 _temporary_ (5*'IT' 3*'Business' 2*'HR');
if _n_=1 then do;
  _iorc_=31415927; /* random seed */
  call ranperm(_iorc_, of _d[*]);
end;
set have;
Department=_d[_n_];
run;

This code could be adapted to a HAVE dataset with variable Department being only partially missing.

 

If SAS/STAT is available, there are more options, in particular PROC PLAN or PROC SURVEYSELECT. The latter is demonstrated below with a numeric variable Department whose formatted values are "IT", etc.

proc format;
value dept
1='IT'
2='Business'
3='HR'
;
run;

proc surveyselect data=have(keep=id)
groups=(5 3 2)
seed=27182818 out=want;
run;

proc datasets lib=work nolist;
modify want;
rename groupid=Department;
format Department dept.;
quit;

The GROUPS= option requires SAS 9.4. I would prefer this solution if the group sizes were much larger, e.g., 500, 300, 200.

 

 

hashman
Ammonite | Level 13

@sas33:

One way is to store your department values and the corresponding percentages in a file and then use that file to drive the assignment of the departments randomly:

data have ;                                                                                                                             
  do ID = 1 to 17 ;                                                                                                                     
    output ;                                                                                                                            
  end ;                                                                                                                                 
run ;                                                                                                                                   
                                                                                                                                        
data department_percent ;                                                                                                               
  input department :$8. percent ;                                                                                                       
  cards ;                                                                                                                               
IT       50                                                                                                                             
Business 30                                                                                                                             
HR       20                                                                                                                             
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = r percent) ;                                                                                                          
  dcl hash h (ordered:"a") ;                                                                                                            
  h.definekey ("r") ;                                                                                                                   
  h.definedata ("department") ;                                                                                                         
  h.definedone () ;                                                                                                                     
  if 0 then set have nobs = n ;                                                                                                         
  do until (z) ;                                                                                                                        
    set department_percent end = z ;                                                                                                    
    do _n_ = 1 to ceil (divide (n * percent, 100)) ;                                                                                    
      r = ranuni (7) ;                                                                                                                  
      h.add() ;                                                                                                                         
    end ;                                                                                                                               
  end ;                                                                                                                                 
  dcl hiter hi ("h") ;                                                                                                                  
  do while (hi.next() = 0) ;                                                                                                            
    set have ;                                                                                                                          
    output ;                                                                                                                            
  end ;                                                                                                                                 
  stop ;                                                                                                                                
run ; 

Kind regards

Paul D. 

ed_sas_member
Meteorite | Level 14
data have;
	length department $8;
	do i=1 to 10;
		flag=rand('TABLE',0.5,0.3,0.2);
		if flag=1 then department="IT";
		else if flag=2 then department ="Business";
		else department="HR";
		output;
	end;
	drop i;
run;

proc freq data=have order=freq;
	table department;
run;
PaigeMiller
Diamond | Level 26

@ed_sas_member wrote:
data have;
	length department $8;
	do i=1 to 10;
		flag=rand('TABLE',0.5,0.3,0.2);
		if flag=1 then department="IT";
		else if flag=2 then department ="Business";
		else department="HR";
		output;
	end;
	drop i;
run;

I would use custom formats here instead of DEPARTMENT="Business", etc.

--
Paige Miller
Ksharp
Super User

As Paige said , use rand('table',........);

 

data want;
 array x{3} $ 10 ('IT' 'Business' 'HR');
 call streaminit(1234);
 do id=1 to 10;
  Department=x{rand('table',0.5,0.3)};
  output;
 end;
 drop x1-x3;
run;
FreelanceReinh
Jade | Level 19

@sas33:

Just to avoid the wrong impression that all solutions presented so far could be used exchangeably:

 

You need to decide whether you want to randomly assign 'IT', 'Business' and 'HR' either

 

A) in fixed proportions (50%:30%:20%) 

 

or

 

B) in random proportions with expectations 50%, 30% and 20%, respectively (multinomial distribution with parameters n, 0.5, 0.3, 0.2)

 

to the n IDs with missing Department.

 

Note that applying B to your example (i.e. n=10) means that there's a probability of 91% (!) that at least two of the three percentages will not match their expectations. In particular, it would be more likely (p=0.136) that at least one of the three departments does not occur at all than to achieve the intended percentages 50%, 30% and 20% (p=0.085).

 

The rand('table',...) approach is ideal for specification B. (Of course, you could try and repeat this type of random assignment again and again until you're satisfied with the observed percentages.)

sas33
Calcite | Level 5

Thank you everyone. 

 

While most of the solutions seems working as expected, i am leaning towards rand('table',...) function.

 

i am also looking to see other columns in the result data set along with ID and Department. looks like output statement only puts whatever is in the do loop.  what change should i make

FreelanceReinh
Jade | Level 19

@sas33 wrote:

While most of the solutions seems working as expected ...


Well, three solutions work according to specification A and two according to specification B (see definitions in my previous post).


... i am leaning towards rand('table',...) function.

That is, specification B, right?


i am also looking to see other columns in the result data set along with ID and Department. looks like output statement only puts whatever is in the do loop.  what change should i make


If those "other columns" already exist in dataset HAVE and you use a DATA step with a SET HAVE statement to create dataset WANT, then all columns will be present in dataset WANT by default. Moreover, you won't necessarily need a DO loop nor an OUTPUT statement.

sas33
Calcite | Level 5

Here is the code i am trying to use


data want;
set have;
array x{3} $ 10 ('IT' 'Business' 'HR');
call streaminit(1234);
Dept=x{rand('table',0.6,0.35)};
output;
drop x1-x3;
run;

ods graphics on;
proc freq data=want;
table dept / plots=FreqPlot(scale=percent);
run;

 

DeptFrequencyPercentCumulative
Frequency
Cumulative
Percent
Business34333.8334333.83
HR414.0438437.87
IT63062.131014100.00


i think i wanted to see exact percentages. For example in above case i want to see the distribution as 60%, 35% and 5%, however freq table showing little variance for each dept.

PaigeMiller
Diamond | Level 26

Exact percentages cannot be done completely at random, and if the number of total observations is not a multiple of 20, you can't get your exact percentages at all.

--
Paige Miller
FreelanceReinh
Jade | Level 19

@sas33: Your code is correct for specification B.


@sas33 wrote:

i think i wanted to see exact percentages.


This sounds more like specification A, hence @hashman's or either of my two solutions.


@sas33 wrote:

For example in above case i want to see the distribution as 60%, 35% and 5%, however freq table showing little variance for each dept.


That's what I meant by "random proportions." With n=1014 the closest you can get is, of course, 608x 'IT' (59.96%), 355x 'Business' (35.01%) and 51x 'HR' (5.03%). The probability of exactly attaining these values with the rand('table',...) method (i.e. specification B) is as tiny as 0.0015*, so the observed "little variance" is anything but surprising.

 

(With @hashman's solution you might occasionally get 609x 'IT' due the CEIL function and hence one 'Business' or 'HR' less, depending on the random seed, but his example random seed 7 happens to yield the "optimum" values 608, 355 and 51 for n=1014.)

 

* calculated as follows:

p=exp(lfact(1014)-lfact(608)-lfact(355)-lfact(51)+608*log(.6)+355*log(.35)+51*log(.05));

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1316 views
  • 3 likes
  • 6 in conversation