Re: Need Help With Randomization With Special Criteria

biglerc · Posted 10-05-2021 05:19 PM

Hello,

I need to randomly match people from 2 lists, on a 1:1 basis. One list contains mentors and one list contains mentees.

I know I can use proc surveyselect for randomization, however there is some criteria that I don't know how to incorporate:

1) Mentees must have a lower Level than their mentor

2) Mentees must not have the same subfunction as their mentor.

Does anyone have any ideas how to do this? Sample data is attached.

Thank you in advance!!

ballardw · Posted 10-05-2021 07:02 PM

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

Or find posts on the forum with data steps showing data and make one similar.

biglerc · Posted 10-06-2021 12:16 PM

Hi,

My apologies - the Excel issue didn't even occur to me. I have attached SAS datasets now.

Thanks!

Colett

FreelanceReinh · Posted 10-06-2021 08:28 AM

Hello @biglerc,

Here's an outline of a simple algorithm in a DATA step that might work for your problem:

Create a hash object H (along with a hash iterator object) containing the mentor dataset (key: mentor ID).
Read the first (later: next) observation of the mentee dataset (SET statement), traverse H and write the IDs of mentors satisfying the criteria for the current mentee to an array or another hash object S whose keys are the integers 1, 2, 3, ..., k (if k suitable mentors have been found).
Retrieve the mentor ID with key rand('integer',k) from S and write the mentee-mentor pair to the output dataset.
Remove the corresponding hash item from H.
Clear the array or hash object S.
Repeat steps 2 - 5 until the last observation of the mentee dataset has been processed.

There is one potential issue with this algorithm: Suppose that close to the end of the mentee dataset there is a mentee X with only a few suitable mentors. It might happen (with a probability depending on the data) that these few mentors have already been selected for other mentees so that no suitable mentor is left when it's mentee X's turn (k=0). To mitigate this problem one could determine the number m of suitable mentors for each mentee in a preliminary step and sort the mentee dataset by m. But even this approach does not guarantee that the maximum possible number of mentees will be assigned a mentor. Techniques to cope with this contingency range from simple repetitions of the randomization (with different random seeds) to advanced combinatorial optimization using SAS/OR procedures (e.g. PROC CLP?).

biglerc · Posted 10-06-2021 01:11 PM

Thank you so much! I am afraid hashes are way over my head. 😞 But thank you for your reply!!

FreelanceReinh · Posted 10-07-2021 07:24 AM

@biglerc wrote:

I am afraid hashes are way over my head. 😞

The hash object comes with a learning curve, but it's a worthwhile addition to the toolkit of every experienced DATA step programmer. There are great textbooks about it available.

Here's an implementation of the plan (in its simplest form) I outlined in my previous post:

data want(drop=k);
call streaminit(27182818);
if _n_=1 then do;
  if 0 then set mentor;
  dcl hash h(dataset: 'mentor');
  h.definekey('mentor_id');
  h.definedata(all: 'y');
  h.definedone();
  dcl hiter hi('h');
end;
set mentee;
array s[999] $2 _temporary_; /* increase dim. of S or define it as a hash obj. if >999 mentors */
do while(hi.next()=0);
  if .<input(mentee_level,32.)<input(mentor_level,32.) & mentee_subfunc ne mentor_subfunc
  then do;
    k=sum(k,1);
    s[k]=mentor_id;
  end;
end;
if k then do; /* i.e., if at least one suitable mentor has been found */
  h.find(key: s[rand('integer',k)]);
  h.remove();
  call missing(of s[*]);
end;
else do;
  put 'WAR' 'NING: No suitable mentor found for ' mentee_id=;
  call missing(of mentor_:);
end;
run;

It works for your sample data (thanks for providing the datasets!).

The basic structure of this DATA step is:

data want;
set mentee;
... code to randomly select suitable mentor ...
run;

So, dataset WANT is actually a copy of dataset MENTEE with all variables from the MENTOR dataset added: mentor_id, mentor_subfunc and mentor_level contain the data of a randomly selected mentor satisfying the criteria for the respective mentee. They are empty if no suitable mentor was found (e.g., because all suitable mentors were assigned to other mentees in earlier observations).

The code before the set mentee statement

sets an arbitrary positive random seed (needed to get reproducible results from the RAND function used later)
adds the variables of the MENTOR dataset to the program data vector (technical requirement for using the hash object)
declares and creates a hash object H containing all values of dataset MENTOR, assuming that mentor_id is a unique key in that dataset
declares and creates a hash iterator object HI which will be used to traverse the hash object H.

The first few elements of temporary array S will be populated with the IDs of those mentors who satisfy the criteria for the current mentee (read from the MENTEE dataset by the SET statement) and who haven't been assigned to a mentee in an earlier observation -- if any. Type and length of the array ($2) elements match the characteristics of variable mentor_id (a character variable of length 2). As defined above, it can accommodate up to 999 mentor IDs, but the dimension can be increased if needed.

You can think of the DO-WHILE loop (using the NEXT method of the hash iterator object HI) as a kind of SET statement applied to the "copy" of the MENTOR dataset stored in hash object H (or what's left of it after removing mentors assigned to other mentees), as it reads one observation per iteration. I hope the IF condition reflects your selection criteria appropriately:

I assumed that "lower level" means a smaller, non-missing numeric value of the mentee_level variable compared to the mentor_level variable (e.g. 8 < 101). Unfortunately, these variables are character variables so that I had to use the INPUT function to extract the numeric values.
For the second criterion ("not the same subfunction") I used a simple string comparison. Modifications would be needed if values of mentee_subfunc or mentor_subfunc weren't consistent regarding case, spelling, etc.

Once the hash object containing the mentor data has been read from beginning to end, the IDs of suitable mentor "candidates" for the current mentee are found in s[1], s[2], ..., s[k]. Each of the k candidates has the same probability of being selected by the RAND function. The FIND method of the hash object retrieves the values of mentor_id, mentor_subfunc and mentor_level of the selected mentor, who is then removed from the hash object to prevent a duplicate selection. The list of candidates is cleared by the CALL MISSING routine.

In the unfortunate case that no suitable mentor has been found (which doesn't happen with your sample data) a warning is written to the log and the mentor_... variables will be blank in dataset WANT for such an observation.

At the end of each iteration of the DATA step the usual implied OUTPUT statement writes the mentee and assigned mentor data to dataset WANT.

biglerc · Posted 10-07-2021 12:12 PM

So amazing, thank you for all that information and examples! It seems I do need to get this knowledge under my belt as my programming needs at work are getting more and more complex! I will pour over this information today and try to implement it.

s_lassen · Posted 10-07-2021 10:01 AM

Here is a solution that may work:

First, create a randomly ordered copy of the mentors table:

proc sql;                     
  create table rand as select 
  ID as MentorID,             
  Level as MentorLevel,       
  Subfunction as MentorSub,   
  0 as used from mentors      
  order by rand('UNIFORM');   
quit;

The USED field is put in as a marker that that mentor has been used.

Then read the mentees, and look for the first suitable mentor in the randomly ordered table:

data want(drop=used) not_found(keep=ID Subfunction level) rand;
  set mentees;                                                 
  do _N_=1 to n_mentors;                                       
    modify rand point=_N_ nobs=n_mentors;                      
    if used then continue;                                     
    if level>=MentorLevel then continue;                       
    if Subfunction=MentorSub then continue;                    
    used=1;                                                    
    replace rand;                                              
    output want;                                               
    delete;                                                    
    end;                                                       
  output not_found;                                            
run;

The RAND table has to be on the DATA statement, as it is updated in the MODIFY/REPLACE statements.

You may need to change the variable names, I did not copy yours exactly.

If you have some mentees with a high level, you may get a few records in the NOT_FOUND table. But if your data contains a reasonable number of high-value mentors, you can try running again (you will have to drop the RAND table first) until you get 0 records in NOT_FOUND. Or you can sort the mentees by descending level before the last data step, that should do the job in all cases where there are suitable mentors for all mentees.

biglerc · Posted 10-07-2021 12:13 PM

This is great, thank you so much!! That looks like it will meet my needs and I appreciate your time in looking into this!!

biglerc · Posted 10-07-2021 01:42 PM

Hi! This logic/code has been a great learning experience for me! Thank you so much.

I have a question, because I do need to run it again. After the first run, about half of the mentees were assigned mentors. In the resulting rand table there are still plenty of mentors that can be assigned to the mentees in the not_found datasets.

When I run it the first time it works great. Then I remove the mentors and mentees who have been matched up and recreate the rand dataset based on the remaining mentors. There should be additional matches but it's not finding any.

Can you help me understand how I rerun the process and drop the rand table? Sorry if I'm missing something obvious!

s_lassen · Posted 10-11-2021 04:45 AM

No idea. You will probably have to show your log if I should help more.

Maybe you should try my other suggestion: sort the mentees by descending level before running the job. That may help.

biglerc · Posted 10-11-2021 11:51 AM

Thank you so much!!

biglerc · Posted 10-11-2021 11:59 AM

No luck with the sorting, unfortunately. I don't know if you feel like looking at this once more for me, but if so, I've attached the log, code and actual datasets. Note there are more mentors than mentees so not every mentor will be assigned a mentee. Thanks for any additional time you can give me on this!

s_lassen · Posted 10-12-2021 05:46 AM

Thinking a bit more about it:

If you have run the first time and too many high-level mentors have been taken by low-level mentees, there may not be enough high-level mentors to go around. My suggestion is that you try starting all over again (throw out the random assignments made so far) with the original tables. This time sort the mentees by descending level before the first run, that ought to do the trick.

If not, let me hear.

SAS Innovate 2025: Call for Content

Classroom Training Available!