Generating Unique Random Values by Group

cghost · Posted 05-09-2018 09:33 PM

Hi All,

I'm trying to come up with a program to randomly generate UserIDs containing the user's first name followed by a string of 4 digits. Making sure the 4-digit integers themselves are unique is easy enough based on this helpful previous community post, but now I have a situation where I need to verify that the integers are unique within name groups (i.e., all people named 'Fred') while I don't care if they're repeated between names. I figure I'll need to continue doing this with arrays, as in the linked post, but I can't quite figure out the syntax and logic, particularly when it comes to dynamically creating columns and setting default values for character variables. It seems there would be two ways to do this. The first would involve concatenating the digits with the first names right from the get go and then comparing them within a large character array to hunt for duplicates. The second would involve doing a (temporary?) numeric array with my full range of 4 digit values inside a do loop for each of the unique names I have. I'm not sure which would be the easiest or most efficient, but the latter seems like it would be easier to accomplish. Right now when we do this process (once or twice a week) there's a lot of manual de-duping and playing with ranges to get enough unique UserIds for our growing pool of users, so I'm hoping to eliminate that extra work.

Pretend my work.users file looks something like this before sorting, but with hundreds more rows:

Id     FirstName
1      John
2      John
3      Jane
4      David
5      David
6      David
7      Susan
8      Jane
9      Elizabeth
10    Leonard

This is what I have so far, but I'm not at all confident it's working correctly as the random isAssigned array is showing that isAssigned values are carrying between different groups (first names). I'd like to either cut this off and make sure the randomization is triggering specifically within groups of like first names (i.e., the three Davids) or just compare concatenated ids that would look like 'David5955' 'David1497' 'David8528'.

proc sql;
select count(distinct FirstName) into: names
from work.users;
select max(value) into: recur
from
	(select count(FirstName) as value
	from work.users
	group by FirstName);
quit;
proc sort data=work.users;
by FirstName;
run;

data logtest;
set work.users (keep = Id FirstName);
by FirstName;
	array n[&recur.] n1-n%sysfunc(trim(&recur.));
			call streaminit(1);
			array isAssigned[1:9999] (9999*0);
				do while (1);
					newId = (1000 + ceil((1+9999-1000)*rand("uniform")));
					if isAssigned[newId] then continue;
					isAssigned[newId] = 1;
					leave;
				end;
			output;
run;

PGStats · Posted 05-09-2018 11:48 PM

Assuming that you have a set of old users (already given userId's) and new users for which you want to generate new distinct but random user Id's. Here is one way to do this with arrays :

data oldUsers;
length firstName userId $16;
input Id FirstName $ userId $;
datalines;
1      John      John1234
2      John      John2345 
3      Jane      Jane3456 
4      David     David4567  
5      David     David5678   
6      David     David6789  
7      Susan     Susan2134  
8      Jane      Jane3133 
9      Elizabeth Elizabeth4155  
10     Leonard   Leonard5166
; 

data newUsers;
length firstName $16;
input id firstName $;
datalines;
11 David
12 Jane
13 Donald
;

proc sort data=oldUsers; by firstName; run;
proc sort data=newUsers; by firstName; run;

data allUsers;
array _a{1000:9999} _temporary_;
set oldUsers newUsers(in=new);
by firstName;
if first.firstname then call missing(of _a{*});
if new then do;
    do i = 1 to 99999 until(not _a{no});
        no = floor(1000 + 9000*rand("uniform"));
        end;
    if not _a{no} then userId = cats(firstName, no);
    else put "WARNING: No user Id found for" Id= firstName=;
    end;
else no = input(compress(userId,,'kd'),best.);
_a{no} = 1;
drop i no;
run;

proc print data=allUsers noobs; run;

PG

ballardw · Posted 05-10-2018 10:41 AM

You actually have a very serious theoretical potential if not actual problem: if you have 10001 or more members of any group you cannot accomplish this. (and to get to 10000 you have allow use of 0000 to 9999 as the suffix).

Why is keeping the "group" identity in the single variable critical? If you need to process or keep the results generally using the group value as one of multiple variables should suffice.

cghost · Posted 05-10-2018 07:02 PM

That's a valid point, but I'm not sure if it's feasible to do this kind of generation and duplicate detection with character arrays and tens of thousands of possible concatenated name/integer values. Right now the 4 character thing is a hard set rule, but this is a project that gets revisited every couple years, and it's easy to imagine that we could start carrying the logins between years.

Do you have a suggested solution, @ballardw?

Unfortunately I did not have a chance to play with the solution provided by @PGStats today, but I am hoping to test it out tomorrow or maybe even over the weekend.

ballardw · Posted 05-10-2018 07:55 PM

By "revisit" do you mean that you will be doing this again and again and have to check against all the previous years of "fred" assignments to avoid duplications with previous years?

You just improved on your chances of hitting that theoretical problem with completely running out of 4 digit values.

I might start by making an "all possible" data set similar to:

data work.example;
   length name $ 10 code $ 14;
   do name='Fred','John';
      do num= 1 to 9999;
         code = cats(name,put(num,z4.));
         random = rand('uniform');
         used=0;
         output;
      end;
   end;
run;

as a base data set. If you get "new" names then create a similar set from the new names and append to the base set.

proc sort data=work.example;
   by name random;
run;

Gives random order to select by using the smallest random value within a name that has not been marked as used.

I think, have not had time to test, but a use of HASH with an existing data set of current/previously assigned values would be a quick way to update the base data set variable USED to 1, indicating current use.

Generating Unique Random Values by Group

Re: Generating Unique Random Values by Group

Re: Generating Unique Random Values by Group

Re: Generating Unique Random Values by Group

Re: Generating Unique Random Values by Group

Generating Unique Random Values by Group

Re: Generating Unique Random Values by Group

Re: Generating Unique Random Values by Group

Re: Generating Unique Random Values by Group

Re: Generating Unique Random Values by Group

Click image to register for webinar

Classroom Training Available!