topic Re: Building a lookup table to avoid duplicates in SAS Programming

Building a lookup table to avoid duplicates

Shirin — Tue, 07 Mar 2017 19:45:14 GMT

Hello,

I have 20 different files (over 100 thousand records each), in each file, I want to generate a randomly generated number ID, and make sure that there are no duplicates across the years. My solution is to generate the random numbers for first file, then put those in a master list, then generate random numbers for the second file, check the master list to ensure those numbers don't already exist, if they do, replace, and add all the random numbers to the master list, then generate random numbers for the 3rd file and so on...

Is there an easy way of programming this? Are there procedures that are meant for this kind of process?

Thank you 🙂

Re: Building a lookup table to avoid duplicates

Astounding — Tue, 07 Mar 2017 20:08:28 GMT

For an ID variable, isn't a sequential number just as random as anything else you might assign? Just assign values of 1 through 100,000 for the first data set, 100,001 through 200,000 for the second data set, etc. Do you have some requirements down the road that makes these ID values any less "random" than a more complex approach?

If you need to select observations randomly later, PROC SURVEYSELECT can do that without requiring a randomly assigned ID variable.

Re: Building a lookup table to avoid duplicates

Shirin — Tue, 07 Mar 2017 20:10:28 GMT

Unfortunately I can't use sequential IDs, it has to be randomly generated numbers.

Re: Building a lookup table to avoid duplicates

Sven111 — Tue, 07 Mar 2017 20:22:03 GMT

It sounds like you may want to look into something like a GUID/UUID. I don't think SAS has that functionality built in, but I'm sure you could create some modular code to do it. Or very likely someone already has if you start digging around.

Re: Building a lookup table to avoid duplicates

Sven111 — Tue, 07 Mar 2017 20:28:37 GMT

Actually, I just read my own link a bit closer, and it should actually be fairly simple to create a FCMP or Macro function that creates a v3 or v5 UUID/GUID since they primarily rely on MD5/SHA-1, both of which are already existing functions in SAS.

Re: Building a lookup table to avoid duplicates

mkeintz — Tue, 07 Mar 2017 20:32:47 GMT

I wouldn't do it quite like that. Let's say your 20 datasets (named DS1, DS2, ... DS20) have N1, N2, ... N20 observations, totalling to 2015632 observations:

Then generate 2,015,632 id's in sequence, and re-order them randomly (you could get this number using proc sql drawing from dictionary.tables, but for now, just look at your datasets and calculate the total observation count):

Step 1: Generate needed id's and order randomly:

%let n_all=2015632;

data v_ranids /view=v_ranids;

do id=1 to &n_all;

rn=rand('uniform');

output;

end;

run;

proc sort data=v_ranids out=ranids (keep=id);

by rn;

run;

Step 2: Now define a macro which will later be applied in sequence to each dataset, DS1, DS2, .... It will support taking the first N1 id's in ranid for DS1, the next N2 for DS2, etc.

%macro ranid(dsname=);

data new_&dsname (drop=next_frstobs);

merge &dsname (in=inkeep) ranid (firstobs=&frstob);

if inkeep=0 then do;

next_firstobs=&frstob+_n_-1;

call symput("frstob",cats(next_frstobs));

stop;

end;

run;

%mend ranid;

Step 3.1 through 3.20: Now just run the macro once per dataset, making sure you set the macrovar FRSTOB to 1 before the first macro call.

%let frstob=1;

%ranid(ds1);

%ranid(ds2);

...

%ranid(ds20);

Note: If your program stops after, say dataset 6, then you want to run the macros starting with data set 7 ... BUT ... you have to set the macrovar FRSTOB to N1+N2+...+N6 + 1.

Re: Building a lookup table to avoid duplicates

ballardw — Tue, 07 Mar 2017 20:47:46 GMT

Assign the sequential and add a random number. Sort by the random number and drop the random number. Voila, the "sequential" values are now "random" numbers.

Maybe use a Z12. format to further "disquise" the numbers.

Re: Building a lookup table to avoid duplicates

art297 — Tue, 07 Mar 2017 20:52:13 GMT

I think the following would meet your requirements:

%macro assign(file);
  data &file;
    set &file end=eof nobs=nobs;
    if _n_ eq 1 then call streaminit(123);
    do _n_=1 to &i.;
      id=rand("uniform");
    end;
    id=rand("uniform");
    if eof then do;
      call symput('i',&i+nobs);
    end;
  run;
%mend assign;
%let i=1;
%assign(have1);
%assign(have2);
%assign(have3);

Art, CEO, AnalystFinder.com

Re: Building a lookup table to avoid duplicates

Shirin — Tue, 07 Mar 2017 21:12:13 GMT

art297, this is simple and beautiful, but does this ensure the uniqueness of my IDs?

I actually don't want to generate these IDs for all records in each dataset, but for those where ID_old is missing. And this is what I'm using for creating the new ID.

ID_new = put (int (ranuni (7657) * 1e10) + 1 , z10.)

Re: Building a lookup table to avoid duplicates

art297 — Tue, 07 Mar 2017 21:17:15 GMT

It would only assure the assignment of unique IDs if you were using it to assign new IDs for all of your records across all of your files. Since you want to keep preexisting IDs, no, it wouldn't work.

Art, CEO, AnalystFinder.com

Re: Building a lookup table to avoid duplicates

Shirin — Tue, 07 Mar 2017 21:23:03 GMT

I could maybe separate those rows so that I assign IDs for all records... Unfortunately, I still don't understand how it'd assure IDs would be unique tho...

Re: Building a lookup table to avoid duplicates

art297 — Tue, 07 Mar 2017 21:30:43 GMT

Are you sure that you don't have any duplicate IDs across your existing files? The method I proposed would ensure the assignment of unique new IDs for every record. The logic is that it is using the same seed, but starting from the place where it left off with the previous file.

If you are sure that you have unique IDs, you could use proc sql to identify the maximum id across all of your files, then build the logic into the code to only assign IDs that are greater than that value.

Art, CEO, AnalystFinder.com

Re: Building a lookup table to avoid duplicates

Shirin — Tue, 07 Mar 2017 21:35:49 GMT

art297, yes I believe this could work.

All the rows I want to assign this ID to, have a bunch of blanks, some missing variables and some not missing, but I want them to have all different IDs.

Is there a way to keep my format of put (int (ranuni (7657) * 1e10) + 1 , z10.)?

Re: Building a lookup table to avoid duplicates

ballardw — Tue, 07 Mar 2017 22:44:01 GMT

May want to consider Rand('uniform') instead of ranuni.

And you just added an additional constraint missed in the original post: cannot be the same as any of the existing id values. That should have been explicitly stated up front.

How many missing values are there? Are these id's actually numeric or numeral valued character? What is the largest existing value?

It may be sufficient to increment past the largest existing.

Re: Building a lookup table to avoid duplicates

art297 — Tue, 07 Mar 2017 23:24:18 GMT

I didn't add in a check and balance for existing IDs, but the following uses your format on a test case of 20 files where each file has 100,000 records. It also does a test at the end to see if any duplicate IDs were created:

%macro buildem;
  %do i=1 %to 20;
    data have&i.;
      do x=1 to 100000;
        output;
      end;
    run;
  %end;
%mend buildem;
%buildem

%global i;
%macro assignem(file);
  data &file;
    set &file end=eof nobs=nobs;
    if _n_ eq 1 then do;
      do _n_=1 to &i.;
        id=put(1+int(1e10*ranuni(7657)),z10.) ;
      end;
    end;
    id=put(1+int(1e10*ranuni(7657)),z10.) ;
    if eof then do;
      call symput('i',&i+nobs);
    end;
  run;
%mend assignem;
%let i=1;
%assignem(have1)
%assignem(have2)
%assignem(have3)
%assignem(have4)
%assignem(have5)
%assignem(have6)
%assignem(have7)
%assignem(have8)
%assignem(have9)
%assignem(have10)
%assignem(have11)
%assignem(have12)
%assignem(have13)
%assignem(have14)
%assignem(have15)
%assignem(have16)
%assignem(have17)
%assignem(have18)
%assignem(have19)
%assignem(have20)

data all;
  set have:;
run;

proc sort data=all nodupkey;
  by id;
run;

Art, CEO, AnalystFinder.com

Re: Building a lookup table to avoid duplicates

Ksharp — Wed, 08 Mar 2017 03:12:17 GMT

data have;
do i=1 to 100;
 id=uuidgen();
 output;
end;
run;
proc print noobs;run;