DATA Step, Macro, Functions and more

Building a lookup table to avoid duplicates

Accepted Solution Solved
Reply
Contributor
Posts: 26
Accepted Solution

Building a lookup table to avoid duplicates

Hello,

 

I have 20 different files (over 100 thousand records each), in each file, I want to generate a randomly generated number ID, and make sure that there are no duplicates across the years. My solution is to generate the random numbers for first file, then put those in a master list, then generate random numbers for the second file, check the master list to ensure those numbers don't already exist, if they do, replace, and add all the random numbers to the master list, then generate random numbers for the 3rd file and so on...

 

Is there an easy way of programming this? Are there procedures that are meant for this kind of process?

 

Thank you Smiley Happy    


Accepted Solutions
Solution
‎03-08-2017 10:55 AM
PROC Star
Posts: 7,474

Re: Building a lookup table to avoid duplicates

I didn't add in a check and balance for existing IDs, but the following uses your format on a test case of 20 files where each file has 100,000 records. It also does a test at the end to see if any duplicate IDs were created:

 

%macro buildem;
  %do i=1 %to 20;
    data have&i.;
      do x=1 to 100000;
        output;
      end;
    run;
  %end;
%mend buildem;
%buildem

%global i;
%macro assignem(file);
  data &file;
    set &file end=eof nobs=nobs;
    if _n_ eq 1 then do;
      do _n_=1 to &i.;
        id=put(1+int(1e10*ranuni(7657)),z10.) ;
      end;
    end;
    id=put(1+int(1e10*ranuni(7657)),z10.) ;
    if eof then do;
      call symput('i',&i+nobs);
    end;
  run;
%mend assignem;
%let i=1;
%assignem(have1)
%assignem(have2)
%assignem(have3)
%assignem(have4)
%assignem(have5)
%assignem(have6)
%assignem(have7)
%assignem(have8)
%assignem(have9)
%assignem(have10)
%assignem(have11)
%assignem(have12)
%assignem(have13)
%assignem(have14)
%assignem(have15)
%assignem(have16)
%assignem(have17)
%assignem(have18)
%assignem(have19)
%assignem(have20)

data all;
  set have:;
run;

proc sort data=all nodupkey;
  by id;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
Super User
Posts: 5,509

Re: Building a lookup table to avoid duplicates

For an ID variable, isn't a sequential number just as random as anything else you might assign?  Just assign values of 1 through 100,000 for the first data set, 100,001 through 200,000 for the second data set, etc.  Do you have some requirements down the road that makes these ID values any less "random" than a more complex approach?

 

If you need to select observations randomly later, PROC SURVEYSELECT can do that without requiring a randomly assigned ID variable.

Contributor
Posts: 26

Re: Building a lookup table to avoid duplicates

Posted in reply to Astounding

Unfortunately I can't use sequential IDs, it has to be randomly generated numbers.

Frequent Contributor
Posts: 83

Re: Building a lookup table to avoid duplicates

It sounds like you may want to look into something like a GUID/UUID.  I don't think SAS has that functionality built in, but I'm sure you could create some modular code to do it.  Or very likely someone already has if you start digging around.

Frequent Contributor
Posts: 83

Re: Building a lookup table to avoid duplicates

Actually, I just read my own link a bit closer, and it should actually be fairly simple to create a FCMP or Macro function that creates a v3 or v5 UUID/GUID since they primarily rely on MD5/SHA-1, both of which are already existing functions in SAS.

Super User
Posts: 11,343

Re: Building a lookup table to avoid duplicates

[ Edited ]

Assign the sequential and add a random number. Sort by the random number and drop the random number. Voila, the "sequential" values are now "random" numbers.

 

Maybe use a Z12. format to further "disquise" the numbers.

Trusted Advisor
Posts: 1,022

Re: Building a lookup table to avoid duplicates

 

I wouldn't do it quite like that.  Let's say your 20 datasets (named DS1, DS2, ... DS20)  have N1, N2, ... N20 observations, totalling to 2015632 observations:

 

Then generate 2,015,632 id's in sequence, and re-order them randomly (you could get this number using proc sql drawing from dictionary.tables, but for now, just look at your datasets and calculate the total observation count):

 

 

Step 1: Generate needed id's and order randomly: 

 

%let n_all=2015632;

 

data v_ranids /view=v_ranids;

   do id=1 to &n_all;

     rn=rand('uniform');

     output;

  end;

run;

proc sort data=v_ranids out=ranids (keep=id);

  by rn;

run;

 

Step 2: Now define a macro which will later be applied in sequence to each dataset,   DS1,  DS2, ....  It will support taking the first N1 id's in ranid for DS1, the next N2 for DS2, etc.

 

 

%macro ranid(dsname=);

  data new_&dsname (drop=next_frstobs);

    merge &dsname (in=inkeep)  ranid (firstobs=&frstob);

    if inkeep=0 then do;

      next_firstobs=&frstob+_n_-1;

      call symput("frstob",cats(next_frstobs));

      stop;

    end;

 run;

 %mend ranid;

 

 

Step 3.1 through 3.20: Now just run the macro once per dataset, making sure you set the macrovar FRSTOB to 1 before the first macro call. 

%let frstob=1;

%ranid(ds1);

%ranid(ds2);

...

%ranid(ds20);

 

 

Note: If your program stops after, say dataset 6, then you want to run the macros starting with data set 7 ... BUT ... you have to set the macrovar FRSTOB to    N1+N2+...+N6 + 1.

PROC Star
Posts: 7,474

Re: Building a lookup table to avoid duplicates

I think the following would meet your requirements:

 

%macro assign(file);
  data &file;
    set &file end=eof nobs=nobs;
    if _n_ eq 1 then call streaminit(123);
    do _n_=1 to &i.;
      id=rand("uniform");
    end;
    id=rand("uniform");
    if eof then do;
      call symput('i',&i+nobs);
    end;
  run;
%mend assign;
%let i=1;
%assign(have1);
%assign(have2);
%assign(have3);

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 26

Re: Building a lookup table to avoid duplicates

art297, this is simple and beautiful, but does this ensure the uniqueness of my IDs?

 

I actually don't want to generate these IDs for all records in each dataset, but for those where ID_old is missing. And this is what I'm using for creating the new ID.

ID_new = put (int (ranuni (7657) * 1e10) + 1 , z10.)

PROC Star
Posts: 7,474

Re: Building a lookup table to avoid duplicates

It would only assure the assignment of unique IDs if you were using it to assign new IDs for all of your records across all of your files. Since you want to keep preexisting IDs, no, it wouldn't work.

 

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 26

Re: Building a lookup table to avoid duplicates

I could maybe separate those rows so that I assign IDs for all records... Unfortunately, I still don't understand how it'd assure IDs would be unique tho...

PROC Star
Posts: 7,474

Re: Building a lookup table to avoid duplicates

Are you sure that you don't have any duplicate IDs across your existing files? The method I proposed would ensure the assignment of unique new IDs for every record. The logic is that it is using the same seed, but starting from the place where it left off with the previous file.

 

If you are sure that you have unique IDs, you could use proc sql to identify the maximum id across all of your files, then build the logic into the code to only assign IDs that are greater than that value.

 

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 26

Re: Building a lookup table to avoid duplicates

art297, yes I believe this could work.

All the rows I want to assign this ID to, have a bunch of blanks, some missing variables and some not missing, but I want them to have all different IDs.

Is there a way to keep my format of put (int (ranuni (7657) * 1e10) + 1 , z10.)?

Super User
Posts: 11,343

Re: Building a lookup table to avoid duplicates

May want to consider Rand('uniform') instead of ranuni.

 

And you just added an additional constraint missed in the original post: cannot be the same as any of the existing id values. That should have been explicitly stated up front.

How many missing values are there? Are these id's actually numeric or numeral valued character? What is the largest existing value?

It may be sufficient to increment past the largest existing.

Solution
‎03-08-2017 10:55 AM
PROC Star
Posts: 7,474

Re: Building a lookup table to avoid duplicates

I didn't add in a check and balance for existing IDs, but the following uses your format on a test case of 20 files where each file has 100,000 records. It also does a test at the end to see if any duplicate IDs were created:

 

%macro buildem;
  %do i=1 %to 20;
    data have&i.;
      do x=1 to 100000;
        output;
      end;
    run;
  %end;
%mend buildem;
%buildem

%global i;
%macro assignem(file);
  data &file;
    set &file end=eof nobs=nobs;
    if _n_ eq 1 then do;
      do _n_=1 to &i.;
        id=put(1+int(1e10*ranuni(7657)),z10.) ;
      end;
    end;
    id=put(1+int(1e10*ranuni(7657)),z10.) ;
    if eof then do;
      call symput('i',&i+nobs);
    end;
  run;
%mend assignem;
%let i=1;
%assignem(have1)
%assignem(have2)
%assignem(have3)
%assignem(have4)
%assignem(have5)
%assignem(have6)
%assignem(have7)
%assignem(have8)
%assignem(have9)
%assignem(have10)
%assignem(have11)
%assignem(have12)
%assignem(have13)
%assignem(have14)
%assignem(have15)
%assignem(have16)
%assignem(have17)
%assignem(have18)
%assignem(have19)
%assignem(have20)

data all;
  set have:;
run;

proc sort data=all nodupkey;
  by id;
run;

Art, CEO, AnalystFinder.com

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 250 views
  • 5 likes
  • 7 in conversation