BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Shirin
Obsidian | Level 7

Hello,

 

I have 20 different files (over 100 thousand records each), in each file, I want to generate a randomly generated number ID, and make sure that there are no duplicates across the years. My solution is to generate the random numbers for first file, then put those in a master list, then generate random numbers for the second file, check the master list to ensure those numbers don't already exist, if they do, replace, and add all the random numbers to the master list, then generate random numbers for the 3rd file and so on...

 

Is there an easy way of programming this? Are there procedures that are meant for this kind of process?

 

Thank you 🙂    

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

I didn't add in a check and balance for existing IDs, but the following uses your format on a test case of 20 files where each file has 100,000 records. It also does a test at the end to see if any duplicate IDs were created:

 

%macro buildem;
  %do i=1 %to 20;
    data have&i.;
      do x=1 to 100000;
        output;
      end;
    run;
  %end;
%mend buildem;
%buildem

%global i;
%macro assignem(file);
  data &file;
    set &file end=eof nobs=nobs;
    if _n_ eq 1 then do;
      do _n_=1 to &i.;
        id=put(1+int(1e10*ranuni(7657)),z10.) ;
      end;
    end;
    id=put(1+int(1e10*ranuni(7657)),z10.) ;
    if eof then do;
      call symput('i',&i+nobs);
    end;
  run;
%mend assignem;
%let i=1;
%assignem(have1)
%assignem(have2)
%assignem(have3)
%assignem(have4)
%assignem(have5)
%assignem(have6)
%assignem(have7)
%assignem(have8)
%assignem(have9)
%assignem(have10)
%assignem(have11)
%assignem(have12)
%assignem(have13)
%assignem(have14)
%assignem(have15)
%assignem(have16)
%assignem(have17)
%assignem(have18)
%assignem(have19)
%assignem(have20)

data all;
  set have:;
run;

proc sort data=all nodupkey;
  by id;
run;

Art, CEO, AnalystFinder.com

 

View solution in original post

15 REPLIES 15
Astounding
PROC Star

For an ID variable, isn't a sequential number just as random as anything else you might assign?  Just assign values of 1 through 100,000 for the first data set, 100,001 through 200,000 for the second data set, etc.  Do you have some requirements down the road that makes these ID values any less "random" than a more complex approach?

 

If you need to select observations randomly later, PROC SURVEYSELECT can do that without requiring a randomly assigned ID variable.

Shirin
Obsidian | Level 7

Unfortunately I can't use sequential IDs, it has to be randomly generated numbers.

Sven111
Pyrite | Level 9

It sounds like you may want to look into something like a GUID/UUID.  I don't think SAS has that functionality built in, but I'm sure you could create some modular code to do it.  Or very likely someone already has if you start digging around.

Sven111
Pyrite | Level 9

Actually, I just read my own link a bit closer, and it should actually be fairly simple to create a FCMP or Macro function that creates a v3 or v5 UUID/GUID since they primarily rely on MD5/SHA-1, both of which are already existing functions in SAS.

ballardw
Super User

Assign the sequential and add a random number. Sort by the random number and drop the random number. Voila, the "sequential" values are now "random" numbers.

 

Maybe use a Z12. format to further "disquise" the numbers.

mkeintz
PROC Star

 

I wouldn't do it quite like that.  Let's say your 20 datasets (named DS1, DS2, ... DS20)  have N1, N2, ... N20 observations, totalling to 2015632 observations:

 

Then generate 2,015,632 id's in sequence, and re-order them randomly (you could get this number using proc sql drawing from dictionary.tables, but for now, just look at your datasets and calculate the total observation count):

 

 

Step 1: Generate needed id's and order randomly: 

 

%let n_all=2015632;

 

data v_ranids /view=v_ranids;

   do id=1 to &n_all;

     rn=rand('uniform');

     output;

  end;

run;

proc sort data=v_ranids out=ranids (keep=id);

  by rn;

run;

 

Step 2: Now define a macro which will later be applied in sequence to each dataset,   DS1,  DS2, ....  It will support taking the first N1 id's in ranid for DS1, the next N2 for DS2, etc.

 

 

%macro ranid(dsname=);

  data new_&dsname (drop=next_frstobs);

    merge &dsname (in=inkeep)  ranid (firstobs=&frstob);

    if inkeep=0 then do;

      next_firstobs=&frstob+_n_-1;

      call symput("frstob",cats(next_frstobs));

      stop;

    end;

 run;

 %mend ranid;

 

 

Step 3.1 through 3.20: Now just run the macro once per dataset, making sure you set the macrovar FRSTOB to 1 before the first macro call. 

%let frstob=1;

%ranid(ds1);

%ranid(ds2);

...

%ranid(ds20);

 

 

Note: If your program stops after, say dataset 6, then you want to run the macros starting with data set 7 ... BUT ... you have to set the macrovar FRSTOB to    N1+N2+...+N6 + 1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
art297
Opal | Level 21

I think the following would meet your requirements:

 

%macro assign(file);
  data &file;
    set &file end=eof nobs=nobs;
    if _n_ eq 1 then call streaminit(123);
    do _n_=1 to &i.;
      id=rand("uniform");
    end;
    id=rand("uniform");
    if eof then do;
      call symput('i',&i+nobs);
    end;
  run;
%mend assign;
%let i=1;
%assign(have1);
%assign(have2);
%assign(have3);

Art, CEO, AnalystFinder.com

 

Shirin
Obsidian | Level 7

art297, this is simple and beautiful, but does this ensure the uniqueness of my IDs?

 

I actually don't want to generate these IDs for all records in each dataset, but for those where ID_old is missing. And this is what I'm using for creating the new ID.

ID_new = put (int (ranuni (7657) * 1e10) + 1 , z10.)

art297
Opal | Level 21

It would only assure the assignment of unique IDs if you were using it to assign new IDs for all of your records across all of your files. Since you want to keep preexisting IDs, no, it wouldn't work.

 

Art, CEO, AnalystFinder.com

 

Shirin
Obsidian | Level 7

I could maybe separate those rows so that I assign IDs for all records... Unfortunately, I still don't understand how it'd assure IDs would be unique tho...

art297
Opal | Level 21

Are you sure that you don't have any duplicate IDs across your existing files? The method I proposed would ensure the assignment of unique new IDs for every record. The logic is that it is using the same seed, but starting from the place where it left off with the previous file.

 

If you are sure that you have unique IDs, you could use proc sql to identify the maximum id across all of your files, then build the logic into the code to only assign IDs that are greater than that value.

 

Art, CEO, AnalystFinder.com

 

Shirin
Obsidian | Level 7

art297, yes I believe this could work.

All the rows I want to assign this ID to, have a bunch of blanks, some missing variables and some not missing, but I want them to have all different IDs.

Is there a way to keep my format of put (int (ranuni (7657) * 1e10) + 1 , z10.)?

ballardw
Super User

May want to consider Rand('uniform') instead of ranuni.

 

And you just added an additional constraint missed in the original post: cannot be the same as any of the existing id values. That should have been explicitly stated up front.

How many missing values are there? Are these id's actually numeric or numeral valued character? What is the largest existing value?

It may be sufficient to increment past the largest existing.

art297
Opal | Level 21

I didn't add in a check and balance for existing IDs, but the following uses your format on a test case of 20 files where each file has 100,000 records. It also does a test at the end to see if any duplicate IDs were created:

 

%macro buildem;
  %do i=1 %to 20;
    data have&i.;
      do x=1 to 100000;
        output;
      end;
    run;
  %end;
%mend buildem;
%buildem

%global i;
%macro assignem(file);
  data &file;
    set &file end=eof nobs=nobs;
    if _n_ eq 1 then do;
      do _n_=1 to &i.;
        id=put(1+int(1e10*ranuni(7657)),z10.) ;
      end;
    end;
    id=put(1+int(1e10*ranuni(7657)),z10.) ;
    if eof then do;
      call symput('i',&i+nobs);
    end;
  run;
%mend assignem;
%let i=1;
%assignem(have1)
%assignem(have2)
%assignem(have3)
%assignem(have4)
%assignem(have5)
%assignem(have6)
%assignem(have7)
%assignem(have8)
%assignem(have9)
%assignem(have10)
%assignem(have11)
%assignem(have12)
%assignem(have13)
%assignem(have14)
%assignem(have15)
%assignem(have16)
%assignem(have17)
%assignem(have18)
%assignem(have19)
%assignem(have20)

data all;
  set have:;
run;

proc sort data=all nodupkey;
  by id;
run;

Art, CEO, AnalystFinder.com

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 1315 views
  • 5 likes
  • 7 in conversation