Help using Base SAS procedures

How to quickly identify variables that distinguish records uniquely

Reply
Super Contributor
Posts: 338

How to quickly identify variables that distinguish records uniquely

Hi Community,

I have a data set like below where I can easily say the variables that make the records unique are Product and Bank_number.

data a;

Informat current_date date9.;

input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41 Bank_number;

Format current_date date9.;

datalines;

31JUL2010 1111   5 Kim Lee Personal Loan 10

31JUL2010 1111   1 Kim Lee Personal Loan 20

31JUL2010 1111  25 Kim Lee Personal Loan 30

31JUL2010 1111  75 Kim Lee Res. Mortgage 30

31JUL2010 1111   8 Kim Lee Personal OD   40

;

run;

However, my real data set has over million records and 37 variables. I started manually selecting variables one by one and examining using trial and error process to find those distinguishing variables. But this takes me no where, it seems.

Q. Is there any method that I can use to identify the variable (or combination of variables) that would distinguish records uniquely.

Thank you for your help

Mirisage

Trusted Advisor
Posts: 1,300

Re: How to quickly identify variables that distinguish records uniquely

This would depend on a single variable providing the uniqueness instead of a combination of variables like acct_no and bank_number.  With your example data the only item that is unique itself is balance.  You could expand this to account for cross-tabs in the frequency table but I do not provide example for that.

data foo;

infile cards truncover;

input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;

list;

format current_date date9.;

cards;

31JUL2010 1111  5 Kim Lee Personal Loan 10

31JUL2010 1111  1 Kim Lee Personal Loan 20

31JUL2010 1111 25 Kim Lee Personal Loan 30

31JUL2010 1111 75 Kim Lee Res. Mortgage 30

31JUL2010 1111  8 Kim Lee Personal OD   40

;

run;

ods exclude all;

ods output OneWayFreqs=f(keep=table frequency);

proc freq data=foo;

tables _all_ / missing;

run;

ods exclude none;

proc sql;

select distinct scan(table,2) "Unique Variable"

   into :u_var

   from f

  group by table

having frequency=1 and frequency=min(frequency) and frequency=max(frequency);

quit;

Super User
Super User
Posts: 6,500

Re: How to quickly identify variables that distinguish records uniquely

Depends on your definition of quick. You can get PROC SUMMARY to test all possible combinations and then find the ones that generate uniqueness.  I will leave it as an exercise to translate the binary string _TYPE_ into variable names.

proc summary data=sashelp.class chartype missing;

  class _all_;

  output out=summary / levels ways;

run;

proc sql noprint ;

  create table unique as

  select distinct _way_,_type_

  from summary

  group by _type_

  having max(_freq_)=1

  order by _way_,_type_

;

quit;

Super User
Posts: 9,681

Re: How to quickly identify variables that distinguish records uniquely

That is really not easy. You should list all the combination of these variables . and

use proc freq or prc sql to check whether their frequency equal 1 .

Ksharp

Contributor
Posts: 71

Re: How to quickly identify variables that distinguish records uniquely

Ksharp is correct.  It sounds like all you need to do is a Proc Freq on whichever variables you think are unique and ensure the count is never greater than 1.  This will help you identify which variable is a unique identifier.

The problem with not having a unique identifier built into the process or system from the start is that how can you be sure a future transaction will make it so that your proc freq ends up having a count of greater than 1?  You might do the proc freq today and think everything is fine.  But is there potential for it to change tomorrow?

You may need to combine 2 fields into a new variable to better ensure future uniqueness.  Nothing wrong with that...just be careful to think about as many scenarios as you can (i.e. choose your variables wisely).

PROC Star
Posts: 7,363

Re: How to quickly identify variables that distinguish records uniquely

Mirisage,

I really don't have time to either test or optimize the following, but it might at least give you an idea of how to solve your problem.  Of course, there is always the chance that this actually does what you are trying to accomplish:

data have;

  infile cards truncover;

  input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;

  format current_date date9.;

  cards;

31JUL2010 1111  5 Kim Lee Personal Loan 10

31JUL2010 1111  1 Kim Lee Personal Loan 20

31JUL2010 1111 25 Kim Lee Personal Loan 30

31JUL2010 1111 75 Kim Lee Res. Mortgage 30

31JUL2010 1111  8 Kim Lee Personal OD   40

;

proc sql;

  create table unique as

    select name

      from dictionary.columns

        where libname="WORK" and

              memname="HAVE"

  ;

quit;

%let clength=%eval(32*&sqlobs.);

%macro doit;

  data temp (keep=comb comb2);

    array names{&sqlobs.} $32.;

    i=0;

    do until (eof1);  /*** load the array with the variable names ***/

      set unique end=eof1;

      i+1;

      names(i)=name;

    end;

    length comb $&clength.;

    %do i=1 %to &sqlobs.;

      ncomb=comb(&sqlobs.,&i.);

      do j=1 to ncomb;

        call lexcomb(j, &i., of names

  • );
  •         comb=catx(",",of names1-names&i.);

            comb2=catx(" ",of names1-names&i.);

            n+1;

            output;

          end;

        %end;

        call symput("nrecs",put(n,best12.));

      run;

      %do i=1 %to &nrecs.;

        data _null_;

          set temp (firstobs=&i. obs=&i.);

          call symput("testit",comb);

          call symput("keepit",comb2);

        run;

        proc sql noprint;

          create table test as

            select *,count(*)

              from have (keep=&keepit.)

                group by &testit.

                  having count(*) > 1

          ;

        quit;

        %if &sqlobs. eq 0 %then %do;

          %put &testit.;

          %goto exit;

        %end;

      %end;

    %exit: %mend doit;

    %doit

    Frequent Contributor
    Posts: 139

    Re: How to quickly identify variables that distinguish records uniquely

    A quick question before I try to answer.

    If your source data is in a RDMS like Oracle, DB2, Teradata, SQL server, you can query the system tables and find the index on the tables which are unique.  If not any of the answers above will work.

    Frequent Contributor
    Posts: 129

    Re: How to quickly identify variables that distinguish records uniquely

    Mirisage,

    Here is technique to do that.  This works well if your initial guess on keys is correct.  Also before proceeding I find it useful to do a logical analysis of the data.  My guess of the uniqueness key is

    1. current_date

    2. Acct_no

    3. product

    4. bank_number

    Here is the basic technique.

    1.  start with your 'guess' that product and bank_number should be the unique identifiers.

    2.  Run this sql step to get records which have duplicate keys based upon guess.

          proc sql ;

             create table dups as

             select *

                from a

             group by product, bank_number

             having count(*) > 1

             order by product, bank_number

            ;

        quit ;

    If the table dups has zero records, then product and bank_number are a unique composite key.

    If the table dups is not empty, then you can examine it determine columns which need to be included in the composite key.  Add identified columns to the group by and repeat.

    I suspect there are at least two other columns which may be needed for uniqueness -- current_date and Acct_no.  If those are needed, add them into to the group by clause.

     


    Super User
    Super User
    Posts: 6,500

    Re: How to quickly identify variables that distinguish records uniquely

    I find that PROC SUMMARY is very useful for this (as compared to PROC FREQ or SQL) because it can handle all permutations of your class variables automatically.

    data have;

    infile cards truncover;

    input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;

    list;

    format current_date date9.;

    cards;

    31JUL2010 1111  5 Kim Lee Personal Loan 10

    31JUL2010 1111  1 Kim Lee Personal Loan 20

    31JUL2010 1111 25 Kim Lee Personal Loan 30

    31JUL2010 1111 75 Kim Lee Res. Mortgage 30

    31JUL2010 1111  8 Kim Lee Personal OD   40

    ;

    run;

    proc summary data=have chartype missing;

      class _all_;

      output out=summary  / levels ways;

    run;

    proc sql noprint ;

      create table unique as

      select distinct _way_,_type_

      from summary

      group by _type_

      having max(_freq_)=1

      order by _way_,_type_

    ;

    quit;

    proc transpose data=have (obs=0) out=names;

      var _all_;

    run;

    data keys ;

      set unique ;

      length sortkey $200 ;

      do i=1 to length(_type_) ;

        if substr(_type_,i,1)='1' then do;

          set names point=i;

          sortkey=catx(' ',sortkey,_name_);

        end;

      end;

      drop _name_;

    run;

    Obs    _WAY_    _TYPE_    sortkey

      1      1      001000    balance

      2      2      000011    product bank_number

      3      2      001001    balance bank_number

      4      2      001010    balance product

      5      2      001100    balance name

      6      2      011000    acct_no balance

      7      2      101000    current_date balance

      8      3      000111    name product bank_number

      9      3      001011    balance product bank_number

    10      3      001101    balance name bank_number

    11      3      001110    balance name product

    12      3      010011    acct_no product bank_number

    13      3      011001    acct_no balance bank_number

    14      3      011010    acct_no balance product

    15      3      011100    acct_no balance name

    16      3      100011    current_date product bank_number

    17      3      101001    current_date balance bank_number

    18      3      101010    current_date balance product

    19      3      101100    current_date balance name

    20      3      111000    current_date acct_no balance

    21      4      001111    balance name product bank_number

    22      4      010111    acct_no name product bank_number

    23      4      011011    acct_no balance product bank_number

    24      4      011101    acct_no balance name bank_number

    25      4      011110    acct_no balance name product

    26      4      100111    current_date name product bank_number

    27      4      101011    current_date balance product bank_number

    28      4      101101    current_date balance name bank_number

    29      4      101110    current_date balance name product

    30      4      110011    current_date acct_no product bank_number

    31      4      111001    current_date acct_no balance bank_number

    32      4      111010    current_date acct_no balance product

    33      4      111100    current_date acct_no balance name

    34      5      011111    acct_no balance name product bank_number

    35      5      101111    current_date balance name product bank_number

    36      5      110111    current_date acct_no name product bank_number

    37      5      111011    current_date acct_no balance product bank_number

    38      5      111101    current_date acct_no balance name bank_number

    39      5      111110    current_date acct_no balance name product

    40      6      111111    current_date acct_no balance name product bank_number

    You can use the WAYS statement on PROC SUMMARY to reduce the combinations. 

    You can also eliminate the key combinations that just super sets of key combinations with fewer variables included (the _WAY_ variable from PROC SUMMARY).

    For example any combination that includes BALANCE can be eliminated because BALANCE is already a unique key.

    Super Contributor
    Posts: 338

    Re: How to quickly identify variables that distinguish records uniquely

    Hi FriedEgg, Tom, Ksharp, Art, darrylovia,  jaredp, LarryWorley,

    This is a tremendous new knowledge for me that I should digest piece by piece.

    I really, really appreciate all of your contributions.

    Warm regards

    Mirisage

    Ask a Question
    Discussion stats
    • 9 replies
    • 474 views
    • 2 likes
    • 8 in conversation