How to quickly identify variables that distinguish records uniquely

Mirisage · Posted 05-16-2012 12:31 PM

Hi Community,

I have a data set like below where I can easily say the variables that make the records unique are Product and Bank_number.

data a;

Informat current_date date9.;

input current_date Acct_no Balance Name $ 20-26 Product $ 28-41 Bank_number;

Format current_date date9.;

datalines;

31JUL2010 1111 5 Kim Lee Personal Loan 10

31JUL2010 1111 1 Kim Lee Personal Loan 20

31JUL2010 1111 25 Kim Lee Personal Loan 30

31JUL2010 1111 75 Kim Lee Res. Mortgage 30

31JUL2010 1111 8 Kim Lee Personal OD 40

;

run;

However, my real data set has over million records and 37 variables. I started manually selecting variables one by one and examining using trial and error process to find those distinguishing variables. But this takes me no where, it seems.

Q. Is there any method that I can use to identify the variable (or combination of variables) that would distinguish records uniquely.

Thank you for your help

Mirisage

FriedEgg · Posted 05-16-2012 01:24 PM

This would depend on a single variable providing the uniqueness instead of a combination of variables like acct_no and bank_number. With your example data the only item that is unique itself is balance. You could expand this to account for cross-tabs in the frequency table but I do not provide example for that.

data foo;

infile cards truncover;

input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;

list;

format current_date date9.;

cards;

31JUL2010 1111 5 Kim Lee Personal Loan 10

31JUL2010 1111 1 Kim Lee Personal Loan 20

31JUL2010 1111 25 Kim Lee Personal Loan 30

31JUL2010 1111 75 Kim Lee Res. Mortgage 30

31JUL2010 1111 8 Kim Lee Personal OD 40

;

run;

ods exclude all;

ods output OneWayFreqs=f(keep=table frequency);

proc freq data=foo;

tables _all_ / missing;

run;

ods exclude none;

proc sql;

select distinct scan(table,2) "Unique Variable"

into :u_var

from f

group by table

having frequency=1 and frequency=min(frequency) and frequency=max(frequency);

quit;

Tom · Posted 05-16-2012 01:37 PM

Depends on your definition of quick. You can get PROC SUMMARY to test all possible combinations and then find the ones that generate uniqueness. I will leave it as an exercise to translate the binary string _TYPE_ into variable names.

proc summary data=sashelp.class chartype missing;

class _all_;

output out=summary / levels ways;

run;

proc sql noprint ;

create table unique as

select distinct _way_,_type_

from summary

group by _type_

having max(_freq_)=1

order by _way_,_type_

;

quit;

Ksharp · Posted 05-17-2012 01:42 AM

That is really not easy. You should list all the combination of these variables . and

use proc freq or prc sql to check whether their frequency equal 1 .

Ksharp

jaredp · Posted 05-17-2012 11:59 AM

Ksharp is correct. It sounds like all you need to do is a Proc Freq on whichever variables you think are unique and ensure the count is never greater than 1. This will help you identify which variable is a unique identifier.

The problem with not having a unique identifier built into the process or system from the start is that how can you be sure a future transaction will make it so that your proc freq ends up having a count of greater than 1? You might do the proc freq today and think everything is fine. But is there potential for it to change tomorrow?

You may need to combine 2 fields into a new variable to better ensure future uniqueness. Nothing wrong with that...just be careful to think about as many scenarios as you can (i.e. choose your variables wisely).

art297 · Posted 05-17-2012 09:58 AM

Mirisage,

I really don't have time to either test or optimize the following, but it might at least give you an idea of how to solve your problem. Of course, there is always the chance that this actually does what you are trying to accomplish:

data have;

infile cards truncover;

input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;

format current_date date9.;

cards;

31JUL2010 1111 5 Kim Lee Personal Loan 10

31JUL2010 1111 1 Kim Lee Personal Loan 20

31JUL2010 1111 25 Kim Lee Personal Loan 30

31JUL2010 1111 75 Kim Lee Res. Mortgage 30

31JUL2010 1111 8 Kim Lee Personal OD 40

;

proc sql;

create table unique as

select name

from dictionary.columns

where libname="WORK" and

memname="HAVE"

;

quit;

%let clength=%eval(32*&sqlobs.);

%macro doit;

data temp (keep=comb comb2);

array names{&sqlobs.} $32.;

i=0;

do until (eof1); /*** load the array with the variable names ***/

set unique end=eof1;

i+1;

names(i)=name;

end;

length comb $&clength.;

%do i=1 %to &sqlobs.;

ncomb=comb(&sqlobs.,&i.);

do j=1 to ncomb;

call lexcomb(j, &i., of names

);

comb=catx(",",of names1-names&i.);

comb2=catx(" ",of names1-names&i.);

n+1;

output;

end;

%end;

call symput("nrecs",put(n,best12.));

run;

%do i=1 %to &nrecs.;

data _null_;

set temp (firstobs=&i. obs=&i.);

call symput("testit",comb);

call symput("keepit",comb2);

run;

proc sql noprint;

create table test as

select *,count(*)

from have (keep=&keepit.)

group by &testit.

having count(*) > 1

;

quit;

%if &sqlobs. eq 0 %then %do;

%put &testit.;

%goto exit;

%end;

%exit: %mend doit;

%doit

darrylovia · Posted 05-17-2012 10:19 AM

A quick question before I try to answer.

If your source data is in a RDMS like Oracle, DB2, Teradata, SQL server, you can query the system tables and find the index on the tables which are unique. If not any of the answers above will work.

LarryWorley · Posted 05-17-2012 12:13 PM

Mirisage,

Here is technique to do that. This works well if your initial guess on keys is correct. Also before proceeding I find it useful to do a logical analysis of the data. My guess of the uniqueness key is

1. current_date

2. Acct_no

3. product

4. bank_number

Here is the basic technique.

1. start with your 'guess' that product and bank_number should be the unique identifiers.

2. Run this sql step to get records which have duplicate keys based upon guess.

proc sql ;

create table dups as

select *

from a

group by product, bank_number

having count(*) > 1

order by product, bank_number

;

quit ;

If the table dups has zero records, then product and bank_number are a unique composite key.

If the table dups is not empty, then you can examine it determine columns which need to be included in the composite key. Add identified columns to the group by and repeat.

I suspect there are at least two other columns which may be needed for uniqueness -- current_date and Acct_no. If those are needed, add them into to the group by clause.

Tom · Posted 05-17-2012 12:25 PM

I find that PROC SUMMARY is very useful for this (as compared to PROC FREQ or SQL) because it can handle all permutations of your class variables automatically.

data have;

infile cards truncover;

input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;

list;

format current_date date9.;

cards;

31JUL2010 1111 5 Kim Lee Personal Loan 10

31JUL2010 1111 1 Kim Lee Personal Loan 20

31JUL2010 1111 25 Kim Lee Personal Loan 30

31JUL2010 1111 75 Kim Lee Res. Mortgage 30

31JUL2010 1111 8 Kim Lee Personal OD 40

;

run;

proc summary data=have chartype missing;

class _all_;

output out=summary / levels ways;

run;

proc sql noprint ;

create table unique as

select distinct _way_,_type_

from summary

group by _type_

having max(_freq_)=1

order by _way_,_type_

;

quit;

proc transpose data=have (obs=0) out=names;

var _all_;

run;

data keys ;

set unique ;

length sortkey $200 ;

do i=1 to length(_type_) ;

if substr(_type_,i,1)='1' then do;

set names point=i;

sortkey=catx(' ',sortkey,_name_);

end;

drop _name_;

run;

Obs _WAY_ _TYPE_ sortkey

1 1 001000 balance

2 2 000011 product bank_number

3 2 001001 balance bank_number

4 2 001010 balance product

5 2 001100 balance name

6 2 011000 acct_no balance

7 2 101000 current_date balance

8 3 000111 name product bank_number

9 3 001011 balance product bank_number

10 3 001101 balance name bank_number

11 3 001110 balance name product

12 3 010011 acct_no product bank_number

13 3 011001 acct_no balance bank_number

14 3 011010 acct_no balance product

15 3 011100 acct_no balance name

16 3 100011 current_date product bank_number

17 3 101001 current_date balance bank_number

18 3 101010 current_date balance product

19 3 101100 current_date balance name

20 3 111000 current_date acct_no balance

21 4 001111 balance name product bank_number

22 4 010111 acct_no name product bank_number

23 4 011011 acct_no balance product bank_number

24 4 011101 acct_no balance name bank_number

25 4 011110 acct_no balance name product

26 4 100111 current_date name product bank_number

27 4 101011 current_date balance product bank_number

28 4 101101 current_date balance name bank_number

29 4 101110 current_date balance name product

30 4 110011 current_date acct_no product bank_number

31 4 111001 current_date acct_no balance bank_number

32 4 111010 current_date acct_no balance product

33 4 111100 current_date acct_no balance name

34 5 011111 acct_no balance name product bank_number

35 5 101111 current_date balance name product bank_number

36 5 110111 current_date acct_no name product bank_number

37 5 111011 current_date acct_no balance product bank_number

38 5 111101 current_date acct_no balance name bank_number

39 5 111110 current_date acct_no balance name product

40 6 111111 current_date acct_no balance name product bank_number

You can use the WAYS statement on PROC SUMMARY to reduce the combinations.

You can also eliminate the key combinations that just super sets of key combinations with fewer variables included (the _WAY_ variable from PROC SUMMARY).

For example any combination that includes BALANCE can be eliminated because BALANCE is already a unique key.

Mirisage · Posted 05-17-2012 05:57 PM

Hi FriedEgg, Tom, Ksharp, Art, darrylovia, jaredp, LarryWorley,

This is a tremendous new knowledge for me that I should digest piece by piece.

I really, really appreciate all of your contributions.

Warm regards

Mirisage

How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

Re: How to quickly identify variables that distinguish records uniquely

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away