Solved: How to identify variable (or combination of variables) that would uniq...

Mirisage · Posted 08-10-2012 03:26 PM

Hi SAS Community,

I have a data set like this.

data a;
Informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41
Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;
run;

I want to identify a variable or more that would uniquely identify a
record.

Tom has generously provided me with the very valauble code below. When you
have run the complete set of codes, it will do the job.

In the code below, you can keep increasing "ways 1" number until you'll
get variable (or variables) that would uniquely identify a record.

proc summary data=a chartype missing;
class _all_;
output out=summary / levels ways;
ways 1; /*you can keep increasing this number until you'll get variable
(or variables) that would uniquely identify a record */
run;

proc sql noprint ;
create table unique as
select distinct _way_,_type_
from summary
group by _type_
having max(_freq_)=1
order by _way_,_type_
;
quit;

proc transpose data=a (obs=0) out=names;
var _all_;
run;

data keys ;
set unique ;
length sortkey $200 ;
do i=1 to length(_type_) ;
    if substr(_type_,i,1)='1' then do;
      set names point=i;
      sortkey=catx(' ',sortkey,_name_);
    end;
end;
drop _name_;
run;

Proc freq data=keys;
tables sortkey;
run;

Question:

I put "ways 1" in the First "proc summary" code above.
Then ran the entire code pieces on my almost million data set
which has 37 variables.
Then code ran smoothly and finally said "there is no single unique
variable".
Then I put "ways 2" and re-ran. Then code ran smoothly and finally
said "there are no two variables taht would uniquely identify a record".
Then I put "ways 3" and re-ran. Then SAS got crashed.

So, now I cannot proceed to identidy the unique vairiables in my dataset.

I wonder if there is a remedy for this or an alternative way of
identifying unique record/s that could handle million dataset with 37
variables?

Thank you
Mirisage

PGStats · Posted 08-10-2012 10:34 PM

Building on my previous suggestion and SGB idea, you could find the best variable combination candidates using this :

data A;
informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;

proc freq data=A nlevels;
tables current_date Acct_no Balance Name Product Bank_number /* ... */ / noprint;
ods output NLevels=nlev;
run;

proc sql noprint;
select quote(trim(tableVar)) into :tableVars separated by "," from nlev;
select NLevels into :NLevels separated by "," from nlev;
select count(*) into :Nvars from nlev;
select count(*) into :nobs from A;
quit;

%let combs=3;

data comb(keep=var: comb);
array tableVar{&Nvars} $16 (&tableVars);
array NLevel{&Nvars} (&NLevels);
array II{&combs};
array var{&combs} $16 ;

do i = 1 to comb(&Nvars, &combs);
     call allcombi(&Nvars,&combs,of II{*});
     comb = 1;
     do j = 1 to &combs;
          var{j} = tableVar{II{j}};
          comb = comb * NLevel{II{j}};
          end;
     if comb >= &nobs then output;
     end;
run;

proc sql outobs=10; /* Limit the size of output list */
select * from comb order by comb desc;
quit;

PG

View solution in original post

SGB · Posted 08-10-2012 04:59 PM

data a;
informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;
run;

Can you try the following method..

proc freq data=a nlevels;
tables current_date Acct_no Balance Name Product Bank_number; * Here you need to add all the variables;
run;

The variable balance has 5 levels.. you have five records.

The output will look like this

current_date	1
Acct_no	1
Balance	5
Name	1
Product	3
Bank_number	4

ballardw · Posted 08-10-2012 05:10 PM

Is there a reason you cannot use something like _n_ to add a new variable as an id?

PGStats · Posted 08-10-2012 09:11 PM

You realize that to find a set of three variables with the property that you want, among an ensemble of 37 variables, there are 7770 combinations to check? And that each check amounts to something like a sort?

Maybe you can focus on a smaller number of variable combinations by noticing that the product of the level counts suggested by SGB above for a combination of variables must be greater or equal to the number of observations in your dataset, otherwise that combination cannot identify uniquely all obs.

PG

PGStats · Posted 08-10-2012 10:34 PM

Building on my previous suggestion and SGB idea, you could find the best variable combination candidates using this :

data A;
informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;

proc freq data=A nlevels;
tables current_date Acct_no Balance Name Product Bank_number /* ... */ / noprint;
ods output NLevels=nlev;
run;

proc sql noprint;
select quote(trim(tableVar)) into :tableVars separated by "," from nlev;
select NLevels into :NLevels separated by "," from nlev;
select count(*) into :Nvars from nlev;
select count(*) into :nobs from A;
quit;

%let combs=3;

data comb(keep=var: comb);
array tableVar{&Nvars} $16 (&tableVars);
array NLevel{&Nvars} (&NLevels);
array II{&combs};
array var{&combs} $16 ;

do i = 1 to comb(&Nvars, &combs);
     call allcombi(&Nvars,&combs,of II{*});
     comb = 1;
     do j = 1 to &combs;
          var{j} = tableVar{II{j}};
          comb = comb * NLevel{II{j}};
          end;
     if comb >= &nobs then output;
     end;
run;

proc sql outobs=10; /* Limit the size of output list */
select * from comb order by comb desc;
quit;

PG

Patrick · Posted 08-11-2012 04:53 AM

I'm very much with "ballardw" that you should create your own ID - and this could just be a simple data step with a line of command like: ID=_N_;

You might find a unique combination of variables with an algorithm - but this combination of variables might then only be valid for exactly the set of data you've derived it from.

You should know your data and try to figure out on a conceptual level which combination of variables MUST be unique. This would also allow you to validate your actual data (data quality issues?). If there is no unique combination (=primary composite key or alternate key or natural key) then creating a surrogate key is may be a very good idea (eg. via ID=_N_;).

art297 · Posted 08-11-2012 09:06 AM

I haven't heard anyone ask it yet, so I will: why do you want to uniquely identify each record (beyond the fact that they already are physically uniquely identifiable)? Wouldn't you be better off having an id field that accurately captured all of just one person's records?

Mirisage · Posted 08-11-2012 09:02 PM

Hi SGB, Ballardw, PGStats, Patrick and Art,

Thank you very much for every one of you for putting your valuable time on this.

All responses shed me intellectual lights in different aspects that I haven’t even thought of. I am still too immature to understand some.

I had data for 9 banks.

Bank_number, Account_number and Current_date made a unique record for 8 banks out of 9. For one bank, I could not find what is the combination of variables that would make a unique record. Now I am almost there to find it.

Thank again for these intellectual responses.

Best regards

Mirisage

Patrick · Posted 08-11-2012 09:57 PM

Unfortunately there is not a world wide standard for bank account numbers. There are standards like IBAN used in a lot of European countries or SWIFT/BIC codes for international funds transfers.

A lot of banks explain on their website what kind of account numbers they are using and how they are constructed. It might be worth to check it for "your" banks as it could help you to determine the primary key for your data.

For example for your 9th bank: It could be that the branch number is needed as well {bank, branch, account, data}.

Mirisage · Posted 08-12-2012 08:45 AM

Hi Patrick,

You are right.

I have been just mechanically striving to create whatever variable combination that would make a unique record which alone wouldn't work.

I should contact the relevant bank and ask.

I learned a lot through this discussion.

Thanks again!

Mirisage

How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away