Help using Base SAS procedures

How to identify variable (or combination of variables) that would uniquely identify a record?

Accepted Solution Solved
Reply
Super Contributor
Posts: 338
Accepted Solution

How to identify variable (or combination of variables) that would uniquely identify a record?

Hi SAS Community,


I have a data set like this.

data a;
Informat current_date date9.;
input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41
Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111  25 Kim Lee Personal Loan 30
31JUL2010 1111  75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;
run;
 


I want to identify a variable or more that would uniquely identify a
record. 


Tom has generously provided me with the very valauble code below. When you
have run the complete set of codes, it will do the job.
 

In the code below, you can keep increasing "ways 1" number until you'll
get variable (or variables) that would uniquely identify a record.
 


proc summary data=a chartype missing;
  class _all_;
  output out=summary  / levels ways;
  ways 1; /*you can keep increasing this number until you'll get variable
(or variables) that would uniquely identify a record */
run;

proc sql noprint ;
  create table unique as
  select distinct _way_,_type_
  from summary
  group by _type_
  having max(_freq_)=1
  order by _way_,_type_
;
quit;
 

proc transpose data=a (obs=0) out=names;
  var _all_;
run;
 


data keys ;
  set unique ;
  length sortkey $200 ;
  do i=1 to length(_type_) ;
    if substr(_type_,i,1)='1' then do;
      set names point=i;
      sortkey=catx(' ',sortkey,_name_);
    end;
  end;
  drop _name_;
run;
 

Proc freq data=keys;
  tables sortkey;
run;
 


Question:

  • I put "ways 1" in the First "proc summary" code above.
     
  • Then ran the entire code pieces on my almost million data set
    which has 37 variables.
  • Then code ran smoothly and finally said "there is no single unique
    variable".
     
  • Then I put "ways 2" and re-ran. Then code ran smoothly and finally
    said "there are no two variables taht would uniquely identify a record".
     
  • Then I put "ways 3" and re-ran. Then SAS got crashed.

  • So, now I cannot proceed to identidy the unique vairiables in my dataset.

  • I wonder if there is a remedy for this or an alternative way of
    identifying unique record/s that could handle million dataset with 37
    variables?
     

Thank you
Mirisage


Accepted Solutions
Solution
‎08-10-2012 10:34 PM
Respected Advisor
Posts: 4,919

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Building on my previous suggestion and SGB idea, you could find the best variable combination candidates using this :

data A;
informat current_date date9.;
input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111  25 Kim Lee Personal Loan 30
31JUL2010 1111  75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;

proc freq data=A nlevels;
   tables current_date Acct_no Balance  Name Product Bank_number /* ... */ / noprint;
   ods output NLevels=nlev;
run;

proc sql noprint;
select quote(trim(tableVar)) into :tableVars separated by ","  from nlev;
select NLevels into :NLevels separated by "," from nlev;
select count(*) into :Nvars from nlev;
select count(*) into :nobs from A;
quit;

%let combs=3;

data comb(keep=var: comb);
array tableVar{&Nvars} $16 (&tableVars);
array NLevel{&Nvars} (&NLevels);
array II{&combs};
array var{&combs} $16 ;

do i = 1 to comb(&Nvars, &combs);
     call allcombi(&Nvars,&combs,of II{*});
     comb = 1;
     do j = 1 to &combs;
          var{j} = tableVar{II{j}};
          comb = comb * NLevel{II{j}};
          end;
     if comb >= &nobs then output;
     end;
run;

proc sql outobs=10; /* Limit the size of output list */
select * from comb order by comb desc;
quit;

PG

PG

View solution in original post


All Replies
Contributor SGB
Contributor
Posts: 41

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

data a;
informat current_date date9.;
input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111  25 Kim Lee Personal Loan 30
31JUL2010 1111  75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;
run;

Can you try the following method..


proc freq data=a nlevels;
   tables current_date Acct_no Balance  Name Product Bank_number; * Here you need to add all the variables;
run;

The variable balance has 5 levels.. you have five records.

The output will look like this

current_date1
Acct_no1
Balance5
Name1
Product3
Bank_number4
Super User
Posts: 11,338

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Is there a reason you cannot use something like _n_ to add a new variable as an id?

Respected Advisor
Posts: 4,919

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

You realize that to find a set of three variables with the property that you want, among an ensemble of 37 variables, there are 7770 combinations to check? And that each check amounts to something like a sort?

Maybe you can focus on a smaller number of variable combinations by noticing that the product of the level counts suggested by SGB above for a combination of variables must be greater or equal to the number of observations in your dataset, otherwise that combination cannot identify uniquely all obs.

PG

PG
Solution
‎08-10-2012 10:34 PM
Respected Advisor
Posts: 4,919

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Building on my previous suggestion and SGB idea, you could find the best variable combination candidates using this :

data A;
informat current_date date9.;
input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111  25 Kim Lee Personal Loan 30
31JUL2010 1111  75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;

proc freq data=A nlevels;
   tables current_date Acct_no Balance  Name Product Bank_number /* ... */ / noprint;
   ods output NLevels=nlev;
run;

proc sql noprint;
select quote(trim(tableVar)) into :tableVars separated by ","  from nlev;
select NLevels into :NLevels separated by "," from nlev;
select count(*) into :Nvars from nlev;
select count(*) into :nobs from A;
quit;

%let combs=3;

data comb(keep=var: comb);
array tableVar{&Nvars} $16 (&tableVars);
array NLevel{&Nvars} (&NLevels);
array II{&combs};
array var{&combs} $16 ;

do i = 1 to comb(&Nvars, &combs);
     call allcombi(&Nvars,&combs,of II{*});
     comb = 1;
     do j = 1 to &combs;
          var{j} = tableVar{II{j}};
          comb = comb * NLevel{II{j}};
          end;
     if comb >= &nobs then output;
     end;
run;

proc sql outobs=10; /* Limit the size of output list */
select * from comb order by comb desc;
quit;

PG

PG
Respected Advisor
Posts: 4,173

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

I'm very much with "ballardw" that you should create your own ID - and this could just be a simple data step with a line of command like:    ID=_N_;

You might find a unique combination of variables with an algorithm - but this combination of variables might then only be valid for exactly the set of data you've derived it from.

You should know your data and try to figure out on a conceptual level which combination of variables MUST be unique. This would also allow you to validate your actual data (data quality issues?). If there is no unique combination (=primary composite key or alternate key or natural key) then creating a surrogate key is may be a very good idea (eg. via ID=_N_Smiley Wink.

PROC Star
Posts: 7,467

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

I haven't heard anyone ask it yet, so I will: why do you want to uniquely identify each record (beyond the fact that they already are physically uniquely identifiable)?  Wouldn't you be better off having an id field that accurately captured all of just one person's records?

Super Contributor
Posts: 338

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Hi SGB, Ballardw, PGStats, Patrick and Art,

Thank you very much for every one of you for putting your valuable time on this.

All responses shed me intellectual lights in different aspects that I haven’t even thought of. I am still too immature to understand some.

I had data for 9 banks.

Bank_number,  Account_number   and Current_date   made a unique record for 8 banks out of 9. For one bank, I could not find what is the combination of variables that would make a unique record. Now I am almost there to find it.

Thank again for these intellectual responses.

Best regards

Mirisage

Respected Advisor
Posts: 4,173

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Unfortunately there is not a world wide standard for bank account numbers. There are standards like IBAN used in a lot of European countries or SWIFT/BIC codes for international funds transfers.

A lot of banks explain on their website what kind of account numbers they are using and how they are constructed. It might be worth to check it for "your" banks as it could help you to determine the primary key for your data.

For example for your 9th bank: It could be that the branch number is needed as well {bank, branch, account, data}.

Super Contributor
Posts: 338

Re: How to identify variable (or combination of variables) that would uniquely identify a record?

Hi Patrick,

You are right.

I have been just mechanically striving to create whatever variable combination that would make a unique record which alone wouldn't work.

I should contact the relevant bank and ask.

I learned a lot through this discussion.

Thanks again!

Mirisage

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 802 views
  • 6 likes
  • 6 in conversation