BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Mirisage
Obsidian | Level 7

Hi SAS Community,


I have a data set like this.

data a;
Informat current_date date9.;
input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41
Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111  25 Kim Lee Personal Loan 30
31JUL2010 1111  75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;
run;
 


I want to identify a variable or more that would uniquely identify a
record. 


Tom has generously provided me with the very valauble code below. When you
have run the complete set of codes, it will do the job.
 

In the code below, you can keep increasing "ways 1" number until you'll
get variable (or variables) that would uniquely identify a record.
 


proc summary data=a chartype missing;
  class _all_;
  output out=summary  / levels ways;
  ways 1; /*you can keep increasing this number until you'll get variable
(or variables) that would uniquely identify a record */
run;

proc sql noprint ;
  create table unique as
  select distinct _way_,_type_
  from summary
  group by _type_
  having max(_freq_)=1
  order by _way_,_type_
;
quit;
 

proc transpose data=a (obs=0) out=names;
  var _all_;
run;
 


data keys ;
  set unique ;
  length sortkey $200 ;
  do i=1 to length(_type_) ;
    if substr(_type_,i,1)='1' then do;
      set names point=i;
      sortkey=catx(' ',sortkey,_name_);
    end;
  end;
  drop _name_;
run;
 

Proc freq data=keys;
  tables sortkey;
run;
 


Question:

  • I put "ways 1" in the First "proc summary" code above.
     
  • Then ran the entire code pieces on my almost million data set
    which has 37 variables.
  • Then code ran smoothly and finally said "there is no single unique
    variable".
     
  • Then I put "ways 2" and re-ran. Then code ran smoothly and finally
    said "there are no two variables taht would uniquely identify a record".
     
  • Then I put "ways 3" and re-ran. Then SAS got crashed.

  • So, now I cannot proceed to identidy the unique vairiables in my dataset.

  • I wonder if there is a remedy for this or an alternative way of
    identifying unique record/s that could handle million dataset with 37
    variables?
     

Thank you
Mirisage

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

Building on my previous suggestion and SGB idea, you could find the best variable combination candidates using this :

data A;
informat current_date date9.;
input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111  25 Kim Lee Personal Loan 30
31JUL2010 1111  75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;

proc freq data=A nlevels;
   tables current_date Acct_no Balance  Name Product Bank_number /* ... */ / noprint;
   ods output NLevels=nlev;
run;

proc sql noprint;
select quote(trim(tableVar)) into :tableVars separated by ","  from nlev;
select NLevels into :NLevels separated by "," from nlev;
select count(*) into :Nvars from nlev;
select count(*) into :nobs from A;
quit;

%let combs=3;

data comb(keep=var: comb);
array tableVar{&Nvars} $16 (&tableVars);
array NLevel{&Nvars} (&NLevels);
array II{&combs};
array var{&combs} $16 ;

do i = 1 to comb(&Nvars, &combs);
     call allcombi(&Nvars,&combs,of II{*});
     comb = 1;
     do j = 1 to &combs;
          var{j} = tableVar{II{j}};
          comb = comb * NLevel{II{j}};
          end;
     if comb >= &nobs then output;
     end;
run;

proc sql outobs=10; /* Limit the size of output list */
select * from comb order by comb desc;
quit;

PG

PG

View solution in original post

9 REPLIES 9
SGB
Obsidian | Level 7 SGB
Obsidian | Level 7

data a;
informat current_date date9.;
input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111  25 Kim Lee Personal Loan 30
31JUL2010 1111  75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;
run;

Can you try the following method..


proc freq data=a nlevels;
   tables current_date Acct_no Balance  Name Product Bank_number; * Here you need to add all the variables;
run;

The variable balance has 5 levels.. you have five records.

The output will look like this

current_date1
Acct_no1
Balance5
Name1
Product3
Bank_number4
ballardw
Super User

Is there a reason you cannot use something like _n_ to add a new variable as an id?

PGStats
Opal | Level 21

You realize that to find a set of three variables with the property that you want, among an ensemble of 37 variables, there are 7770 combinations to check? And that each check amounts to something like a sort?

Maybe you can focus on a smaller number of variable combinations by noticing that the product of the level counts suggested by SGB above for a combination of variables must be greater or equal to the number of observations in your dataset, otherwise that combination cannot identify uniquely all obs.

PG

PG
PGStats
Opal | Level 21

Building on my previous suggestion and SGB idea, you could find the best variable combination candidates using this :

data A;
informat current_date date9.;
input current_date Acct_no Balance  Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111   5 Kim Lee Personal Loan 10
31JUL2010 1111   1 Kim Lee Personal Loan 20
31JUL2010 1111  25 Kim Lee Personal Loan 30
31JUL2010 1111  75 Kim Lee Res. Mortgage 30
31JUL2010 1111   8 Kim Lee Personal OD   40
;

proc freq data=A nlevels;
   tables current_date Acct_no Balance  Name Product Bank_number /* ... */ / noprint;
   ods output NLevels=nlev;
run;

proc sql noprint;
select quote(trim(tableVar)) into :tableVars separated by ","  from nlev;
select NLevels into :NLevels separated by "," from nlev;
select count(*) into :Nvars from nlev;
select count(*) into :nobs from A;
quit;

%let combs=3;

data comb(keep=var: comb);
array tableVar{&Nvars} $16 (&tableVars);
array NLevel{&Nvars} (&NLevels);
array II{&combs};
array var{&combs} $16 ;

do i = 1 to comb(&Nvars, &combs);
     call allcombi(&Nvars,&combs,of II{*});
     comb = 1;
     do j = 1 to &combs;
          var{j} = tableVar{II{j}};
          comb = comb * NLevel{II{j}};
          end;
     if comb >= &nobs then output;
     end;
run;

proc sql outobs=10; /* Limit the size of output list */
select * from comb order by comb desc;
quit;

PG

PG
Patrick
Opal | Level 21

I'm very much with "ballardw" that you should create your own ID - and this could just be a simple data step with a line of command like:    ID=_N_;

You might find a unique combination of variables with an algorithm - but this combination of variables might then only be valid for exactly the set of data you've derived it from.

You should know your data and try to figure out on a conceptual level which combination of variables MUST be unique. This would also allow you to validate your actual data (data quality issues?). If there is no unique combination (=primary composite key or alternate key or natural key) then creating a surrogate key is may be a very good idea (eg. via ID=_N_;).

art297
Opal | Level 21

I haven't heard anyone ask it yet, so I will: why do you want to uniquely identify each record (beyond the fact that they already are physically uniquely identifiable)?  Wouldn't you be better off having an id field that accurately captured all of just one person's records?

Mirisage
Obsidian | Level 7

Hi SGB, Ballardw, PGStats, Patrick and Art,

Thank you very much for every one of you for putting your valuable time on this.

All responses shed me intellectual lights in different aspects that I haven’t even thought of. I am still too immature to understand some.

I had data for 9 banks.

Bank_number,  Account_number   and Current_date   made a unique record for 8 banks out of 9. For one bank, I could not find what is the combination of variables that would make a unique record. Now I am almost there to find it.

Thank again for these intellectual responses.

Best regards

Mirisage

Patrick
Opal | Level 21

Unfortunately there is not a world wide standard for bank account numbers. There are standards like IBAN used in a lot of European countries or SWIFT/BIC codes for international funds transfers.

A lot of banks explain on their website what kind of account numbers they are using and how they are constructed. It might be worth to check it for "your" banks as it could help you to determine the primary key for your data.

For example for your 9th bank: It could be that the branch number is needed as well {bank, branch, account, data}.

Mirisage
Obsidian | Level 7

Hi Patrick,

You are right.

I have been just mechanically striving to create whatever variable combination that would make a unique record which alone wouldn't work.

I should contact the relevant bank and ask.

I learned a lot through this discussion.

Thanks again!

Mirisage

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 3365 views
  • 6 likes
  • 6 in conversation