Hi SAS Community,
I have a data set like this.
data a;
Informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41
Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111 5 Kim Lee Personal Loan 10
31JUL2010 1111 1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111 8 Kim Lee Personal OD 40
;
run;
I want to identify a variable or more that would uniquely identify a
record.
Tom has generously provided me with the very valauble code below. When you
have run the complete set of codes, it will do the job.
In the code below, you can keep increasing "ways 1" number until you'll
get variable (or variables) that would uniquely identify a record.
proc summary data=a chartype missing;
class _all_;
output out=summary / levels ways;
ways 1; /*you can keep increasing this number until you'll get variable
(or variables) that would uniquely identify a record */
run;
proc sql noprint ;
create table unique as
select distinct _way_,_type_
from summary
group by _type_
having max(_freq_)=1
order by _way_,_type_
;
quit;
proc transpose data=a (obs=0) out=names;
var _all_;
run;
data keys ;
set unique ;
length sortkey $200 ;
do i=1 to length(_type_) ;
if substr(_type_,i,1)='1' then do;
set names point=i;
sortkey=catx(' ',sortkey,_name_);
end;
end;
drop _name_;
run;
Proc freq data=keys;
tables sortkey;
run;
Question:
Thank you
Mirisage
Building on my previous suggestion and SGB idea, you could find the best variable combination candidates using this :
data A;
informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111 5 Kim Lee Personal Loan 10
31JUL2010 1111 1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111 8 Kim Lee Personal OD 40
;
proc freq data=A nlevels;
tables current_date Acct_no Balance Name Product Bank_number /* ... */ / noprint;
ods output NLevels=nlev;
run;
proc sql noprint;
select quote(trim(tableVar)) into :tableVars separated by "," from nlev;
select NLevels into :NLevels separated by "," from nlev;
select count(*) into :Nvars from nlev;
select count(*) into :nobs from A;
quit;
%let combs=3;
data comb(keep=var: comb);
array tableVar{&Nvars} $16 (&tableVars);
array NLevel{&Nvars} (&NLevels);
array II{&combs};
array var{&combs} $16 ;
do i = 1 to comb(&Nvars, &combs);
call allcombi(&Nvars,&combs,of II{*});
comb = 1;
do j = 1 to &combs;
var{j} = tableVar{II{j}};
comb = comb * NLevel{II{j}};
end;
if comb >= &nobs then output;
end;
run;
proc sql outobs=10; /* Limit the size of output list */
select * from comb order by comb desc;
quit;
PG
data a;
informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111 5 Kim Lee Personal Loan 10
31JUL2010 1111 1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111 8 Kim Lee Personal OD 40
;
run;
Can you try the following method..
proc freq data=a nlevels;
tables current_date Acct_no Balance Name Product Bank_number; * Here you need to add all the variables;
run;
The variable balance has 5 levels.. you have five records.
The output will look like this
current_date | 1 |
---|---|
Acct_no | 1 |
Balance | 5 |
Name | 1 |
Product | 3 |
Bank_number | 4 |
Is there a reason you cannot use something like _n_ to add a new variable as an id?
You realize that to find a set of three variables with the property that you want, among an ensemble of 37 variables, there are 7770 combinations to check? And that each check amounts to something like a sort?
Maybe you can focus on a smaller number of variable combinations by noticing that the product of the level counts suggested by SGB above for a combination of variables must be greater or equal to the number of observations in your dataset, otherwise that combination cannot identify uniquely all obs.
PG
Building on my previous suggestion and SGB idea, you could find the best variable combination candidates using this :
data A;
informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111 5 Kim Lee Personal Loan 10
31JUL2010 1111 1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111 8 Kim Lee Personal OD 40
;
proc freq data=A nlevels;
tables current_date Acct_no Balance Name Product Bank_number /* ... */ / noprint;
ods output NLevels=nlev;
run;
proc sql noprint;
select quote(trim(tableVar)) into :tableVars separated by "," from nlev;
select NLevels into :NLevels separated by "," from nlev;
select count(*) into :Nvars from nlev;
select count(*) into :nobs from A;
quit;
%let combs=3;
data comb(keep=var: comb);
array tableVar{&Nvars} $16 (&tableVars);
array NLevel{&Nvars} (&NLevels);
array II{&combs};
array var{&combs} $16 ;
do i = 1 to comb(&Nvars, &combs);
call allcombi(&Nvars,&combs,of II{*});
comb = 1;
do j = 1 to &combs;
var{j} = tableVar{II{j}};
comb = comb * NLevel{II{j}};
end;
if comb >= &nobs then output;
end;
run;
proc sql outobs=10; /* Limit the size of output list */
select * from comb order by comb desc;
quit;
PG
I'm very much with "ballardw" that you should create your own ID - and this could just be a simple data step with a line of command like: ID=_N_;
You might find a unique combination of variables with an algorithm - but this combination of variables might then only be valid for exactly the set of data you've derived it from.
You should know your data and try to figure out on a conceptual level which combination of variables MUST be unique. This would also allow you to validate your actual data (data quality issues?). If there is no unique combination (=primary composite key or alternate key or natural key) then creating a surrogate key is may be a very good idea (eg. via ID=_N_;).
I haven't heard anyone ask it yet, so I will: why do you want to uniquely identify each record (beyond the fact that they already are physically uniquely identifiable)? Wouldn't you be better off having an id field that accurately captured all of just one person's records?
Hi SGB, Ballardw, PGStats, Patrick and Art,
Thank you very much for every one of you for putting your valuable time on this.
All responses shed me intellectual lights in different aspects that I haven’t even thought of. I am still too immature to understand some.
I had data for 9 banks.
Bank_number, Account_number and Current_date made a unique record for 8 banks out of 9. For one bank, I could not find what is the combination of variables that would make a unique record. Now I am almost there to find it.
Thank again for these intellectual responses.
Best regards
Mirisage
Unfortunately there is not a world wide standard for bank account numbers. There are standards like IBAN used in a lot of European countries or SWIFT/BIC codes for international funds transfers.
A lot of banks explain on their website what kind of account numbers they are using and how they are constructed. It might be worth to check it for "your" banks as it could help you to determine the primary key for your data.
For example for your 9th bank: It could be that the branch number is needed as well {bank, branch, account, data}.
Hi Patrick,
You are right.
I have been just mechanically striving to create whatever variable combination that would make a unique record which alone wouldn't work.
I should contact the relevant bank and ask.
I learned a lot through this discussion.
Thanks again!
Mirisage
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.