Hi Community,
I have a data set like below where I can easily say the variables that make the records unique are Product and Bank_number.
data a;
Informat current_date date9.;
input current_date Acct_no Balance Name $ 20-26 Product $ 28-41 Bank_number;
Format current_date date9.;
datalines;
31JUL2010 1111 5 Kim Lee Personal Loan 10
31JUL2010 1111 1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111 8 Kim Lee Personal OD 40
;
run;
However, my real data set has over million records and 37 variables. I started manually selecting variables one by one and examining using trial and error process to find those distinguishing variables. But this takes me no where, it seems.
Q. Is there any method that I can use to identify the variable (or combination of variables) that would distinguish records uniquely.
Thank you for your help
Mirisage
This would depend on a single variable providing the uniqueness instead of a combination of variables like acct_no and bank_number. With your example data the only item that is unique itself is balance. You could expand this to account for cross-tabs in the frequency table but I do not provide example for that.
data foo;
infile cards truncover;
input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;
list;
format current_date date9.;
cards;
31JUL2010 1111 5 Kim Lee Personal Loan 10
31JUL2010 1111 1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111 8 Kim Lee Personal OD 40
;
run;
ods exclude all;
ods output OneWayFreqs=f(keep=table frequency);
proc freq data=foo;
tables _all_ / missing;
run;
ods exclude none;
proc sql;
select distinct scan(table,2) "Unique Variable"
into :u_var
from f
group by table
having frequency=1 and frequency=min(frequency) and frequency=max(frequency);
quit;
Depends on your definition of quick. You can get PROC SUMMARY to test all possible combinations and then find the ones that generate uniqueness. I will leave it as an exercise to translate the binary string _TYPE_ into variable names.
proc summary data=sashelp.class chartype missing;
class _all_;
output out=summary / levels ways;
run;
proc sql noprint ;
create table unique as
select distinct _way_,_type_
from summary
group by _type_
having max(_freq_)=1
order by _way_,_type_
;
quit;
That is really not easy. You should list all the combination of these variables . and
use proc freq or prc sql to check whether their frequency equal 1 .
Ksharp
Ksharp is correct. It sounds like all you need to do is a Proc Freq on whichever variables you think are unique and ensure the count is never greater than 1. This will help you identify which variable is a unique identifier.
The problem with not having a unique identifier built into the process or system from the start is that how can you be sure a future transaction will make it so that your proc freq ends up having a count of greater than 1? You might do the proc freq today and think everything is fine. But is there potential for it to change tomorrow?
You may need to combine 2 fields into a new variable to better ensure future uniqueness. Nothing wrong with that...just be careful to think about as many scenarios as you can (i.e. choose your variables wisely).
Mirisage,
I really don't have time to either test or optimize the following, but it might at least give you an idea of how to solve your problem. Of course, there is always the chance that this actually does what you are trying to accomplish:
data have;
infile cards truncover;
input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;
format current_date date9.;
cards;
31JUL2010 1111 5 Kim Lee Personal Loan 10
31JUL2010 1111 1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111 8 Kim Lee Personal OD 40
;
proc sql;
create table unique as
select name
from dictionary.columns
where libname="WORK" and
memname="HAVE"
;
quit;
%let clength=%eval(32*&sqlobs.);
%macro doit;
data temp (keep=comb comb2);
array names{&sqlobs.} $32.;
i=0;
do until (eof1); /*** load the array with the variable names ***/
set unique end=eof1;
i+1;
names(i)=name;
end;
length comb $&clength.;
%do i=1 %to &sqlobs.;
ncomb=comb(&sqlobs.,&i.);
do j=1 to ncomb;
call lexcomb(j, &i., of names
comb=catx(",",of names1-names&i.);
comb2=catx(" ",of names1-names&i.);
n+1;
output;
end;
%end;
call symput("nrecs",put(n,best12.));
run;
%do i=1 %to &nrecs.;
data _null_;
set temp (firstobs=&i. obs=&i.);
call symput("testit",comb);
call symput("keepit",comb2);
run;
proc sql noprint;
create table test as
select *,count(*)
from have (keep=&keepit.)
group by &testit.
having count(*) > 1
;
quit;
%if &sqlobs. eq 0 %then %do;
%put &testit.;
%goto exit;
%end;
%end;
%exit: %mend doit;
%doit
A quick question before I try to answer.
If your source data is in a RDMS like Oracle, DB2, Teradata, SQL server, you can query the system tables and find the index on the tables which are unique. If not any of the answers above will work.
Mirisage,
Here is technique to do that. This works well if your initial guess on keys is correct. Also before proceeding I find it useful to do a logical analysis of the data. My guess of the uniqueness key is
1. current_date
2. Acct_no
3. product
4. bank_number
Here is the basic technique.
1. start with your 'guess' that product and bank_number should be the unique identifiers.
2. Run this sql step to get records which have duplicate keys based upon guess.
proc sql ;
create table dups as
select *
from a
group by product, bank_number
having count(*) > 1
order by product, bank_number
;
quit ;
If the table dups has zero records, then product and bank_number are a unique composite key.
If the table dups is not empty, then you can examine it determine columns which need to be included in the composite key. Add identified columns to the group by and repeat.
I suspect there are at least two other columns which may be needed for uniqueness -- current_date and Acct_no. If those are needed, add them into to the group by clause.
 
I find that PROC SUMMARY is very useful for this (as compared to PROC FREQ or SQL) because it can handle all permutations of your class variables automatically.
data have;
infile cards truncover;
input current_date date9. acct_no balance name $ 19-25 product $ 26-40 bank_number;
list;
format current_date date9.;
cards;
31JUL2010 1111 5 Kim Lee Personal Loan 10
31JUL2010 1111 1 Kim Lee Personal Loan 20
31JUL2010 1111 25 Kim Lee Personal Loan 30
31JUL2010 1111 75 Kim Lee Res. Mortgage 30
31JUL2010 1111 8 Kim Lee Personal OD 40
;
run;
proc summary data=have chartype missing;
class _all_;
output out=summary / levels ways;
run;
proc sql noprint ;
create table unique as
select distinct _way_,_type_
from summary
group by _type_
having max(_freq_)=1
order by _way_,_type_
;
quit;
proc transpose data=have (obs=0) out=names;
var _all_;
run;
data keys ;
set unique ;
length sortkey $200 ;
do i=1 to length(_type_) ;
if substr(_type_,i,1)='1' then do;
set names point=i;
sortkey=catx(' ',sortkey,_name_);
end;
end;
drop _name_;
run;
Obs _WAY_ _TYPE_ sortkey
1 1 001000 balance
2 2 000011 product bank_number
3 2 001001 balance bank_number
4 2 001010 balance product
5 2 001100 balance name
6 2 011000 acct_no balance
7 2 101000 current_date balance
8 3 000111 name product bank_number
9 3 001011 balance product bank_number
10 3 001101 balance name bank_number
11 3 001110 balance name product
12 3 010011 acct_no product bank_number
13 3 011001 acct_no balance bank_number
14 3 011010 acct_no balance product
15 3 011100 acct_no balance name
16 3 100011 current_date product bank_number
17 3 101001 current_date balance bank_number
18 3 101010 current_date balance product
19 3 101100 current_date balance name
20 3 111000 current_date acct_no balance
21 4 001111 balance name product bank_number
22 4 010111 acct_no name product bank_number
23 4 011011 acct_no balance product bank_number
24 4 011101 acct_no balance name bank_number
25 4 011110 acct_no balance name product
26 4 100111 current_date name product bank_number
27 4 101011 current_date balance product bank_number
28 4 101101 current_date balance name bank_number
29 4 101110 current_date balance name product
30 4 110011 current_date acct_no product bank_number
31 4 111001 current_date acct_no balance bank_number
32 4 111010 current_date acct_no balance product
33 4 111100 current_date acct_no balance name
34 5 011111 acct_no balance name product bank_number
35 5 101111 current_date balance name product bank_number
36 5 110111 current_date acct_no name product bank_number
37 5 111011 current_date acct_no balance product bank_number
38 5 111101 current_date acct_no balance name bank_number
39 5 111110 current_date acct_no balance name product
40 6 111111 current_date acct_no balance name product bank_number
You can use the WAYS statement on PROC SUMMARY to reduce the combinations.
You can also eliminate the key combinations that just super sets of key combinations with fewer variables included (the _WAY_ variable from PROC SUMMARY).
For example any combination that includes BALANCE can be eliminated because BALANCE is already a unique key.
Hi FriedEgg, Tom, Ksharp, Art, darrylovia, jaredp, LarryWorley,
This is a tremendous new knowledge for me that I should digest piece by piece.
I really, really appreciate all of your contributions.
Warm regards
Mirisage
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
