I ran a slightly modified version of KSharp's code and got exceptional performance. I made a minor adjustment to make generating a large sample easier (all numeric fields pan1-pan4 instead of character fields). I generate sample data for 100,000 rows containing 4 variables with values from 1-999,999, and I optionally sort the columns and/or rows to compare performance. There should not be a huge difference, since hash object lookups should not depend on order, but there may be some improvement from fewer large reassignments. First off, I must say the performance of this is extremely good on my system; I cannot see this taking 8-10 minutes for a set of 80,000 records as mentioned.

options fullstimer;

/* Generate some data for a larger test */
data have;
   call streaminit(12345);
   array pan[4];
   do i = 1 to 10**4;
      do j = 1 to dim(pan);
         pan[j] = abs(mod(int(rand('cauchy')*10**4), 10**5));
      end;
      output;
   end;
   drop i j;
run;
/* End */

/* Start optional sortings for testing */
data have;
   set have;
   array pan pan1-pan4;
   call sortn(of pan[*]);
run;

proc sort data=have;
   by pan1 pan2 pan3 pan4;
run;
/* End */

/* Start Assign Linkage Key */
data want(keep=pan1-pan4 lkey);
   declare hash ha(hashexp: 20);
   declare hiter hi('ha');
   ha.definekey('count');
   ha.definedata('count','pan1','pan2','pan3','pan4');
   ha.definedone();

   declare hash _ha(hashexp: 20);
   _ha.definekey('key');
   _ha.definedata('_lkey');
   _ha.definedone();

   /* Load every observation into hash ha, keyed by a row counter */
   do until(last);
      set have end=last;
      count + 1;
      ha.add();
   end;

   array h{4} pan1-pan4;

   /* Each outer pass seeds a new linkage key, then repeatedly sweeps the
      rows remaining in ha until no more of them share a pan value with
      the current group; matched rows are output and removed from ha */
   _rc = hi.first();
   do while(_rc eq 0);
      lkey + 1;
      _lkey = lkey;
      do i = 1 to 4;
         if not missing(h{i}) then do;
            key = h{i};
            _ha.replace();
         end;
      end;
      do until(x = 1);
         x = 1;
         rc = hi.first();
         do while(rc = 0);
            found = 0;
            do j = 1 to 4;
               key = h{j};
               rcc = _ha.check();
               if rcc = 0 then found = 1;
            end;
            if found then do;
               do k = 1 to 4;
                  if not missing(h{k}) then do;
                     key = h{k};
                     _ha.replace();
                  end;
               end;
               output;
               x = 0;
               _count = count;
            end;
            rc = hi.next();
            /* remove after advancing the iterator so its position stays valid */
            if found then rx = ha.remove(key: _count);
         end;
      end;
      _rc = hi.first();
   end;
run;
/* End */

Results (average of 3 runs each; the time is for the final linkage step only):

No sorting: real time 6.25 seconds
SORTN only: real time 4.35 seconds (the downside of SORTN is that the original column positions are lost, which matters if pan1, pan2, etc. are meaningful in themselves)
SORTN and PROC SORT: real time 4.42 seconds
PROC SORT only: real time 4.38 seconds

So clearly, at least with my sample data, performance is stellar and can be improved further by preparing the data with a sort. I would still recommend removing records where all the pan values are blank, although for my test this was not necessary (a rough sketch of such a filter is below). My tests each generated a little over 2,000 unique linkage keys at the end of the process, and my spot checks all looked good.

This is a fantastic thread. I too am still learning a great deal about hash objects; unfortunately, in my work so far there have not been many useful opportunities to implement them for real gains.
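
In case it helps anyone, here is a minimal sketch of the blank-removal filter mentioned above. It is not part of KSharp's code; it just assumes the input is the same HAVE table with numeric pan1-pan4:

/* Drop rows where all four pan values are missing */
data have;
   set have;
   if n(of pan1-pan4) > 0;   /* keep rows with at least one non-missing pan */
run;

And a rough way to spot-check the output, assuming WANT still carries numeric pan1-pan4 plus the lkey variable from the linkage step (again, just a sketch to adapt as needed):

/* How many linkage keys were assigned */
proc sql;
   select count(distinct lkey) as n_keys
   from want;
quit;

/* Reshape to one pan value per row, then confirm that no pan value
   was linked to more than one lkey (this query should return no rows) */
data check;
   set want;
   array h{4} pan1-pan4;
   do i = 1 to 4;
      if not missing(h{i}) then do;
         pan = h{i};
         output;
      end;
   end;
   keep pan lkey;
run;

proc sql;
   select pan, count(distinct lkey) as n_keys
   from check
   group by pan
   having calculated n_keys > 1;
quit;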