Because your requirement is "any value in any of the column", and the code is using a single hash table, the declaration of the key variable needs to be the longest variable you've got. I believe there are several comments in the code stating this. If this was not done, then of course it will not work.
I tested it . It works for missing values, see the following output.
But the order of obs is not the origin order of table ,You can compare between the origin data and output.
By the way, There is another way to promote efficient, that is to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly, then use the first code I posted yesterday.It is fastest way. What do you think about remove all the missing obs firstly?
Origin Data:
aaa bbb ccc ddd
qqq rrr www aaa
rrr ppp mmm lll
. . . .
. . . .
uuu zzz ffff ppp
. . . .
p l m n
jjjj eee rrr ooo
. . . .
sss www . .
. . . eee
Output:
pan1 pan2 pan3 add1 household
aaa bbb ccc ddd 1
qqq rrr www aaa 1
rrr ppp mmm lll 1
uuu zzz ffff ppp 1
jjjj eee rrr ooo 1
sss www 1
eee 1
2
3
4
p l m n 5
6
Ksharp
Oh. I found I problem in my code for DLing's data.
Try this code. It might be correct.
data test; infile datalines truncover; input ( pan1 pan2 pan3 add1 ) ($); datalines; aa bb cc dd ee ff gg hh ii jj kk ll aa ff gg dd mm oo pp pp rr rr ss tt uu vv ww xx yy zz dd xx z1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 3 5 7 9 11 13 15 1 5 9 13 1 9 ; run; data want(keep=pan1 pan2 pan3 add1 household); if _n_ eq 1 then do; declare hash ha(hashexp : 20,ordered: 'a'); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('count','pan1','pan2','pan3','add1'); ha.definedone(); declare hash _ha(hashexp: 20); _ha.definekey('key'); _ha.definedata('_household'); _ha.definedone(); do until(last); set test end=last; count+1; ha.add(); end; end; length key $ 40; array h{4} $ 40 pan1 pan2 pan3 add1; set test; count=_n_; _rc=ha.check(); if _rc eq 0 then do; household+1;_household=household; n=0; do i=1 to 4; if not missing(h{i}) then do; key=h{i}; _ha.replace();end; else n+1; end; do until(x=1); x=1; rc=hi.first(); do while(rc=0); if found then rx=ha.remove(key : _count); found=0; do j=1 to 4; key=h{j};rcc=_ha.check(); if rcc =0 then found=1; end; if found then do; do k=1 to 4; if not missing(h{k}) then do; key=h{k};_ha.replace();end; end; output;x=0; _count=count; end; rc=hi.next(); end; end; if n=4 then do; call missing(pan1,pan2,pan3,add1); output; end; end; run;
By the way, There is another way to promote efficient, that is to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly, then use the first code I posted yesterday.It is fastest way. What do you think about remove all the missing obs firstly?
Ksharp
I asked the OP about that (removing missing data initially and then assigning them sequential numbers at the end) and, as I recall, that was quite permissible.
Art297.
I think that removing missing data initially will simply the question and will promote efficiency a lot.
The mssing obs complicated the question and make code hard to write.
If op could remove missing obs firstly , I believed the speed will be fast a lot and a lot......
I am surprised that you still have not gone to sleep. It is almost midnight at Canada.
But I am now in the Morning.
Ksharp
Ksharp i have tryed your code but i have ran for 80000 obs i ran for 10 mins so i stopped it as if i am having 7-8 crores of data that is the prob ,if any that can make fast...
The fastest way is to to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly.
if cmiss(pan1,pan2,pan3,add1)=4 then delete;
then use the following code:
data test; input ( pan1 pan2 pan3 add1 ) ($); datalines; aaa bbb ccc ddd qqq rrr www aaa rrr ppp mmm lll uuu zzz ffff ppp p l m n jjjj eee rrr ooo sss www . . . . . eee ; run; data want(keep=pan1 pan2 pan3 add1 household); declare hash ha(hashexp : 20); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('count','pan1','pan2','pan3','add1'); ha.definedone(); declare hash _ha(hashexp: 20); _ha.definekey('key'); _ha.definedata('_household'); _ha.definedone(); do until(last); set test end=last; count+1; ha.add(); end; length key $ 40; array h{4} $ 40 pan1 pan2 pan3 add1; do until(num_items eq 0); _rc=hi.first(); household+1;_household=household; do i=1 to 4; if not missing(h{i}) then do; key=h{i}; _ha.replace();end; end; do until(x=1); x=1; if found then rx=ha.remove(key : _count); rc=hi.first(); do while(rc=0); if found then rx=ha.remove(key : _count); found=0; do j=1 to 4; key=h{j};rcc=_ha.check(); if rcc =0 then found=1; end; if found then do; do k=1 to 4; if not missing(h{k}) then do; key=h{k};_ha.replace();end; end; output;x=0; _count=count; end; rc=hi.next(); end; end; num_items=ha.num_items;put num_items=; end; run;
Ksharp
Message was edited by: xia keshan
Removing pure blank records first is certainly a good idea and would remove some special condition handling. It adds another data step and some I/O. If the code can handle it naturally, then it could save the resource. It's a design decision that we all must make all the time.
Or
data test; input ( pan1 pan2 pan3 add1 ) ($); datalines; aaa bbb ccc ddd . . . . qqq rrr www aaa rrr ppp mmm lll uuu zzz ffff ppp p l m n . . . . . . . . jjjj eee rrr ooo sss www . . . . . eee ; run; data test; /*Remove missing obs firstly*/ set test; if cmiss(pan1,pan2,pan3,add1)=4 then delete; run; data want(keep=pan1 pan2 pan3 add1 household); declare hash ha(hashexp : 20); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('count','pan1','pan2','pan3','add1'); ha.definedone(); declare hash _ha(hashexp: 20); _ha.definekey('key'); _ha.definedata('_household'); _ha.definedone(); do until(last); set test end=last; count+1; ha.add(); end; length key $ 40; array h{4} $ 40 pan1 pan2 pan3 add1; _rc=hi.first(); do while(_rc eq 0); household+1;_household=household; do i=1 to 4; if not missing(h{i}) then do; key=h{i}; _ha.replace();end; end; do until(x=1); x=1; rc=hi.first(); do while(rc=0); found=0; do j=1 to 4; key=h{j};rcc=_ha.check(); if rcc =0 then found=1; end; if found then do; do k=1 to 4; if not missing(h{k}) then do; key=h{k};_ha.replace();end; end; output;x=0; _count=count; end; rc=hi.next(); if found then rx=ha.remove(key : _count); end; end; _rc=hi.first(); end; run;
Ksharp
checked on 80000 obs by cmiss i got 600 records and for the rest i have run for the 79,400 it was running for more than 8-10 mins so i have stopped it...
OH. My God.
I doubted that at every iteration there is only to find several matched obs,
So speed will be so low.
I can not think how to fast it any more.
OH.God.
Ksharp
The algorithm is iterative, so the only thing that will make it go fast is to store the whole thing in memory via SASFILE statement. But if the input data is huge, then obviously that's not an option.
If the problem is that big, then it will need processing time to do. My sample code will make multiple passes through the data until group assignment is consistent, but that's just a lot of I/O. There isn't a lot of compute. I always follow the rule of make it work before make it fast.
DLing.Yes. The op's problem is the efficiency not execution correctly.
It is very hard. I can not find another way which is faster than Hash Table. Maybe OP can find a super computer.
Ksharp
I run slightly modified version of KSharp code with exceptional performance. I made minor adjustment to make generation a large sample easier (all numeric fields pan1-pan4 instead of character fields).
So I generate sample data for 100,000 rows containing 4 variables with values from 1-999,999
I optionally sort columns and or rows to compare performance (there should not be huge difference since hash object lookup should not be dependent on order, however may see improvements due to fewer large reassignments, maybe).
First off I must say the performance of this is extrememly good in my systems, I cannot see this taking 8-10 minutes for a set of 80,000 records as mentioned.
options fullstimer;
/* Generate some data for a larger test */
data have;
call streaminit( 12345 );
array pan[4];
do i=1 to 10**4;
do j=1 to dim(pan);
pan
end;
output;
end;
drop i j;
run;
/* End */
/* Start optional sortings for testing */
data have;
set have;
array pan
call sortn(of pan
run;
proc sort data=have; by pan1 pan2 pan3 pan4; run;
/* End */
/* Start Assign Linkage Key */
data want(keep=pan1-pan4 lkey);
declare hash ha(hashexp : 20);
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('count','pan1','pan2','pan3','pan4');
ha.definedone();
declare hash _ha(hashexp: 20);
_ha.definekey('key');
_ha.definedata('_lkey');
_ha.definedone();
do until(last);
set have end=last;
count+1;
ha.add();
end;
array h{4} pan1-pan4;
_rc=hi.first();
do while(_rc eq 0);
lkey+1;_lkey=lkey;
do i=1 to 4;
if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
end;
do until(x=1);
x=1;
rc=hi.first();
do while(rc=0);
found=0;
do j=1 to 4;
key=h{j};rcc=_ha.check();
if rcc =0 then found=1;
end;
if found then do;
do k=1 to 4;
if not missing(h{k}) then do; key=h{k};_ha.replace();end;
end;
output;x=0; _count=count;
end;
rc=hi.next();
if found then rx=ha.remove(key : _count);
end;
end;
_rc=hi.first();
end;
run;
/* End */
Results (average of 3 runs each, time is for final linkage step only):
No Sorting: real time - 6.25 seconds
Sortn only: real time - 4.35 seconds ( downside to utilizing is loss of variable specification if pan1, pan2 etc are meaningful unto themselves )
Sortn and Proc Sort: real time - 4.42 seconds
Proc Sort only: 4.38 seconds
So clearly, at least utilizing my sample data performance is stellar and can improve by preparing the data with a sort. I would still recomment removing instances will all blanks however for my test this is not necessary.
My tests each generated a little over 2,000 unique linkage keys at the end of process and my spot checking all looked good.
This is a fantastic thread, I too am still learning very much about the hash objects, unfortuneatly in my work thus far there are not many useful opportunities to impletement them for gains, thus far.
Hi Fried,
I have cheked your code but i my values are all characeter pan1--pan4 are all characters i have changed in the
array but not working just running so i have stopped it..
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.