Giving one Unique No

DLing · Posted 09-14-2011 08:20 AM

Because your requirement is "any value in any of the column", and the code is using a single hash table, the declaration of the key variable needs to be the longest variable you've got. I believe there are several comments in the code stating this. If this was not done, then of course it will not work.

Ksharp · Posted 09-13-2011 09:43 PM

I tested it . It works for missing values, see the following output.

But the order of obs is not the origin order of table ,You can compare between the origin data and output.

By the way, There is another way to promote efficient, that is to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly, then use the first code I posted yesterday.It is fastest way. What do you think about remove all the missing obs firstly?

Origin Data:

aaa bbb ccc ddd

qqq rrr www aaa

rrr ppp mmm lll

. . . .

uuu zzz ffff ppp

. . . .

p l m n

jjjj eee rrr ooo

. . . .

sss www . .

. . . eee

Output:

pan1    pan2    pan3    add1    household
aaa    bbb    ccc    ddd    1
qqq    rrr    www    aaa    1
rrr    ppp    mmm    lll    1
uuu    zzz    ffff    ppp    1
jjjj    eee    rrr    ooo    1
sss    www            1
            eee    1
                2
                3
                4
p    l    m    n    5
                6

Ksharp

aaa   bbb    ccc     ddd
qqq   rrr       www   aaa
rrr     ppp    mmm lll
. . . .
. . . .
uuu   zzz    ffff      ppp
. . . .
p       l        m          n
jjjj    eee     rrr       ooo
.    .   .   .
sss   www .    .
.         .        .        eee

Ksharp · Posted 09-13-2011 09:54 PM

Oh. I found I problem in my code for DLing's data.

Try this code. It might be correct.

data test;
infile datalines truncover;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aa bb cc dd
ee ff gg hh
ii jj kk ll
aa ff gg dd
mm oo pp
pp rr rr ss
 
tt uu vv ww
xx yy zz dd
xx z1
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
1 3
5 7
9 11
13 15
1 5
9 13
1 9
;
run;

data want(keep=pan1 pan2 pan3 add1 household);
if _n_ eq 1 then do;
 declare hash ha(hashexp : 20,ordered: 'a');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();

do until(last);
 set test end=last;
 count+1;
 ha.add();
end;
end;
length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
set test;
count=_n_; _rc=ha.check();
if _rc eq 0 then do;
 household+1;_household=household; n=0;
 do i=1 to 4;
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
    else n+1;
 end;   
 do until(x=1);
    x=1;
    rc=hi.first();
    do while(rc=0);
      if found then rx=ha.remove(key : _count);
      found=0;
      do j=1 to 4;
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count; 
                    end;
      rc=hi.next();
    end;
 end;  
 if n=4 then do; call missing(pan1,pan2,pan3,add1); output; end;
end; 
run;

By the way, There is another way to promote efficient, that is to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly, then use the first code I posted yesterday.It is fastest way. What do you think about remove all the missing obs firstly?

Ksharp

art297 · Posted 09-13-2011 10:27 PM

I asked the OP about that (removing missing data initially and then assigning them sequential numbers at the end) and, as I recall, that was quite permissible.

Ksharp · Posted 09-13-2011 11:21 PM

Art297.

I think that removing missing data initially will simply the question and will promote efficiency a lot.

The mssing obs complicated the question and make code hard to write.

If op could remove missing obs firstly , I believed the speed will be fast a lot and a lot......

I am surprised that you still have not gone to sleep. It is almost midnight at Canada.

But I am now in the Morning.

Ksharp

sas_Forum · Posted 09-14-2011 02:52 AM

Ksharp i have tryed your code but i have ran for 80000 obs i ran for 10 mins so i stopped it as if i am having 7-8 crores of data that is the prob ,if any that can make fast...

Ksharp · Posted 09-14-2011 05:05 AM

The fastest way is to to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly.

if cmiss(pan1,pan2,pan3,add1)=4 then delete;

then use the following code:

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n
jjjj    eee     rrr       ooo    
sss   www  .    .             
 .         .        .        eee 
;
run;





data want(keep=pan1 pan2 pan3 add1 household);
 declare hash ha(hashexp : 20);
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();

do until(last);
 set test end=last;
 count+1;
 ha.add();
end;

length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
do until(num_items eq 0);
_rc=hi.first();
 household+1;_household=household;
 do i=1 to 4;
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
 end;   
 do until(x=1);
    x=1;       if found then rx=ha.remove(key : _count);
    rc=hi.first();
    do while(rc=0);
      if found then rx=ha.remove(key : _count);
      found=0;
      do j=1 to 4;
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count; 
                    end;
      rc=hi.next();
     end;
 end;  
num_items=ha.num_items;put num_items=;
end; 
run;

Ksharp

Message was edited by: xia keshan

DLing · Posted 09-14-2011 08:26 AM

Removing pure blank records first is certainly a good idea and would remove some special condition handling. It adds another data step and some I/O. If the code can handle it naturally, then it could save the resource. It's a design decision that we all must make all the time.

Ksharp · Posted 09-14-2011 05:42 AM

Or

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
. . . .
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n
. . . .
. . . .
jjjj    eee     rrr       ooo    
sss   www  .    .             
 .         .        .        eee 
;
run;

data test; /*Remove missing obs firstly*/
 set test;
 if cmiss(pan1,pan2,pan3,add1)=4 then delete;
run;


data want(keep=pan1 pan2 pan3 add1 household);
 declare hash ha(hashexp : 20);
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();

do until(last);
 set test end=last;
 count+1;
 ha.add();
end;

length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
_rc=hi.first();
do while(_rc eq 0);
 household+1;_household=household;
 do i=1 to 4;
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
 end;   
 do until(x=1);
    x=1;
    rc=hi.first();
    do while(rc=0);
      found=0;
      do j=1 to 4;
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count; 
                    end;
      rc=hi.next();
      if found then rx=ha.remove(key : _count);
     end;
 end;  
_rc=hi.first();
end; 
run;

Ksharp

sas_Forum · Posted 09-14-2011 06:38 AM

checked on 80000 obs by cmiss i got 600 records and for the rest i have run for the 79,400 it was running for more than 8-10 mins so i have stopped it...

Ksharp · Posted 09-14-2011 06:51 AM

OH. My God.

I doubted that at every iteration there is only to find several matched obs,

So speed will be so low.

I can not think how to fast it any more.

OH.God.

Ksharp

DLing · Posted 09-14-2011 08:37 AM

The algorithm is iterative, so the only thing that will make it go fast is to store the whole thing in memory via SASFILE statement. But if the input data is huge, then obviously that's not an option.

If the problem is that big, then it will need processing time to do. My sample code will make multiple passes through the data until group assignment is consistent, but that's just a lot of I/O. There isn't a lot of compute. I always follow the rule of make it work before make it fast.

Ksharp · Posted 09-14-2011 09:19 AM

DLing.Yes. The op's problem is the efficiency not execution correctly.

It is very hard. I can not find another way which is faster than Hash Table. Maybe OP can find a super computer.

Ksharp

FriedEgg · Posted 09-14-2011 02:30 PM

I run slightly modified version of KSharp code with exceptional performance. I made minor adjustment to make generation a large sample easier (all numeric fields pan1-pan4 instead of character fields).

So I generate sample data for 100,000 rows containing 4 variables with values from 1-999,999

I optionally sort columns and or rows to compare performance (there should not be huge difference since hash object lookup should not be dependent on order, however may see improvements due to fewer large reassignments, maybe).

First off I must say the performance of this is extrememly good in my systems, I cannot see this taking 8-10 minutes for a set of 80,000 records as mentioned.

options fullstimer;

/* Generate some data for a larger test */

data have;

call streaminit( 12345 );

array pan[4];

do i=1 to 10**4;

do j=1 to dim(pan);

pan=abs(mod(int(rand('cauchy')*10**4),10**5));

end;

output;

end;

drop i j;

run;

/* End */

/* Start optional sortings for testing */

data have;

set have;

array pan

pan1-pan4;

call sortn(of pan

);

run;

proc sort data=have; by pan1 pan2 pan3 pan4; run;

/* End */

/* Start Assign Linkage Key */

data want(keep=pan1-pan4 lkey);

declare hash ha(hashexp : 20);

declare hiter hi('ha');

ha.definekey('count');

ha.definedata('count','pan1','pan2','pan3','pan4');

ha.definedone();

declare hash _ha(hashexp: 20);

_ha.definekey('key');

_ha.definedata('_lkey');

_ha.definedone();

do until(last);

set have end=last;

count+1;

ha.add();

end;

array h{4} pan1-pan4;

_rc=hi.first();

do while(_rc eq 0);

lkey+1;_lkey=lkey;

do i=1 to 4;

if not missing(h{i}) then do; key=h{i}; _ha.replace();end;

end;

do until(x=1);

x=1;

rc=hi.first();

do while(rc=0);

found=0;

do j=1 to 4;

key=h{j};rcc=_ha.check();

if rcc =0 then found=1;

end;

if found then do;

do k=1 to 4;

if not missing(h{k}) then do; key=h{k};_ha.replace();end;

end;

output;x=0; _count=count;

end;

rc=hi.next();

if found then rx=ha.remove(key : _count);

end;

_rc=hi.first();

end;

run;

/* End */

Results (average of 3 runs each, time is for final linkage step only):

No Sorting: real time - 6.25 seconds

Sortn only: real time - 4.35 seconds ( downside to utilizing is loss of variable specification if pan1, pan2 etc are meaningful unto themselves )

Sortn and Proc Sort: real time - 4.42 seconds

Proc Sort only: 4.38 seconds

So clearly, at least utilizing my sample data performance is stellar and can improve by preparing the data with a sort. I would still recomment removing instances will all blanks however for my test this is not necessary.

My tests each generated a little over 2,000 unique linkage keys at the end of process and my spot checking all looked good.

This is a fantastic thread, I too am still learning very much about the hash objects, unfortuneatly in my work thus far there are not many useful opportunities to impletement them for gains, thus far.

sas_Forum · Posted 09-15-2011 02:38 AM

Hi Fried,

I have cheked your code but i my values are all characeter pan1--pan4 are all characters i have changed in the

array but not working just running so i have stopped it..

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Re: Giving one Unique No

Giving one Unique No

Re: Giving one Unique No

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away