BookmarkSubscribeRSS Feed
DLing
Obsidian | Level 7

Because your requirement is "any value in any of the column", and the code is using a single hash table, the declaration of the key variable needs to be the longest variable you've got.  I believe there are several comments in the code stating this.  If this was not done, then of course it will not work.

Ksharp
Super User

I tested it . It works for missing values, see the following output.

But the order of obs  is not the origin order of table ,You can compare between the origin data and output. 

By the way, There is another way to promote efficient, that is to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly, then use the first code I posted yesterday.It is fastest way. What do you think about remove all the missing obs firstly?

Origin Data:

aaa   bbb    ccc     ddd   

qqq   rrr       www   aaa  

rrr     ppp    mmm lll      

. . . .

. . . .

uuu   zzz    ffff      ppp 

. . . .

p       l        m          n     

jjjj    eee     rrr       ooo   

.    .   .   .            

sss   www  .    .            

.         .        .        eee  

Output:

pan1    pan2    pan3    add1    household
aaa    bbb    ccc    ddd    1
qqq    rrr    www    aaa    1
rrr    ppp    mmm    lll    1
uuu    zzz    ffff    ppp    1
jjjj    eee    rrr    ooo    1
sss    www            1
            eee    1
                2
                3
                4
p    l    m    n    5
                6

Ksharp

aaa   bbb    ccc     ddd   
qqq   rrr       www   aaa  
rrr     ppp    mmm lll      
. . . .
. . . .
uuu   zzz    ffff      ppp 
. . . .
p       l        m          n     
jjjj    eee     rrr       ooo   
.    .   .   .            
sss   www  .    .            
.         .        .        eee  
Ksharp
Super User

Oh. I found I problem in my code for DLing's data.

Try this code. It might be correct.

data test;
infile datalines truncover;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aa bb cc dd
ee ff gg hh
ii jj kk ll
aa ff gg dd
mm oo pp
pp rr rr ss
 
tt uu vv ww
xx yy zz dd
xx z1
1 2
3 4
5 6
7 8
9 10
11 12
13 14
15 16
1 3
5 7
9 11
13 15
1 5
9 13
1 9
;
run;

data want(keep=pan1 pan2 pan3 add1 household);
if _n_ eq 1 then do;
 declare hash ha(hashexp : 20,ordered: 'a');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();

do until(last);
 set test end=last;
 count+1;
 ha.add();
end;
end;
length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
set test;
count=_n_; _rc=ha.check();
if _rc eq 0 then do;
 household+1;_household=household; n=0;
 do i=1 to 4;
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
    else n+1;
 end;   
 do until(x=1);
    x=1;
    rc=hi.first();
    do while(rc=0);
      if found then rx=ha.remove(key : _count);
      found=0;
      do j=1 to 4;
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count; 
                    end;
      rc=hi.next();
    end;
 end;  
 if n=4 then do; call missing(pan1,pan2,pan3,add1); output; end;
end; 
run;

By the way, There is another way to promote efficient, that is to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly, then use the first code I posted yesterday.It is fastest way. What do you think about remove all the missing obs firstly?

Ksharp

art297
Opal | Level 21

I asked the OP about that (removing missing data initially and then assigning them sequential numbers at the end) and, as I recall, that was quite permissible.

Ksharp
Super User

Art297.

I think that  removing missing data initially  will simply the question and will promote efficiency a lot.

The mssing obs complicated the question and make code hard to write.

If op could remove missing obs firstly , I believed the speed will be fast a lot and a lot......

I am surprised that you still have not gone to sleep. It is almost midnight at Canada.

But I am now in the Morning.

Ksharp

sas_Forum
Calcite | Level 5

Ksharp i have tryed your code but i have ran for 80000 obs i ran for 10 mins so i stopped it as if i am having 7-8 crores of data that is the prob ,if any that can make fast...

Ksharp
Super User

The fastest way is to  to remove the missing obs(i.e. these four values are all missing ) from the origin table firstly.

if cmiss(pan1,pan2,pan3,add1)=4 then delete;

then use the following code:

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n
jjjj    eee     rrr       ooo    
sss   www  .    .             
 .         .        .        eee 
;
run;





data want(keep=pan1 pan2 pan3 add1 household);
 declare hash ha(hashexp : 20);
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();

do until(last);
 set test end=last;
 count+1;
 ha.add();
end;

length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
do until(num_items eq 0);
_rc=hi.first();
 household+1;_household=household;
 do i=1 to 4;
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
 end;   
 do until(x=1);
    x=1;       if found then rx=ha.remove(key : _count);
    rc=hi.first();
    do while(rc=0);
      if found then rx=ha.remove(key : _count);
      found=0;
      do j=1 to 4;
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count; 
                    end;
      rc=hi.next();
     end;
 end;  
num_items=ha.num_items;put num_items=;
end; 
run;

Ksharp

Message was edited by: xia keshan

DLing
Obsidian | Level 7

Removing pure blank records first is certainly a good idea and would remove some special condition handling.  It adds another data step and some I/O.  If the code can handle it naturally, then it could save the resource.  It's a design decision that we all must make all the time.

Ksharp
Super User

Or

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
. . . .
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n
. . . .
. . . .
jjjj    eee     rrr       ooo    
sss   www  .    .             
 .         .        .        eee 
;
run;

data test; /*Remove missing obs firstly*/
 set test;
 if cmiss(pan1,pan2,pan3,add1)=4 then delete;
run;


data want(keep=pan1 pan2 pan3 add1 household);
 declare hash ha(hashexp : 20);
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();

do until(last);
 set test end=last;
 count+1;
 ha.add();
end;

length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
_rc=hi.first();
do while(_rc eq 0);
 household+1;_household=household;
 do i=1 to 4;
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
 end;   
 do until(x=1);
    x=1;
    rc=hi.first();
    do while(rc=0);
      found=0;
      do j=1 to 4;
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count; 
                    end;
      rc=hi.next();
      if found then rx=ha.remove(key : _count);
     end;
 end;  
_rc=hi.first();
end; 
run;

Ksharp

sas_Forum
Calcite | Level 5

checked on 80000 obs by cmiss i got 600 records and for the rest i have run for the 79,400 it was running for more than 8-10 mins so i have stopped it...

Ksharp
Super User

OH. My God.

I doubted that at every iteration there is only to find several matched obs,

So speed will be so low.

I can not think how to fast it any more.

OH.God.

Ksharp

DLing
Obsidian | Level 7

The algorithm is iterative, so the only thing that will make it go fast is to store the whole thing in memory via SASFILE statement.  But if the input data is huge, then obviously that's not an option.

If the problem is that big, then it will need processing time to do.  My sample code will make multiple passes through the data until group assignment is consistent, but that's just a lot of I/O.  There isn't a lot of compute.  I always follow the rule of make it work before make it fast.

Ksharp
Super User

DLing.Yes. The op's problem is the efficiency not execution correctly.

It is very hard. I can not find another way which is faster than Hash Table. Maybe OP can find a super computer.

Ksharp

FriedEgg
SAS Employee

I run slightly modified version of KSharp code with exceptional performance.  I made minor adjustment to make generation a large sample easier (all numeric fields pan1-pan4 instead of character fields).

So I generate sample data for 100,000 rows containing 4 variables with values from 1-999,999

I optionally sort columns and or rows to compare performance (there should not be huge difference since hash object lookup should not be dependent on order, however may see improvements due to fewer large reassignments, maybe).

First off I must say the performance of this is extrememly good in my systems, I cannot see this taking 8-10 minutes for a set of 80,000 records as mentioned.

options fullstimer;

/* Generate some data for a larger test */

data have;

call streaminit( 12345 );

array pan[4];

do i=1 to 10**4;

  do j=1 to dim(pan);

   pan=abs(mod(int(rand('cauchy')*10**4),10**5));

  end;

  output;

end;

drop i j;

run;

/* End */

/* Start optional sortings for testing */

data have;

set have;

array pan

  • pan1-pan4;
  • call sortn(of pan

  • );
  • run;

    proc sort data=have; by pan1 pan2 pan3 pan4; run;

    /* End */

    /* Start Assign Linkage Key */

    data want(keep=pan1-pan4 lkey);

    declare hash ha(hashexp : 20);

    declare hiter hi('ha');

      ha.definekey('count');

      ha.definedata('count','pan1','pan2','pan3','pan4');

      ha.definedone();

    declare hash _ha(hashexp: 20);

      _ha.definekey('key');

      _ha.definedata('_lkey');

      _ha.definedone();

    do until(last);

    set have end=last;

    count+1;

    ha.add();

    end;

    array h{4} pan1-pan4;

    _rc=hi.first();

    do while(_rc eq 0);

    lkey+1;_lkey=lkey;

    do i=1 to 4;

       if not missing(h{i}) then do; key=h{i}; _ha.replace();end;

    end;  

    do until(x=1);

        x=1;

        rc=hi.first();

        do while(rc=0);

          found=0;

          do j=1 to 4;

          key=h{j};rcc=_ha.check();

          if rcc =0 then found=1;

          end;

          if found then do;

                          do k=1 to 4;

                          if not missing(h{k}) then do; key=h{k};_ha.replace();end;

                          end;

                          output;x=0; _count=count;

                        end;

          rc=hi.next();

          if found then rx=ha.remove(key : _count);

         end;

    end; 

    _rc=hi.first();

    end;

    run;

    /* End */

    Results (average of 3 runs each, time is for final linkage step only):

    No Sorting: real time - 6.25 seconds

    Sortn only: real time - 4.35 seconds ( downside to utilizing is loss of variable specification if pan1, pan2 etc are meaningful unto themselves )

    Sortn and Proc Sort: real time - 4.42 seconds

    Proc Sort only: 4.38 seconds

    So clearly, at least utilizing my sample data performance is stellar and can improve by preparing the data with a sort.  I would still recomment removing instances will all blanks however for my test this is not necessary.

    My tests each generated a little over 2,000 unique linkage keys at the end of process and my spot checking all looked good.

    This is a fantastic thread, I too am still learning very much about the hash objects, unfortuneatly in my work thus far there are not many useful opportunities to impletement them for gains, thus far.

    sas_Forum
    Calcite | Level 5

    Hi Fried,

    I have cheked your code but i my values are all characeter pan1--pan4 are all characters i have changed in the

    array but not working just running so i have stopped it..

    SAS Innovate 2025: Save the Date

     SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

    Save the date!

    What is Bayesian Analysis?

    Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

    Find more tutorials on the SAS Users YouTube channel.

    SAS Training: Just a Click Away

     Ready to level-up your skills? Choose your own adventure.

    Browse our catalog!

    Discussion stats
    • 124 replies
    • 3699 views
    • 4 likes
    • 7 in conversation