BookmarkSubscribeRSS Feed
art297
Opal | Level 21

Does it make any difference if your records that only contain missing values aren't number sequentially?  i.e., If all of the records can be assigned to their correct groups, and say there are 60,000 groups assigned, can the records with all missing values be numbered 60,001, 60,002, etc.?

sas_Forum
Calcite | Level 5

yes it can be numbered no issue..as there is no need to sort the data and it can give the sequence number as it is..in your order if all the four are balank then you can give new number as there is no need that all the group should give first then the remaing obs that are blank that shd give...

Ksharp
Super User

OH. My God! Your question is very very complicated ! You are searching a tree.

I am afraid that you need lots of time to run code for the sake of your complicated algorithm and large table.

It is really really a big tough task for me .:smileyshocked:

And Still not quite sure my code is what you need.

BTW, my code above is actually wrong too.

I rethink and recode it again. It is really difficulty.

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n      
jjjj    eee     rrr       ooo    
.    .   .   .             
sss   www  .    .             
 .         .        .        eee       
;
run;
data want(keep=pan1 pan2 pan3 add1 household);
if _n_ =1 then do;
 declare hash ha(hashexp : 20,ordered: 'Y');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('_pan1','_pan2','_pan3','_add1');
  ha.definedone();
do until(last);
 set test(rename=(pan1=_pan1 pan2=_pan2 pan3=_pan3 add1=_add1)) end=last;
 count+1;
 if not missing(_pan1) or not missing(_pan2) or not missing(_pan3) or not missing(_add1) then ha.add();
end;

 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();
end;
   set test nobs=_nobs;
   length key $ 40;
   array h{4} $ 40 pan1 pan2 pan3 add1;
   array _h{4} $ 40 _pan1 _pan2 _pan3 _add1;
   do i=1 to 4;
    key=h{i};rc1=_ha.find();
     if rc1=0 then do; found=1;household=_household;end;
   end;
   if not found then do;
                     n+1; household=n;_household=n;
                      do i=1 to 4;
                       if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
                      end;                 
                    
                          do x=1 to _nobs;
                       rc=hi.first();
                       do while(rc=0);
                         _found=.;
                         do j=1 to 4;
                           key=_h(j);_rc=_ha.find();
                                if _rc =0 and household = _household then _found=1;
                               end;
                               if _found then do;
                                                do k=1 to 4;
                                            if not missing(_h{k}) then do; key=_h(k);_ha.replace();end;
                                                end;
                                                  end;
                          rc=hi.next();
                       end;
                          end;
                         end;
run;

Ksharp

sas_Forum
Calcite | Level 5

sorry sharp i have reyed your new code with 80000 obs it was not working ...i am getting as running i have checked for 25 mins and stopped it

Ksharp
Super User

Can you use 1000 or 2000 obs to test my code.

And post the output.    point where is wrong?

Ksharp

sas_Forum
Calcite | Level 5

I am not saying about the output i have runned your code now on 800 records it was still running i have waited for 10 mins and stopped the process..

Ksharp
Super User

Then try this code. and using 500 obs.

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n      
jjjj    eee     rrr       ooo    
.    .   .   .             
sss   www  .    .             
 .         .        .        eee       
;
run;
data want(keep=pan1 pan2 pan3 add1 household);
if _n_ =1 then do;
 declare hash ha(hashexp : 20,ordered: 'Y');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('_pan1','_pan2','_pan3','_add1');
  ha.definedone();
do until(last);
 set test(rename=(pan1=_pan1 pan2=_pan2 pan3=_pan3 add1=_add1)) end=last;
 count+1;
 if not missing(_pan1) or not missing(_pan2) or not missing(_pan3) or not missing(_add1) then ha.add();
end;

 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();
end;
   set test ;
   length key $ 40;
   array h{4} $ 40 pan1 pan2 pan3 add1;
   array _h{4} $ 40 _pan1 _pan2 _pan3 _add1;
   do i=1 to 4;
    key=h{i};rc1=_ha.find();
     if rc1=0 then do; found=1;household=_household;end;
   end;
   if not found then do;
                     n+1; household=n;_household=n;
                      do i=1 to 4;
                       if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
                      end;                 
                    
                          do x=1 to 10;
                       rc=hi.first();
                       do while(rc=0);
                         _found=.;
                         do j=1 to 4;
                           key=_h(j);_rc=_ha.find();
                                if _rc =0 and household = _household then _found=1;
                               end;
                               if _found then do;
                                                do k=1 to 4;
                                            if not missing(_h{k}) then do; key=_h(k);_ha.replace();end;
                                                end;
                                                  end;
                          rc=hi.next();
                       end;
                          end;
                         end;
run;

Ksharp

sas_Forum
Calcite | Level 5

Hi ksharp...

I have tryed for 800 rec it worked and again ran on 8000 rec again same prob waited for 6 mins and stopped it

Ksharp
Super User

Then change

                          do x=1 to 4;

                       rc=hi.first();

                       do while(rc=0);

                         _found=.;

                         do j=1 to 4;

                           key=_h(j);_rc=_ha.find();

                                if _rc =0 and household = _household then _found=1;

                               end;

                               if _found then do;

                                                do k=1 to 4;

                                            if not missing(_h{k}) then do; key=_h(k);_ha.replace();end;

                                                end;

                                                  end;

                          rc=hi.next();

                       end;

                          end;

                         end;

run;

NOTE: this value is more small then you will get more wrong result.

Ksharp

sas_Forum
Calcite | Level 5

same prob sharp it was running only for 800 rec it worked and for 8000 it was running for more than 5 mins so i have stopped it..

DLing
Obsidian | Level 7

Gentlemen, just to like add some thoughts on the structure of the problem to this already very long discussion.

1) The unique values within all the fields can be thought of as "nodes"

2) Each record is a statement of linkages or arcs that would connect some nodes together, forming a group

3) Out of each node, there would be multiple linkages (every record that contains that value is linkage statement)

This problem is a graph labeling problem: label the nodes so that all nodes that belong to a group share the same label, and all disconnected groups have a different label.

The algorithms for the graph labeling problem is not hard to grasp, but is tricky to code in SAS.  The natural language to describe this process requires recursion.  SAS data step has no builtin capability (as far as I know) that supports recusion naturally.  This poses challenges on both the data structure and the algorithm to traverse the nodes and examining / propagating label changes throughout the tree.

In my humble opinion, provably correctly code must either show recursion or some sort of loop until no more changes in order to propagate the group id throughout the nodes that all belong in that group at arbitrary depth.  Any code that loops for a finite number of passes without recursion is, by definition, flawed as there would be some data structure that extends beyond the number of passes implied by the code.

I have faced similar problems before where we have upwards of 20 million nodes and hundreds of thousands of arcs.  We wrote exceedingly crude SAS data step code to get the job done using the "loop through all nodes and all arcs until no more label changes are required condition is satisfied as we traversed the entire forest".  Running on big UNIX servers, the job took multiple hours to run.  We ended up relying on commercial software to solve the problem as the problem rapidly grew more complex (fuzzy address matching is notoriously difficult, and this was just one of the multiple match field).

This is a non-trivial problem when one does not have the right tools that support the right programming paradigm.  Just thought I'd call out the problem and algorithm characteristics as people try to assist the OP.  To reiterate, any code with finite number of passes without recursion, by definition, can only propagate so deep into the forest such that forests deeper than that would not be labeled correctly.

Ksharp
Super User

In my humble opinion, provably correctly code must either show recursion or some sort of loop until no more changes in order to propagate the group id throughout the nodes that all belong in that group at arbitrary depth.  Any code that loops for a finite number of passes without recursion is, by definition, flawed as there would be some data structure that extends beyond the number of passes implied by the code.

I understand what is your mean. But I do not think so.

as far as I loop the number which is the nobs of a table .

i.e. do x=1 to nobs;

I will make sure you can get all the labels.

I optimize my code to not search obs which has already searched.So I believe it is faster.

BTW. Hash Table is a good recursion  tool in sas.

Ksharp

Ksharp
Super User

I think and code again. It is really a big challenge.But I enjoy it Smiley Happy.

The following code would be fasten a lot , I believed.

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n      
jjjj    eee     rrr       ooo    
.    .   .   .             
sss   www  .    .             
 .         .        .        eee       
;
run;
data want(keep=pan1 pan2 pan3 add1 household);
if _n_ =1 then do;
 declare hash ha(hashexp : 20,ordered: 'Y');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','_pan1','_pan2','_pan3','_add1','searched');
  ha.definedone();
do until(last);
 set test(rename=(pan1=_pan1 pan2=_pan2 pan3=_pan3 add1=_add1)) end=last;
 count+1;searched=0;
 if not missing(_pan1) or not missing(_pan2) or not missing(_pan3) or not missing(_add1) then ha.add();
end;

 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();
end;
   set test ;_count+1;
   length key $ 40;
   array h{4} $ 40 pan1 pan2 pan3 add1;
   array _h{4} $ 40 _pan1 _pan2 _pan3 _add1;

count=_count;searched=0; r=ha.find();
if not searched then do;
                     n+1; household=n;_household=n;
                      do i=1 to 4;
                       if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
                      end;                 
                    
                  do x=1 to 20;
                     rc=hi.first();
                     do while(rc=0);
                            if not searched then do;
                         _found=0;
                         do j=1 to 4;
                           key=_h(j);_rc=_ha.check();
                           if _rc =0 then _found=1;
                         end;
                               if _found then do;
                                                do k=1 to 4;
                                            if not missing(_h{k}) then do; key=_h(k);_ha.replace();end;
                                                end;
                                                searched=1;ha.replace();
                                              end;
                                                   end;
                          rc=hi.next();
                       end;
                   end;
end;
else do;
       do z=1 to 4;
       key=h{z};rr=_ha.find();
       if rr=0 then household=_household;
        end;
     end;
run;

Ksharp

DLing
Obsidian | Level 7

Before you blow your brains out... just because it works with some test data does not guarantee it will work with arbitrarily deep data.  Worst possible example - everything is connected somehow into one group.

Ksharp
Super User

Just as I said , If the number of loop is the nobs of table. I will make sure you can get all the labels,

include - everything is connected somehow into one group.  But it is needed Monster Computer.

Ksharp


SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 124 replies
  • 3693 views
  • 4 likes
  • 7 in conversation