Reg:Getting Linkages of diierent variables.

R_Win · Posted 11-21-2011 03:12 AM

obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 4
5 104 c 106 107 108 109 4

obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=2
obs=4 as a5 of obs1 and obs 4 of a1 having link
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=4

i want like this

art297 · Posted 11-21-2011 08:42 AM

Your question is very similar to one asked a few months ago. Take a look at: http://communities.sas.com/message/105224#105224

Ksharp · Posted 11-22-2011 01:47 AM

Your post is ambiguous.

What logic you want to apply to variable R_id.

If obs 4 has not linkage with obs 2, what value of R_id will be set for obs 4?

R_Win · Posted 11-22-2011 05:10 AM

obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 1
5 104 c 106 107 108 109 1

obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=3
obs=4 as a5 of obs1 and obs 4 of a1 having link r_id=1
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=1

sory i have corrected it

R_Win · Posted 11-22-2011 05:10 AM

obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 1
5 104 c 106 107 108 109 1

obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=3
obs=4 as a5 of obs1 and obs 4 of a1 having link r_id=1
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=1

art297 · Posted 11-22-2011 09:26 AM

Did you look at the post I suggested? It sure seems like the same problem. Try the following:

data have;

input (a1 a2 a3 a4 a5 a6) ($);

cards;

1 a 5 7 8 9

. . . . . .

11 b 12 13 14 15

9 c 100 101 102 103

104 c 106 107 108 109

;

data want (keep=a1-a6 r_id);

if _n_ eq 1 then do;

declare hash ha(hashexp: 16);

ha.definekey('key');

ha.definedata('hhold');

ha.definedone();

end;

set have;

array _house{*} $ a1-a6;

do i=1 to dim(_house);

key=_house{i};

call missing(hhold);

rc=ha.find();

if rc=0 then do;

found=1;

r_id=hhold;

leave;

end;

if not found then do;

n+1;

r_id=n;

end;

do j=1 to dim(_house);

if not missing(_house{j}) then do;

key=_house{j};

hhold=r_id;

ha.replace();

end;

run;

Ksharp · Posted 11-22-2011 10:40 PM

Yes. I have already written it for sas_Form.

If your obs in a cluster are randomly stored in your dataset.

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
. . . .
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n
. . . .
. . . .
jjjj    eee     rrr       ooo    
sss   www  .    .             
 .         .        .        eee 
;
run;

 

options compress=yes;
data want(keep=pan1 pan2 pan3 add1 household);
/*to make speed faster*/
 declare hash ha(hashexp : 20,ordered : 'a');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','pan1','pan2','pan3','add1');
  ha.definedone();
 declare hash _ha(hashexp: 20,ordered : 'a');
  _ha.definekey('key');
  _ha.definedone();

do until(last);
 set test end=last; 
/*Remove obs which variable's are all missing firstly*/
 if cmiss(pan1,pan2,pan3,add1) lt 4 then do;
                                           count+1;
                                           ha.add();
                                         end;
end;

length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
do while(_rc eq 0); *until the end of Hash Table HA;
/*assign a unique cluster flag(i.e. household)*/
 household+1;
 do i=1 to 4;
/*push not missing value of current obs into another Hash Table _HA*/
   if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
 end;   
/*start to run over Hash Table HA ,until can not find any more 
 observation which is the same cluster with current observation*/
 do until(x=1);
    x=1;
/*copy the first obs from Hash Table HA into PDV*/
    rc=hi.first();
    do while(rc=0);
      found=0;
      do j=1 to 4;
/*find whether any one of value is included in the current obs*/
      key=h{j};rcc=_ha.check();
      if rcc =0 then found=1;
      end;
      if found then do;
/*if any one of value is included,then push the obs which is copied from
Hash Table HA into Hash Tables _HA,flag it the same cluster with the 
current obs and output it into dataset*/
                      do k=1 to 4;
                      if not missing(h{k}) then do; key=h{k};_ha.replace();end;
                      end;
                      output;x=0; _count=count;*keep this found obs's index; 
                    end;
      rc=hi.next();
/*remove the found obs from Hash Table HA,since it has been seared*/
      if found then rx=ha.remove(key : _count);
     end;
 end;
/*clear up all the index which is the same cluster with the current obs*/ 
 _ha.clear();
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
end; 
run;

Ksharp

R_Win · Posted 11-23-2011 07:17 AM

Thqs Ksharpthe code was working but it was taking time as i am having 8 crors of data can u help me to reduce the time

art297 · Posted 11-23-2011 08:35 AM

8 crors as in 80 million? Easiest, quickest, most reliable and cheapest solution would be to buy a faster computer.

R_Win · Posted 11-24-2011 04:59 AM

I have tryed ur code but the linkages are not getting correctly and i am running this on SERVER ..

Ksharp · Posted 11-24-2011 12:26 AM

Sorry. I cann't do it anymore. If you delete ( ordered : 'a' ) in the hash table , maybe speed would be faster a little.

Are you sure that obs of a cluster is randomly stored in dataset?

If you only consider the previous obs for current obs ,then Art's code is very fast.

R_Win · Posted 11-24-2011 02:19 AM

I have tryed art code but the linkages are not getting correctly and i am running this on SERVER ..

art297 · Posted 11-24-2011 09:24 AM

FWIW, I posted the following on SAS-L ( http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1109B&L=sas-l&D=1&H=0&O=D&T=1&P=10727 ) at the same time this discussion was going on in the forums last September. Most of the ideas are KSharps, although a number of others offered suggestions I I really can't recall who should be credited with what. Hopefully, you can use one of the methods to address your current problem:

data have;

infile cards truncover;

input (pan1 add1 pan2 pan3) (: $40.);

cards;

aaa bbb ccc ddd

qqq rrr www aaa

rrr ppp mmm lll

uuu zzz ffff ppp

p l m n

m aaa

a b c .

jjjj eee rrr ooo

. . . .

e f . .

a e . .

g h . .

i j k .

i j k a

sss www . .

. . . eee

;

run;

My macro/hash solution was:

%macro findthem;

proc datasets library=work nowarn;

delete recodes;

quit;

%let stopthis=0;

/*Assign initial household numbers*/

data for_recodes (keep=start hhold);

if _n_ eq 1 then do;

declare hash ha(hashexp: 16);

ha.definekey('key');

ha.definedata('hhold');

ha.definedone();

end;

set have;

array _house{*} $ 40 pan1--pan3;

do i=1 to dim(_house);

key=_house{i};

call missing(hhold);

rc=ha.find();

if rc=0 then do;

found=1;

leave;

end;

if not found then do;

n+1;

hhold=n;

end;

do j=1 to dim(_house);

if not missing(_house{j}) then do;

key=_house{j};

ha.replace();

start=key;

output;

end;

run;

%let recodes=;

%do %while (&stopthis eq 0);

/*Test for invalid assignments*/

proc sort data=for_recodes out=test nodupkey;

by start hhold;

run;

data test2 (drop=n);

set test end=eof;

by start;

if not(first.start and last.start) then do;

output;

n+1;

end;

if eof and n lt 1 then do;

call symput('stopthis',1);

end;

run;

%if &stopthis eq 0 %then %do;

proc append base=recodes data=test2 force;

run;

proc sql noprint;

select "if hhold eq "||

strip(put(hhold,best12.))||

" then hhold="||

strip(put(min(hhold),best12.))||

";"

into :recodes

separated by " "

from recodes

group by start

;

quit;

/*Correct identified discrepancies*/

data for_recodes (keep=start hhold);

if _n_ eq 1 then do;

declare hash ha(hashexp: 16);

ha.definekey('key');

ha.definedata('hhold');

ha.definedone();

end;

set have;

array _house{*} $ 40 pan1--pan3;

do i=1 to dim(_house);

key=_house{i};

call missing(hhold);

rc=ha.find();

if rc=0 then do;

found=1;

&recodes.

leave;

end;

if not found then do;

n+1;

hhold=n;

&recodes.

end;

do j=1 to dim(_house);

if not missing(_house{j}) then do;

key=_house{j};

ha.replace();

start=key;

output;

end;

run;

%end;

/*Get Final Dataset*/

data want (keep=pan1 pan2 pan3 add1 household);

if _n_ eq 1 then do;

declare hash ha(hashexp: 16);

ha.definekey('key');

ha.definedata('hhold');

ha.definedone();

end;

set have;

array _house{*} $ 40 pan1--pan3;

do i=1 to dim(_house);

key=_house{i};

call missing(hhold);

rc=ha.find();

if rc=0 then do;

found=1;

&recodes.

household=hhold;

leave;

end;

if not found then do;

n+1;

hhold=n;

&recodes.

household=hhold;

end;

do j=1 to dim(_house);

if not missing(_house{j}) then do;

key=_house{j};

ha.replace();

end;

run;

%mend findthem;

%findthem

Ksharp's pure hash solution was:

data want(keep=pan1 pan2 pan3 add1 household);
if _n_ =1 then do;
    declare hash ha(hashexp : 20,ordered: 'Y');
    declare hiter hi('ha');
    ha.definekey('count');
    ha.definedata('count','_pan1','_pan2','_pan3','_add1','searched');
    ha.definedone();
    do until(last);
     set have(rename=(pan1=_pan1 pan2=_pan2 pan3=_pan3 add1=_add1))
              end=last;
     count+1;
     searched=0;
     if not missing(_pan1) or not missing(_pan2) or
      not missing(_pan3) or not missing(_add1) then ha.add();
    end;

    declare hash _ha(hashexp: 20);
    _ ha.definekey('key');
    _ha.definedata('_household');
    _ha.definedone();
end;
set have ;
_count+1;
length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
array _h{4} $ 40 _pan1 _pan2 _pan3 _add1;
count=_count;searched=0;
r=ha.find();
if not searched then do;
    n+1;
    household=n;
    _household=n;
    do i=1 to 4;
      if not missing(h{i}) then do;
        key=h{i};
        _ha.replace();
      end;
    end;
    do until(x=1);
      x=1;
      rc=hi.first();
      do while(rc=0);
        if not searched then do;
          _found=0;
          do j=1 to 4;
            key=_h(j);
            _rc=_ha.check();
            if _rc =0 then _found=1;
          end;
          if _found then do;
            do k=1 to 4;
              if not missing(_h{k}) then do;
                key=_h(k);
                _ha.replace();
              end;
            end;
            searched=1;
            ha.replace();
            x=0;
          end;
        end;
        rc=hi.next();
      end;
    end;
end;
else do;
    do z=1 to 4;
      key=h{z};
      rr=_ha.find();
      if rr=0 then household=_household;
    end;
end;
run;

Ksharp · Posted 11-24-2011 09:58 PM

Hi. Art .Thanks to remind my code at SAS-L.

But that code is older version . I have recoded and promoted it again ,and posted here already.

That is the fastest method I can think, but for the sake of OP's large table and the table's data construction,

It is still slow to speed .

Ksharp