obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 4
5 104 c 106 107 108 109 4
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=2
obs=4 as a5 of obs1 and obs 4 of a1 having link
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=4
i want like this
Your question is very similar to one asked a few months ago. Take a look at: http://communities.sas.com/message/105224#105224
Your post is ambiguous.
What logic you want to apply to variable R_id.
If obs 4 has not linkage with obs 2, what value of R_id will be set for obs 4?
obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 1
5 104 c 106 107 108 109 1
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=3
obs=4 as a5 of obs1 and obs 4 of a1 having link r_id=1
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=1
sory i have corrected it
obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 1
5 104 c 106 107 108 109 1
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=3
obs=4 as a5 of obs1 and obs 4 of a1 having link r_id=1
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=1
Did you look at the post I suggested? It sure seems like the same problem. Try the following:
data have;
input (a1 a2 a3 a4 a5 a6) ($);
cards;
1 a 5 7 8 9
. . . . . .
11 b 12 13 14 15
9 c 100 101 102 103
104 c 106 107 108 109
;
data want (keep=a1-a6 r_id);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ a1-a6;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
r_id=hhold;
leave;
end;
end;
if not found then do;
n+1;
r_id=n;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
hhold=r_id;
ha.replace();
end;
end;
run;
Yes. I have already written it for sas_Form.
If your obs in a cluster are randomly stored in your dataset.
data test; input ( pan1 pan2 pan3 add1 ) ($); datalines; aaa bbb ccc ddd . . . . qqq rrr www aaa rrr ppp mmm lll uuu zzz ffff ppp p l m n . . . . . . . . jjjj eee rrr ooo sss www . . . . . eee ; run; options compress=yes; data want(keep=pan1 pan2 pan3 add1 household); /*to make speed faster*/ declare hash ha(hashexp : 20,ordered : 'a'); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('count','pan1','pan2','pan3','add1'); ha.definedone(); declare hash _ha(hashexp: 20,ordered : 'a'); _ha.definekey('key'); _ha.definedone(); do until(last); set test end=last; /*Remove obs which variable's are all missing firstly*/ if cmiss(pan1,pan2,pan3,add1) lt 4 then do; count+1; ha.add(); end; end; length key $ 40; array h{4} $ 40 pan1 pan2 pan3 add1; /*copy the first obs from Hash Table HA into PDV*/ _rc=hi.first(); do while(_rc eq 0); *until the end of Hash Table HA; /*assign a unique cluster flag(i.e. household)*/ household+1; do i=1 to 4; /*push not missing value of current obs into another Hash Table _HA*/ if not missing(h{i}) then do; key=h{i}; _ha.replace();end; end; /*start to run over Hash Table HA ,until can not find any more observation which is the same cluster with current observation*/ do until(x=1); x=1; /*copy the first obs from Hash Table HA into PDV*/ rc=hi.first(); do while(rc=0); found=0; do j=1 to 4; /*find whether any one of value is included in the current obs*/ key=h{j};rcc=_ha.check(); if rcc =0 then found=1; end; if found then do; /*if any one of value is included,then push the obs which is copied from Hash Table HA into Hash Tables _HA,flag it the same cluster with the current obs and output it into dataset*/ do k=1 to 4; if not missing(h{k}) then do; key=h{k};_ha.replace();end; end; output;x=0; _count=count;*keep this found obs's index; end; rc=hi.next(); /*remove the found obs from Hash Table HA,since it has been seared*/ if found then rx=ha.remove(key : _count); end; end; /*clear up all the index which is the same cluster with the current obs*/ _ha.clear(); /*copy the first obs from Hash Table HA into PDV*/ _rc=hi.first(); end; run;
Ksharp
Thqs Ksharpthe code was working but it was taking time as i am having 8 crors of data can u help me to reduce the time
8 crors as in 80 million? Easiest, quickest, most reliable and cheapest solution would be to buy a faster computer.
I have tryed ur code but the linkages are not getting correctly and i am running this on SERVER ..
Sorry. I cann't do it anymore. If you delete ( ordered : 'a' ) in the hash table , maybe speed would be faster a little.
Are you sure that obs of a cluster is randomly stored in dataset?
If you only consider the previous obs for current obs ,then Art's code is very fast.
I have tryed art code but the linkages are not getting correctly and i am running this on SERVER ..
FWIW, I posted the following on SAS-L ( http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1109B&L=sas-l&D=1&H=0&O=D&T=1&P=10727 ) at the same time this discussion was going on in the forums last September. Most of the ideas are KSharps, although a number of others offered suggestions I I really can't recall who should be credited with what. Hopefully, you can use one of the methods to address your current problem:
data have;
infile cards truncover;
input (pan1 add1 pan2 pan3) (: $40.);
cards;
aaa bbb ccc ddd
qqq rrr www aaa
rrr ppp mmm lll
uuu zzz ffff ppp
p l m n
m aaa
a b c .
jjjj eee rrr ooo
. . . .
e f . .
a e . .
g h . .
i j k .
i j k a
sss www . .
. . . eee
;
run;
My macro/hash solution was:
%macro findthem;
proc datasets library=work nowarn;
delete recodes;
quit;
%let stopthis=0;
/*Assign initial household numbers*/
data for_recodes (keep=start hhold);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
start=key;
output;
end;
end;
run;
%let recodes=;
%do %while (&stopthis eq 0);
/*Test for invalid assignments*/
proc sort data=for_recodes out=test nodupkey;
by start hhold;
run;
data test2 (drop=n);
set test end=eof;
by start;
if not(first.start and last.start) then do;
output;
n+1;
end;
if eof and n lt 1 then do;
call symput('stopthis',1);
end;
run;
%if &stopthis eq 0 %then %do;
proc append base=recodes data=test2 force;
run;
proc sql noprint;
select "if hhold eq "||
strip(put(hhold,best12.))||
" then hhold="||
strip(put(min(hhold),best12.))||
";"
into :recodes
separated by " "
from recodes
group by start
;
quit;
/*Correct identified discrepancies*/
data for_recodes (keep=start hhold);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
&recodes.
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
&recodes.
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
start=key;
output;
end;
end;
run;
%end;
%end;
/*Get Final Dataset*/
data want (keep=pan1 pan2 pan3 add1 household);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
&recodes.
household=hhold;
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
&recodes.
household=hhold;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
end;
end;
run;
%mend findthem;
%findthem
Ksharp's pure hash solution was:
data want(keep=pan1 pan2 pan3 add1 household);
if _n_ =1 then do;
declare hash ha(hashexp : 20,ordered: 'Y');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('count','_pan1','_pan2','_pan3','_add1','searched');
ha.definedone();
do until(last);
set have(rename=(pan1=_pan1 pan2=_pan2 pan3=_pan3 add1=_add1))
end=last;
count+1;
searched=0;
if not missing(_pan1) or not missing(_pan2) or
not missing(_pan3) or not missing(_add1) then ha.add();
end;
declare hash _ha(hashexp: 20);
_ ha.definekey('key');
_ha.definedata('_household');
_ha.definedone();
end;
set have ;
_count+1;
length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
array _h{4} $ 40 _pan1 _pan2 _pan3 _add1;
count=_count;searched=0;
r=ha.find();
if not searched then do;
n+1;
household=n;
_household=n;
do i=1 to 4;
if not missing(h{i}) then do;
key=h{i};
_ha.replace();
end;
end;
do until(x=1);
x=1;
rc=hi.first();
do while(rc=0);
if not searched then do;
_found=0;
do j=1 to 4;
key=_h(j);
_rc=_ha.check();
if _rc =0 then _found=1;
end;
if _found then do;
do k=1 to 4;
if not missing(_h{k}) then do;
key=_h(k);
_ha.replace();
end;
end;
searched=1;
ha.replace();
x=0;
end;
end;
rc=hi.next();
end;
end;
end;
else do;
do z=1 to 4;
key=h{z};
rr=_ha.find();
if rr=0 then household=_household;
end;
end;
run;
Hi. Art .Thanks to remind my code at SAS-L.
But that code is older version . I have recoded and promoted it again ,and posted here already.
That is the fastest method I can think, but for the sake of OP's large table and the table's data construction,
It is still slow to speed .
Ksharp
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.