obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 4
5 104 c 106 107 108 109 4
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=2
obs=4 as a5 of obs1 and obs 4 of a1 having link
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=4
i want like this
Your question is very similar to one asked a few months ago. Take a look at: http://communities.sas.com/message/105224#105224
Your post is ambiguous.
What logic you want to apply to variable R_id.
If obs 4 has not linkage with obs 2, what value of R_id will be set for obs 4?
obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 1
5 104 c 106 107 108 109 1
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=3
obs=4 as a5 of obs1 and obs 4 of a1 having link r_id=1
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=1
sory i have corrected it
obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 1
5 104 c 106 107 108 109 1
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=3
obs=4 as a5 of obs1 and obs 4 of a1 having link r_id=1
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=1
Did you look at the post I suggested? It sure seems like the same problem. Try the following:
data have;
input (a1 a2 a3 a4 a5 a6) ($);
cards;
1 a 5 7 8 9
. . . . . .
11 b 12 13 14 15
9 c 100 101 102 103
104 c 106 107 108 109
;
data want (keep=a1-a6 r_id);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ a1-a6;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
r_id=hhold;
leave;
end;
end;
if not found then do;
n+1;
r_id=n;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
hhold=r_id;
ha.replace();
end;
end;
run;
Yes. I have already written it for sas_Form.
If your obs in a cluster are randomly stored in your dataset.
data test; input ( pan1 pan2 pan3 add1 ) ($); datalines; aaa bbb ccc ddd . . . . qqq rrr www aaa rrr ppp mmm lll uuu zzz ffff ppp p l m n . . . . . . . . jjjj eee rrr ooo sss www . . . . . eee ; run; options compress=yes; data want(keep=pan1 pan2 pan3 add1 household); /*to make speed faster*/ declare hash ha(hashexp : 20,ordered : 'a'); declare hiter hi('ha'); ha.definekey('count'); ha.definedata('count','pan1','pan2','pan3','add1'); ha.definedone(); declare hash _ha(hashexp: 20,ordered : 'a'); _ha.definekey('key'); _ha.definedone(); do until(last); set test end=last; /*Remove obs which variable's are all missing firstly*/ if cmiss(pan1,pan2,pan3,add1) lt 4 then do; count+1; ha.add(); end; end; length key $ 40; array h{4} $ 40 pan1 pan2 pan3 add1; /*copy the first obs from Hash Table HA into PDV*/ _rc=hi.first(); do while(_rc eq 0); *until the end of Hash Table HA; /*assign a unique cluster flag(i.e. household)*/ household+1; do i=1 to 4; /*push not missing value of current obs into another Hash Table _HA*/ if not missing(h{i}) then do; key=h{i}; _ha.replace();end; end; /*start to run over Hash Table HA ,until can not find any more observation which is the same cluster with current observation*/ do until(x=1); x=1; /*copy the first obs from Hash Table HA into PDV*/ rc=hi.first(); do while(rc=0); found=0; do j=1 to 4; /*find whether any one of value is included in the current obs*/ key=h{j};rcc=_ha.check(); if rcc =0 then found=1; end; if found then do; /*if any one of value is included,then push the obs which is copied from Hash Table HA into Hash Tables _HA,flag it the same cluster with the current obs and output it into dataset*/ do k=1 to 4; if not missing(h{k}) then do; key=h{k};_ha.replace();end; end; output;x=0; _count=count;*keep this found obs's index; end; rc=hi.next(); /*remove the found obs from Hash Table HA,since it has been seared*/ if found then rx=ha.remove(key : _count); end; end; /*clear up all the index which is the same cluster with the current obs*/ _ha.clear(); /*copy the first obs from Hash Table HA into PDV*/ _rc=hi.first(); end; run;
Ksharp
Thqs Ksharpthe code was working but it was taking time as i am having 8 crors of data can u help me to reduce the time
8 crors as in 80 million? Easiest, quickest, most reliable and cheapest solution would be to buy a faster computer.
I have tryed ur code but the linkages are not getting correctly and i am running this on SERVER ..
Sorry. I cann't do it anymore. If you delete ( ordered : 'a' ) in the hash table , maybe speed would be faster a little.
Are you sure that obs of a cluster is randomly stored in dataset?
If you only consider the previous obs for current obs ,then Art's code is very fast.
I have tryed art code but the linkages are not getting correctly and i am running this on SERVER ..
FWIW, I posted the following on SAS-L ( http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1109B&L=sas-l&D=1&H=0&O=D&T=1&P=10727 ) at the same time this discussion was going on in the forums last September. Most of the ideas are KSharps, although a number of others offered suggestions I I really can't recall who should be credited with what. Hopefully, you can use one of the methods to address your current problem:
data have;
infile cards truncover;
input (pan1 add1 pan2 pan3) (: $40.);
cards;
aaa bbb ccc ddd
qqq rrr www aaa
rrr ppp mmm lll
uuu zzz ffff ppp
p l m n
m aaa
a b c .
jjjj eee rrr ooo
. . . .
e f . .
a e . .
g h . .
i j k .
i j k a
sss www . .
. . . eee
;
run;
My macro/hash solution was:
%macro findthem;
proc datasets library=work nowarn;
delete recodes;
quit;
%let stopthis=0;
/*Assign initial household numbers*/
data for_recodes (keep=start hhold);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
start=key;
output;
end;
end;
run;
%let recodes=;
%do %while (&stopthis eq 0);
/*Test for invalid assignments*/
proc sort data=for_recodes out=test nodupkey;
by start hhold;
run;
data test2 (drop=n);
set test end=eof;
by start;
if not(first.start and last.start) then do;
output;
n+1;
end;
if eof and n lt 1 then do;
call symput('stopthis',1);
end;
run;
%if &stopthis eq 0 %then %do;
proc append base=recodes data=test2 force;
run;
proc sql noprint;
select "if hhold eq "||
strip(put(hhold,best12.))||
" then hhold="||
strip(put(min(hhold),best12.))||
";"
into :recodes
separated by " "
from recodes
group by start
;
quit;
/*Correct identified discrepancies*/
data for_recodes (keep=start hhold);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
&recodes.
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
&recodes.
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
start=key;
output;
end;
end;
run;
%end;
%end;
/*Get Final Dataset*/
data want (keep=pan1 pan2 pan3 add1 household);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
&recodes.
household=hhold;
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
&recodes.
household=hhold;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
end;
end;
run;
%mend findthem;
%findthem
Ksharp's pure hash solution was:
data want(keep=pan1 pan2 pan3 add1 household);
if _n_ =1 then do;
declare hash ha(hashexp : 20,ordered: 'Y');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('count','_pan1','_pan2','_pan3','_add1','searched');
ha.definedone();
do until(last);
set have(rename=(pan1=_pan1 pan2=_pan2 pan3=_pan3 add1=_add1))
end=last;
count+1;
searched=0;
if not missing(_pan1) or not missing(_pan2) or
not missing(_pan3) or not missing(_add1) then ha.add();
end;
declare hash _ha(hashexp: 20);
_ ha.definekey('key');
_ha.definedata('_household');
_ha.definedone();
end;
set have ;
_count+1;
length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
array _h{4} $ 40 _pan1 _pan2 _pan3 _add1;
count=_count;searched=0;
r=ha.find();
if not searched then do;
n+1;
household=n;
_household=n;
do i=1 to 4;
if not missing(h{i}) then do;
key=h{i};
_ha.replace();
end;
end;
do until(x=1);
x=1;
rc=hi.first();
do while(rc=0);
if not searched then do;
_found=0;
do j=1 to 4;
key=_h(j);
_rc=_ha.check();
if _rc =0 then _found=1;
end;
if _found then do;
do k=1 to 4;
if not missing(_h{k}) then do;
key=_h(k);
_ha.replace();
end;
end;
searched=1;
ha.replace();
x=0;
end;
end;
rc=hi.next();
end;
end;
end;
else do;
do z=1 to 4;
key=h{z};
rr=_ha.find();
if rr=0 then household=_household;
end;
end;
run;
Hi. Art .Thanks to remind my code at SAS-L.
But that code is older version . I have recoded and promoted it again ,and posted here already.
That is the fastest method I can think, but for the sake of OP's large table and the table's data construction,
It is still slow to speed .
Ksharp
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.