obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 4
5 104 c 106 107 108 109 4
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=2
obs=4 as a5 of obs1 and obs 4 of a1 having link
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=4
i want like this
Your question is very similar to one asked a few months ago. Take a look at: http://communities.sas.com/message/105224#105224
Your post is ambiguous.
What logic you want to apply to variable R_id.
If obs 4 has not linkage with obs 2, what value of R_id will be set for obs 4?
obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 1
5 104 c 106 107 108 109 1
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=3
obs=4 as a5 of obs1 and obs 4 of a1 having link r_id=1
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=1
sory i have corrected it
obs a1 a2 a3 a4 a5 a6 R_id
1 1 a 5 7 8 9 1
2 . . . . . . 2
3 11 b 12 13 14 15 3
4 9 c 100 101 102 103 1
5 104 c 106 107 108 109 1
obs=1 all are unique so r_id=1
obs=2 all are unique so r_id=2
obs=3 all are unique so r_id=3
obs=4 as a5 of obs1 and obs 4 of a1 having link r_id=1
obs=5 as a2 of obs4 and obs 5 having linkage so r_id=1
Did you look at the post I suggested? It sure seems like the same problem. Try the following:
data have;
input (a1 a2 a3 a4 a5 a6) ($);
cards;
1 a 5 7 8 9
. . . . . .
11 b 12 13 14 15
9 c 100 101 102 103
104 c 106 107 108 109
;
data want (keep=a1-a6 r_id);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ a1-a6;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
r_id=hhold;
leave;
end;
end;
if not found then do;
n+1;
r_id=n;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
hhold=r_id;
ha.replace();
end;
end;
run;
Yes. I have already written it for sas_Form.
If your obs in a cluster are randomly stored in your dataset.
data test;
input ( pan1 pan2 pan3 add1 ) ($);
datalines;
aaa bbb ccc ddd
. . . .
qqq rrr www aaa
rrr ppp mmm lll
uuu zzz ffff ppp
p l m n
. . . .
. . . .
jjjj eee rrr ooo
sss www . .
. . . eee
;
run;
options compress=yes;
data want(keep=pan1 pan2 pan3 add1 household);
/*to make speed faster*/
declare hash ha(hashexp : 20,ordered : 'a');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('count','pan1','pan2','pan3','add1');
ha.definedone();
declare hash _ha(hashexp: 20,ordered : 'a');
_ha.definekey('key');
_ha.definedone();
do until(last);
set test end=last;
/*Remove obs which variable's are all missing firstly*/
if cmiss(pan1,pan2,pan3,add1) lt 4 then do;
count+1;
ha.add();
end;
end;
length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
do while(_rc eq 0); *until the end of Hash Table HA;
/*assign a unique cluster flag(i.e. household)*/
household+1;
do i=1 to 4;
/*push not missing value of current obs into another Hash Table _HA*/
if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
end;
/*start to run over Hash Table HA ,until can not find any more
observation which is the same cluster with current observation*/
do until(x=1);
x=1;
/*copy the first obs from Hash Table HA into PDV*/
rc=hi.first();
do while(rc=0);
found=0;
do j=1 to 4;
/*find whether any one of value is included in the current obs*/
key=h{j};rcc=_ha.check();
if rcc =0 then found=1;
end;
if found then do;
/*if any one of value is included,then push the obs which is copied from
Hash Table HA into Hash Tables _HA,flag it the same cluster with the
current obs and output it into dataset*/
do k=1 to 4;
if not missing(h{k}) then do; key=h{k};_ha.replace();end;
end;
output;x=0; _count=count;*keep this found obs's index;
end;
rc=hi.next();
/*remove the found obs from Hash Table HA,since it has been seared*/
if found then rx=ha.remove(key : _count);
end;
end;
/*clear up all the index which is the same cluster with the current obs*/
_ha.clear();
/*copy the first obs from Hash Table HA into PDV*/
_rc=hi.first();
end;
run;
Ksharp
Thqs Ksharpthe code was working but it was taking time as i am having 8 crors of data can u help me to reduce the time
8 crors as in 80 million? Easiest, quickest, most reliable and cheapest solution would be to buy a faster computer.
I have tryed ur code but the linkages are not getting correctly and i am running this on SERVER ..
Sorry. I cann't do it anymore. If you delete ( ordered : 'a' ) in the hash table , maybe speed would be faster a little.
Are you sure that obs of a cluster is randomly stored in dataset?
If you only consider the previous obs for current obs ,then Art's code is very fast.
I have tryed art code but the linkages are not getting correctly and i am running this on SERVER ..
FWIW, I posted the following on SAS-L ( http://www.listserv.uga.edu/cgi-bin/wa?A2=ind1109B&L=sas-l&D=1&H=0&O=D&T=1&P=10727 ) at the same time this discussion was going on in the forums last September. Most of the ideas are KSharps, although a number of others offered suggestions I I really can't recall who should be credited with what. Hopefully, you can use one of the methods to address your current problem:
data have;
infile cards truncover;
input (pan1 add1 pan2 pan3) (: $40.);
cards;
aaa bbb ccc ddd
qqq rrr www aaa
rrr ppp mmm lll
uuu zzz ffff ppp
p l m n
m aaa
a b c .
jjjj eee rrr ooo
. . . .
e f . .
a e . .
g h . .
i j k .
i j k a
sss www . .
. . . eee
;
run;
My macro/hash solution was:
%macro findthem;
proc datasets library=work nowarn;
delete recodes;
quit;
%let stopthis=0;
/*Assign initial household numbers*/
data for_recodes (keep=start hhold);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
start=key;
output;
end;
end;
run;
%let recodes=;
%do %while (&stopthis eq 0);
/*Test for invalid assignments*/
proc sort data=for_recodes out=test nodupkey;
by start hhold;
run;
data test2 (drop=n);
set test end=eof;
by start;
if not(first.start and last.start) then do;
output;
n+1;
end;
if eof and n lt 1 then do;
call symput('stopthis',1);
end;
run;
%if &stopthis eq 0 %then %do;
proc append base=recodes data=test2 force;
run;
proc sql noprint;
select "if hhold eq "||
strip(put(hhold,best12.))||
" then hhold="||
strip(put(min(hhold),best12.))||
";"
into :recodes
separated by " "
from recodes
group by start
;
quit;
/*Correct identified discrepancies*/
data for_recodes (keep=start hhold);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
&recodes.
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
&recodes.
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
start=key;
output;
end;
end;
run;
%end;
%end;
/*Get Final Dataset*/
data want (keep=pan1 pan2 pan3 add1 household);
if _n_ eq 1 then do;
declare hash ha(hashexp: 16);
ha.definekey('key');
ha.definedata('hhold');
ha.definedone();
end;
set have;
array _house{*} $ 40 pan1--pan3;
do i=1 to dim(_house);
key=_house{i};
call missing(hhold);
rc=ha.find();
if rc=0 then do;
found=1;
&recodes.
household=hhold;
leave;
end;
end;
if not found then do;
n+1;
hhold=n;
&recodes.
household=hhold;
end;
do j=1 to dim(_house);
if not missing(_house{j}) then do;
key=_house{j};
ha.replace();
end;
end;
run;
%mend findthem;
%findthem
Ksharp's pure hash solution was:
data want(keep=pan1 pan2 pan3 add1 household);
if _n_ =1 then do;
declare hash ha(hashexp : 20,ordered: 'Y');
declare hiter hi('ha');
ha.definekey('count');
ha.definedata('count','_pan1','_pan2','_pan3','_add1','searched');
ha.definedone();
do until(last);
set have(rename=(pan1=_pan1 pan2=_pan2 pan3=_pan3 add1=_add1))
end=last;
count+1;
searched=0;
if not missing(_pan1) or not missing(_pan2) or
not missing(_pan3) or not missing(_add1) then ha.add();
end;
declare hash _ha(hashexp: 20);
_ ha.definekey('key');
_ha.definedata('_household');
_ha.definedone();
end;
set have ;
_count+1;
length key $ 40;
array h{4} $ 40 pan1 pan2 pan3 add1;
array _h{4} $ 40 _pan1 _pan2 _pan3 _add1;
count=_count;searched=0;
r=ha.find();
if not searched then do;
n+1;
household=n;
_household=n;
do i=1 to 4;
if not missing(h{i}) then do;
key=h{i};
_ha.replace();
end;
end;
do until(x=1);
x=1;
rc=hi.first();
do while(rc=0);
if not searched then do;
_found=0;
do j=1 to 4;
key=_h(j);
_rc=_ha.check();
if _rc =0 then _found=1;
end;
if _found then do;
do k=1 to 4;
if not missing(_h{k}) then do;
key=_h(k);
_ha.replace();
end;
end;
searched=1;
ha.replace();
x=0;
end;
end;
rc=hi.next();
end;
end;
end;
else do;
do z=1 to 4;
key=h{z};
rr=_ha.find();
if rr=0 then household=_household;
end;
end;
run;
Hi. Art .Thanks to remind my code at SAS-L.
But that code is older version . I have recoded and promoted it again ,and posted here already.
That is the fastest method I can think, but for the sake of OP's large table and the table's data construction,
It is still slow to speed .
Ksharp
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.