BookmarkSubscribeRSS Feed
Ksharp
Super User

OK. I take the suggest of DLing . Use do until() to judge whether no more obs be included into a group.

think it is faster a lot. Thank you . DLing

data test;
input  ( pan1 pan2 pan3 add1  ) ($);
datalines;
aaa   bbb    ccc     ddd    
qqq   rrr       www   aaa   
rrr     ppp    mmm lll       
uuu   zzz    ffff      ppp     
p       l        m          n      
jjjj    eee     rrr       ooo    
.    .   .   .             
sss   www  .    .             
 .         .        .        eee       
;
run;
data want(keep=pan1 pan2 pan3 add1 household);
if _n_ =1 then do;
 declare hash ha(hashexp : 20,ordered: 'Y');
 declare hiter hi('ha');
  ha.definekey('count');
  ha.definedata('count','_pan1','_pan2','_pan3','_add1','searched');
  ha.definedone();
do until(last);
 set test(rename=(pan1=_pan1 pan2=_pan2 pan3=_pan3 add1=_add1)) end=last;
 count+1;searched=0;
 if not missing(_pan1) or not missing(_pan2) or not missing(_pan3) or not missing(_add1) then ha.add();
end;

 declare hash _ha(hashexp: 20);
  _ha.definekey('key');
  _ha.definedata('_household');
  _ha.definedone();
end;
   set test ;_count+1;
   length key $ 40;
   array h{4} $ 40 pan1 pan2 pan3 add1;
   array _h{4} $ 40 _pan1 _pan2 _pan3 _add1;

count=_count;searched=0; r=ha.find();
if not searched then do;
                     n+1; household=n;_household=n;
                      do i=1 to 4;
                       if not missing(h{i}) then do; key=h{i}; _ha.replace();end;
                      end;                 
                    
                  do until(x=1);
                         x=1;
                     rc=hi.first();
                     do while(rc=0);
                            if not searched then do;
                         _found=0;
                         do j=1 to 4;
                           key=_h(j);_rc=_ha.check();
                           if _rc =0 then _found=1;
                         end;
                               if _found then do;
                                                do k=1 to 4;
                                            if not missing(_h{k}) then do; key=_h(k);_ha.replace();end;
                                                end;
                                                searched=1;ha.replace();x=0;
                                              end;
                                                   end;
                          rc=hi.next();
                       end;
                   end;
end;
else do;
       do z=1 to 4;
       key=h{z};rr=_ha.find();
       if rr=0 then household=_household;
        end;
     end;
run;

Ksharp

sas_Forum
Calcite | Level 5

Thqs Ksharp for your Help again i have ran the code on 40000 but it was running for 8-10 mins and  i stopped it...

Ksharp
Super User

Sorry. I can not do it any more, can not optimize it any more.

For the sake of your large table and complicated recursive , I am afraid that you need some more time to run

my code. except you have a super computer.

BTW, I tested my code for one hundred thousand obs at my old pc, it spends only four minutes.

Good Luck.

Bye!

Ksharp

sas_Forum
Calcite | Level 5

Ok Thqs Ksharp for your responce every time i am really thank ful to you for your help..Thqs

art297
Opal | Level 21

I'm still interested in trying a couple of additional approaches before giving up, but don't want to spend the time on data that likely doesn't reflect the true situation.  Is it possible for you to attach a file that closely approximates your "real" data both in size, complexity, etc.?

sas_Forum
Calcite | Level 5

HI Art thqs for your approach as i can not give the data it was 8crs i have tryed with 8lacs of data by your code it worked finely for pan1 while coming to pan2 ,pan3 and add1 there are some pans are  moving in to differnt household's as oen pan should be only in one household id..i am aslo still stressed out how to do it..regaridng the data also i canot send the 8 lacs data also...if any approach on this please let me know.. and the data above pasted with some records (app 100 Records) is the real data..

data temp;

infile datalines dsd dlm='|' truncover;

input (pan1 pan2 pan3 add1) (: $40.);

datalines;

M5669K|4HH5BFWBBMBY48Y7P8S||

I2175Q|P5P8YBBMBY48Y7P8S0005D||

S1495P|8BWP_Y8V3MYVWBBMBBMB||

G4909P|WN8Y4YFN48FYP884P83Y4||

S0560L|488Y8N8N8H-K56H4YP8P||

F0994B|R8YPF8Y4P-0BWP8BBMBY||

H6408L|HICPM4BBPWFP4WV4MP8YBB||

M2091M|48YJ23N3P8VWBBMBY48Y7||

S2426K|5V4PZHKB8YPNPVWPVBBMBY||

D7907L|SBMBMVYYBBM8PBY48Y7P8S0|H6408L|

G9847L|XB3845-WM43YWBBMBY48Y7||

D3871H|D3PC34Z0-38W3YBBMBY48Y||

T0753D|2P8Y38Y84W3N4NPJ2R8P8Y||

P5328F|Z06V4B8YFYFYMY8FMYVWBB||

D3938K|KSJ2VBYWBY4P8YBBMBY48Y||

A3529D|K0K4F4BY8BBXVBBMBY48Y8||

P8617A|I0S3YPV4P8BYVWMYVWBBM4||H6408L

R1129A|Z0KMV4YWM483NBBMB8Y7P8S||

R|KK0ZJ2PYJ2NW8K5SH8BV84||

S7807B|YPFBYBP4WNWN88YBMBBMBY||

N8681Q|KHH4B8MMYNW3Y8YBBMBY48||

P8632Q|H550HFMBM4YPFYFPBBMBY4||

K3437H|DZ6MYVNN3YWMBY48Y7P8S0||

G5510C|ZH@YPBY84BB$$$$$$$$$||

R2518M|HZB8NV8MWXYBBY48Y7P8S00||

M1014A|B8XY5IZH_38VWBBMBBMBY4||

M0049P|KZYJ2C8Y3BNCVY4VP8YBBM||

PD4296R|K0KVPJ23YV43PFY3P8VWBB||

L5351M|KF47PP8NPCN4VWNWBBMS00||

D7744C|56PW3WF3W8P3P8VWBBMBY4||

U1088L|#8B8WP8FPMP3YWBBMS0005||

U1088L|#8B8WP8FPMP3YWBBMBY48Y||

U0564D|#8B8WP8FPMP3YWBBMBY48Y||

B0102C|@P7P8W4@PZIYBWDSZ4M_W||

U1088L|#8B8WP8FPMPWBBMBY48Y||

U0564D|#8B8WP8FPMPYWBBMBY48Y||

S9556K|@P7P8W46CN8SZ_WGP48PBB||

S9556K|@P7P8W46CN8SZ_WGP48PBB||

B1476E|S05YCGY4N4CYHZKPYBPBBMS||

C2940N|56PW3WF3W8P3P8VWBBMBY4||

U0564D|#8B8WP8FPMP3YWBBMBY48Y||

R0090B|_XN4S2YJ2F8MBY48Y7P8S0||

N8784Q|S0K4PY4WD65YWBBMBY48Y7||

G8553G|KZBP8SFYFPNYBBMBY48Y7P||

S5904H|60KZZ2Y3NCP5J2Y3N3P8VW||

S3482C|Z05Y4YJ24VVYWN884YBBMB||

U1088L|#8B8WP8FPMPYWBBMBY48Y||

K3031C|ZH0HF8YPFY84Y83NBBMBY||

R0090B|_XN4S2YJ2F8MBY48Y7P8S0||

P4466N|IHY48CH84YBBY48Y7P8S00||

J2314G|5WXBM8Y3YWBBBMBY48Y7P8||

R0090B|_XN4SJ2YJ2F8MBY48Y7P8S0||

TI7823D|78MH-ZZ3M8GYBBMBBMBY48||

B0509R|50Z88B5V48YN4FYFPBBMB||

M0682N|B4YGWP443YPY3WMBBMBY48||

K0725D|ZH0HF8YPFY8Y4Y83NBBMBY||

P9185M|Z-8WX6SW8BPBMBY48Y7P8S||

V2636H|Z05WWMFJ2YC4P4WWMFBBMBY||

K9954E|MJ2FPFYBCBBY48Y7P8$$$$||

PS0210M|S0-SBBW88YW8BBMBY48Y7P||

PS5842M|38FW-WXBV4FWX_BNWXBBMB||

M5480A|50Z44YWVWNWMBY48Y7P8S0||

V2825L|SCF84C@WBWBYBYWP8YBBMB||

J9090J|BM83Y8P4NN6V883NBBMBY4||

G9332F|Z0SZ4W8HV8FPBBMBY48Y7P||

M2017K|K6KDNPJ28ZK4YBBMBY48Y7||

M6901D|I-NPX488V3YPBBMBY48Y7P8||

S0689N|Z0ZZP2YF367YVWBBMBBMBY|M5669K|

M4134E|DNB4FHZIHZ43Y4BBMBY48Y||

G7604R|5-Y8PRPMWPF4YBBMBY48Y7||

S3468A|HCFP48FWWV88BBMBY48Y7P||

S0796N|K0SP3KPVFYWBWP8BBMBY48||

D7781Q|H0DN48484P8883NBBMBY48|S2395K|

B4607P|HBPFWIS3N3YBMBY48Y7P8S||

C4398Q|K0HKCYBX8YMKMYVWBBMBY4||

J9787E|SNMZD4PB8PFBBMBY48Y7P8|D7781Q|

D1187E|BYP8BWBY4FPP8YBBMBBMBY||S2395K

K2110J|HKS6FYNWFYHMWBBMBY48Y7P||

M5116F|S5SIB8BP85GBBMBY48Y7P8||

M3134J|6HWJ2P78NVBBY48Y7P8S00||D7781Q

P1590P|S0Z4PYVY84PBBMBY48Y7P8|M3134J|

S9598G|KZWXB48VYBPBBMBY48Y7P8||

K9168B|-HJ2M88MJ2WMBY48Y7P8S0|H3629M|

S2395K|ZSZ48PF#NFVBW8BBMBBMBY||

M2140L|48VP33W8BNB8BW8BBMBBMB||H3629M

B4952H|YJ2PB3CN8MYBBMBY48Y7P8|S9598G|S2395K

S5018D|KZ0HVMWM4Y3BBMBY48Y7P8||

K5943E|Z6PP8VPM83N8BPMV8BBMBY||

N6677M|ZKYCSVW8BF8VWBBMBY48Y7|M5669K|

HJ7138K|MV4YK08W38BBMBY48Y7P8S||

  | | |

R9004J|K0KCNP7P85SBBMBY48Y7P||

F0109Q|Z8W8SSVY4VYNM8VBBMBBMB|R9004J|M2724Q

M2724Q|ZZZ54RFP8GYBMBBMBY48Y7||

D9109E|YVYB4W46SWYBMBBMBY48Y7P||

M9585E|KBGWY3YB2WBMBY48Y7P8S00||

  | | |

H3629M|-_P8VY-0WP8BMBY48Y7P8S0||

T2426B|50ZYC8NHI8YBCVCFYP8BBM|M9295K|

M9295K|Z5ZMPFV8M8BMBBMBY48Y7||

P3089F|K6KDNPJ28ZK4YBBMBY48Y7||

PO2499R|-K067P8KPYBMBY48Y7P8S0||

L3447Q|33PV4V5I633F4YBBMBY48Y|M5669K|

1088L|||

;

run;

art297
Opal | Level 21

What I'm wondering is how often you confront cases where, using the initial method, households are divided because of where they appear in your data.

A brute force approach doesn't seem realistic unless it is absolutely necessary.  I'm thinking along the lines of extracting the index to a format within the same datastep as it is initially created, and then analyzing the crosstabulation of the assigned households looking for crossed assignments.

It shouldn't make any difference if they exist in pan1, 2 or 3, or add1 but, rather, the order they appear in the file.

given:

pan1 pan2 pan3 add1

1       2      3

4       5      6        7

4       1      2        3

the original approach would misclassify the 2nd and 3rd records, but the code could be designed to identify such instances where a multiple classification was made.  The point I'm wondering about is whether one has to subject the data to 80,000 complete iterations or could be better address by doing a simple analysis after the initial classification attempt is made.

DLing
Obsidian | Level 7

The issue is that, with any sort of sequential processing like SAS data step, the linkage info can show up much later in the file, which requires (somehow) going back to all previously assigned id, and correct all of them!  From the corrections, more corrections may be necessary, so there is this cascading ripple effect, hence recursion or loop until.  This is the crux of the thorny problem.

My code for the labor day challenge was built for householding where the fields are like "name", "address", "phone", "account number".  Any records that share the same value in the same column must be put into a group.  Structurally this is different from the OP where it's the same value in any column, hence my not posting here.  They are closely related though.  WIth some modifications, it should work.

Also note that whether it's 2 fields, 4 fields, or n fields, that's not really material, the concept is the same.  And if you think about it, the OP really has only 1 field because it is "same value in any column"!  The records are really statements of arcs or linkages.

For my case, wIth the loop until no more inconsistency is found approach on 8 million customer with about 20 million records, it ran for 5-10 hours making around 15+ sweeps through the dataset.  The number of sweeps reflect the order and depth implied by the data.  This problem is interesting in that if you sort the input data stream so as to minimize linkages showing up later, then it runs fast, but if the sorting is opposite, then you experience horrendous run time.

Theoretical worst case, we can think of this as a binary tree.  We put the bottom nodes first in the dataset:

(1 2) (3 4) (1 3).

Now we can replicate this:

(1 2) (3 4) (2 3)    (5 6) (7 😎 (5 7)    (1 5)    <this records connects the two subtrees and causes one branch to need to be fully relabeled>

and again:

(9 10) (11 12) (9 11)     (12 13) (14 15) (12 14)      (9 12)        (1 9)

and again, and again, to any depth we wish.  Always the last record binds two large trees, only much bigger and deeper.  And the relabeling can only begin from the top of the tree and need to cascade all the way to the bottom.  If we were using a language that supported a) building linked lists data structure in memory, b) recursive function call, this is easy, provided the machine has sufficient stack depth at run time to accomodate the recursion depth.  But this is not SAS data step at all, hence the pain.

We can write code that would generate this binary tree structure to any arbitrary depth  (hmm... should be fun :smileylaugh:)  and feed it to any proposed code to verify correctness.

art297
Opal | Level 21

I agree with Ksharp and DLing that the problem will likely require an iterative macro.  However, given your sample data, the duplicate issue was resolved with one iteration, thus there isn't enough data there to be able to test whether any macro we might build is actually sufficient.

The alternative, of course, is that you simply run the iterations and tests manually.  In the following code, I run the proposed hash solution, check for discrepacies, correct the discrepancies in a subsequent run of the proposed hash solution, and then test the result.

If you do end up incorporating the logic in a macro, it would be appreciated if you post your final code.

The code and data I used was:

data have;

  infile cards dsd dlm='|' truncover;

  input (pan1 add1 pan2 pan3) (: $40.);

  cards;

M5669K|4HH5BFWBBMBY48Y7P8S|P5P8YBBMBY48Y7P8S0005D|

I2175Q|P5P8YBBMBY48Y7P8S0005D||

S1495P|8BWP_Y8V3MYVWBBMBBMB||

G4909P|WN8Y4YFN48FYP884P83Y4||

|S1495P||

S0560L|488Y8N8N8H-K56H4YP8P||

F0994B|R8YPF8Y4P-0BWP8BBMBY||

H6408L|HICPM4BBPWFP4WV4MP8YBB||

M2091M|48YJ23N3P8VWBBMBY48Y7||

S2426K|5V4PZHKB8YPNPVWPVBBMBY||

D7907L|SBMBMVYYBBM8PBY48Y7P8S0|H6408L|

G9847L|XB3845-WM43YWBBMBY48Y7||

D3871H|D3PC34Z0-38W3YBBMBY48Y||

T0753D|2P8Y38Y84W3N4NPJ2R8P8Y||

P5328F|Z06V4B8YFYFYMY8FMYVWBB||

D3938K|KSJ2VBYWBY4P8YBBMBY48Y||

A3529D|K0K4F4BY8BBXVBBMBY48Y8||

P8617A|I0S3YPV4P8BYVWMYVWBBM4||H6408L

R1129A|Z0KMV4YWM483NBBMB8Y7P8S||

R|KK0ZJ2PYJ2NW8K5SH8BV84||

S7807B|YPFBYBP4WNWN88YBMBBMBY||

N8681Q|KHH4B8MMYNW3Y8YBBMBY48||

P8632Q|H550HFMBM4YPFYFPBBMBY4||

K3437H|DZ6MYVNN3YWMBY48Y7P8S0||

G5510C|ZH@YPBY84BB$$$$$$$$$||

R2518M|HZB8NV8MWXYBBY48Y7P8S00||

M1014A|B8XY5IZH_38VWBBMBBMBY4||

M0049P|KZYJ2C8Y3BNCVY4VP8YBBM||

PD4296R|K0KVPJ23YV43PFY3P8VWBB||

L5351M|KF47PP8NPCN4VWNWBBMS00||

D7744C|56PW3WF3W8P3P8VWBBMBY4||

U1088L|#8B8WP8FPMP3YWBBMS0005||

U1088L|#8B8WP8FPMP3YWBBMBY48Y||

U0564D|#8B8WP8FPMP3YWBBMBY48Y||

B0102C|@P7P8W4@PZIYBWDSZ4M_W||

U1088L|#8B8WP8FPMPWBBMBY48Y||

U0564D|#8B8WP8FPMPYWBBMBY48Y||

S9556K|@P7P8W46CN8SZ_WGP48PBB||

S9556K|@P7P8W46CN8SZ_WGP48PBB||

B1476E|S05YCGY4N4CYHZKPYBPBBMS||

C2940N|56PW3WF3W8P3P8VWBBMBY4||

U0564D|#8B8WP8FPMP3YWBBMBY48Y||

R0090B|_XN4S2YJ2F8MBY48Y7P8S0||

N8784Q|S0K4PY4WD65YWBBMBY48Y7||

G8553G|KZBP8SFYFPNYBBMBY48Y7P||

S5904H|60KZZ2Y3NCP5J2Y3N3P8VW||

S3482C|Z05Y4YJ24VVYWN884YBBMB||

U1088L|#8B8WP8FPMPYWBBMBY48Y||

K3031C|ZH0HF8YPFY84Y83NBBMBY||

R0090B|_XN4S2YJ2F8MBY48Y7P8S0||

P4466N|IHY48CH84YBBY48Y7P8S00||

J2314G|5WXBM8Y3YWBBBMBY48Y7P8||

R0090B|_XN4SJ2YJ2F8MBY48Y7P8S0||

TI7823D|78MH-ZZ3M8GYBBMBBMBY48||

B0509R|50Z88B5V48YN4FYFPBBMB||

M0682N|B4YGWP443YPY3WMBBMBY48||

K0725D|ZH0HF8YPFY8Y4Y83NBBMBY||

P9185M|Z-8WX6SW8BPBMBY48Y7P8S||

V2636H|Z05WWMFJ2YC4P4WWMFBBMBY||

K9954E|MJ2FPFYBCBBY48Y7P8$$$$||

PS0210M|S0-SBBW88YW8BBMBY48Y7P||

PS5842M|38FW-WXBV4FWX_BNWXBBMB||

M5480A|50Z44YWVWNWMBY48Y7P8S0||

V2825L|SCF84C@WBWBYBYWP8YBBMB||

J9090J|BM83Y8P4NN6V883NBBMBY4||

G9332F|Z0SZ4W8HV8FPBBMBY48Y7P||

M2017K|K6KDNPJ28ZK4YBBMBY48Y7||

M6901D|I-NPX488V3YPBBMBY48Y7P8||

S0689N|Z0ZZP2YF367YVWBBMBBMBY|M5669K|

M4134E|DNB4FHZIHZ43Y4BBMBY48Y||

G7604R|5-Y8PRPMWPF4YBBMBY48Y7||

S3468A|HCFP48FWWV88BBMBY48Y7P||

S0796N|K0SP3KPVFYWBWP8BBMBY48||

D7781Q|H0DN48484P8883NBBMBY48|S2395K|

B4607P|HBPFWIS3N3YBMBY48Y7P8S||

C4398Q|K0HKCYBX8YMKMYVWBBMBY4||

J9787E|SNMZD4PB8PFBBMBY48Y7P8|D7781Q|

D1187E|BYP8BWBY4FPP8YBBMBBMBY||S2395K

K2110J|HKS6FYNWFYHMWBBMBY48Y7P||

M5116F|S5SIB8BP85GBBMBY48Y7P8||

M3134J|6HWJ2P78NVBBY48Y7P8S00||D7781Q

P1590P|S0Z4PYVY84PBBMBY48Y7P8|M3134J|

S9598G|KZWXB48VYBPBBMBY48Y7P8||

K9168B|-HJ2M88MJ2WMBY48Y7P8S0|H3629M|

S2395K|ZSZ48PF#NFVBW8BBMBBMBY||

M2140L|48VP33W8BNB8BW8BBMBBMB||H3629M

B4952H|YJ2PB3CN8MYBBMBY48Y7P8|S9598G|S2395K

S5018D|KZ0HVMWM4Y3BBMBY48Y7P8||

K5943E|Z6PP8VPM83N8BPMV8BBMBY||

N6677M|ZKYCSVW8BF8VWBBMBY48Y7|M5669K|

HJ7138K|MV4YK08W38BBMBY48Y7P8S||

  | | |

R9004J|K0KCNP7P85SBBMBY48Y7P||

F0109Q|Z8W8SSVY4VYNM8VBBMBBMB|R9004J|M2724Q

M2724Q|ZZZ54RFP8GYBMBBMBY48Y7||

D9109E|YVYB4W46SWYBMBBMBY48Y7P||

M9585E|KBGWY3YB2WBMBY48Y7P8S00||

  | | |

H3629M|-_P8VY-0WP8BMBY48Y7P8S0||

T2426B|50ZYC8NHI8YBCVCFYP8BBM|M9295K|

M9295K|Z5ZMPFV8M8BMBBMBY48Y7||

P3089F|K6KDNPJ28ZK4YBBMBY48Y7||

PO2499R|-K067P8KPYBMBY48Y7P8S0||

L3447Q|33PV4V5I633F4YBBMBY48Y|M5669K|

1088L|||

;

run;

/*Assign initial household numbers*/

data for_format (keep=start stop label);

  if _n_ eq 1 then do;

    declare hash ha(hashexp: 16);

    ha.definekey('key');

    ha.definedata('hhold');

    ha.definedone();

  end;

  set have;

  array _house{*} $ 40 pan1--pan3;

  do i=1 to dim(_house);

    key=_house{i};

    call missing(hhold);

    rc=ha.find();

    if rc=0 then do;

      found=1;

      label=hhold;

      leave;

    end;

  end;

  if not found then do;

    n+1;

    label=n;

  end;

  do j=1 to dim(_house);

    if not missing(_house{j}) then do;

      key=_house{j};

      hhold=label;

      ha.replace();

      start=key;

      stop=key;

      output;

    end;

  end;

run;

/*Test for invalid assignments*/

proc sort data=for_format out=test nodupkey;

  by start label;

run;

data test2;

  set test;

  by start;

  if not(first.start and last.start);

run;

/*Correct identified discrepancies*/

data want (keep=pan1 pan2 pan3 add1 household);

  if _n_ eq 1 then do;

    declare hash ha(hashexp: 16);

    ha.definekey('key');

    ha.definedata('hhold');

    ha.definedone();

  end;

  set have;

  array _house{*} $ 40 pan1--pan3;

  do i=1 to dim(_house);

    key=_house{i};

    call missing(hhold);

    rc=ha.find();

    if rc=0 then do;

      found=1;

      IF HHOLD EQ 64 THEN HHOLD=59;

      household=hhold;

      leave;

    end;

  end;

  if not found then do;

    n+1;

    household=n;

  end;

  do j=1 to dim(_house);

    if not missing(_house{j}) then do;

      key=_house{j};

      IF HOUSEHOLD EQ 64 THEN HOUSEHOLD=59;

      hhold=household;

      ha.replace();

    end;

  end;

run;

/*Test for invalid assignments*/

data test (keep=start household);

  set want;

  array _house{*} $ 40 pan1--pan3;

  do i=1 to dim(_house);

    if _house(i) ne "" then do;

      start=_house(i);

      output;

    end;

  end;

run;

proc sort data=test nodupkey;

  by start household;

run;

data test2;

  set test;

  by start;

  if not(first.start and last.start);

run;

art297
Opal | Level 21

Please ignore this post.  I tried to post an additional response and couldn't.  I am just testing to see if we have reached some kind of limit on number of posts within a thread.

art297
Opal | Level 21

I would be interested in your testing your data on the following code.  It appears to end up with the same household assignments as Ksharp's last offering, other than the actual numbers that are assigned.  I skip a number if the number was incorrectly assigned during one of the early passes.

It is a macrotized version of the process I had described a couple of days ago:

data have;

  infile cards truncover;

  input (pan1 add1 pan2 pan3) (: $40.);

  cards;

aaa   bbb    ccc     ddd

qqq   rrr       www   aaa

rrr     ppp    mmm  lll

uuu   zzz    ffff      ppp

p       l        m          n

m      aaa

a  b  c  .

jjjj    eee     rrr       ooo

.    .   .   .

e  f  .  .

a  e  .  .

g  h  .  .

i  j  k  .

i  j  k  a

sss   www  .  .

.  .  . eee

;

run;

%macro findthem;

  proc datasets library=work nowarn;

    delete recodes;

  quit;

 

  %let stopthis=0;

  /*Assign initial household numbers*/

  data for_recodes (keep=start hhold);

    if _n_ eq 1 then do;

      declare hash ha(hashexp: 16);

      ha.definekey('key');

      ha.definedata('hhold');

      ha.definedone();

    end;

    set have;

    array _house{*} $ 40 pan1--pan3;

    do i=1 to dim(_house);

      key=_house{i};

      call missing(hhold);

      rc=ha.find();

      if rc=0 then do;

        found=1;

        leave;

      end;

    end;

    if not found then do;

      n+1;

      hhold=n;

    end;

    do j=1 to dim(_house);

      if not missing(_house{j}) then do;

        key=_house{j};

        ha.replace();

        start=key;

        output;

      end;

    end;

  run;

  %let recodes=;

  %do %while (&stopthis eq 0);

    /*Test for invalid assignments*/

    proc sort data=for_recodes out=test nodupkey;

      by start hhold;

    run;

 

    data test2 (drop=n);

      set test end=eof;

      by start;

      if not(first.start and last.start) then do;

        output;

        n+1;

      end;

      if eof and n lt 1 then do;

        call symput('stopthis',1);

      end;

    run;

 

    %if &stopthis eq 0 %then %do;

      proc append base=recodes data=test2 force;

      run;

 

            proc sql noprint;

        select "if hhold eq "||

           strip(put(hhold,best12.))||

           " then hhold="||

           strip(put(min(hhold),best12.))||

           ";"

            into :recodes

              separated by " "

                from recodes

                        group by start

        ;

      quit;

            /*Correct identified discrepancies*/

      data for_recodes (keep=start hhold);

        if _n_ eq 1 then do;

          declare hash ha(hashexp: 16);

          ha.definekey('key');

          ha.definedata('hhold');

          ha.definedone();

        end;

        set have;

        array _house{*} $ 40 pan1--pan3;

        do i=1 to dim(_house);

          key=_house{i};

          call missing(hhold);

          rc=ha.find();

          if rc=0 then do;

            found=1;

            &recodes.

            leave;

          end;

        end;

        if not found then do;

          n+1;

          hhold=n;

          &recodes.

        end;

        do j=1 to dim(_house);

          if not missing(_house{j}) then do;

            key=_house{j};

            ha.replace();

            start=key;

            output;

          end;

        end;

      run;

    %end;

  %end;

 

  /*Get Final Dataset*/

  data want (keep=pan1 pan2 pan3 add1 household);

    if _n_ eq 1 then do;

      declare hash ha(hashexp: 16);

      ha.definekey('key');

      ha.definedata('hhold');

      ha.definedone();

    end;

    set have;

    array _house{*} $ 40 pan1--pan3;

    do i=1 to dim(_house);

      key=_house{i};

      call missing(hhold);

      rc=ha.find();

      if rc=0 then do;

        found=1;

        &recodes.

        household=hhold;

        leave;

      end;

    end;

    if not found then do;

      n+1;

      hhold=n;

      &recodes.

      household=hhold;

    end;

    do j=1 to dim(_house);

      if not missing(_house{j}) then do;

        key=_house{j};

        ha.replace();

      end;

    end;

  run;

%mend findthem;

 

%findthem

sas_Forum
Calcite | Level 5

                                    Obs    pan1    add1    pan2    pan3    household

                                            1    aaa     bbb     ccc     ddd         1   
                                            2    qqq     rrr     www     aaa         1   
                                            3    rrr     ppp     mmm     lll         1   
                                            4    uuu     zzz     ffff    ppp         1   
                                            5    p       l       m       n           2   
                                            6    m       aaa                         2   
                                            7    a       b       c                   3   
                                            8    jjjj    eee     rrr     ooo         1   
                                            9                                        4   
                                           10    e       f                           5   
                                           11    a       e                           3   
                                           12    g       h                           6   
                                           13    i       j       k                   7   
                                           14    i       j       k       a           7   
                                           15    sss     www                         1   
                                           16                            eee         1   

Hi art i have checked the code for the output of want if we check  for obs 6 the add1 aaa is already in obs 1 in pan1 it should get hid as1 but it got as 2 it is wrong output..

art297
Opal | Level 21

I don't think that you ran the code that I posted or, if you did, did you look carefully at your log to see if your system was confronting some kind of error?

I just ran that data and got the following result which is totally different than what you just posted??????

                                    Obs    pan1    add1    pan2    pan3    household

                                            1    aaa     bbb     ccc     ddd        1   
                                            2    qqq     rrr     www     aaa         1   
                                            3    rrr     ppp     mmm     lll          1   
                                            4    uuu     zzz     ffff    ppp           1   
                                            5    p       l       m       n                1
                                            6    m       aaa                             1   
                                            7    a       b       c                         3   
                                            8    jjjj    eee     rrr     ooo              1   
                                            9                                                4   
                                           10    e       f                                  3  
                                           11    a       e                                 3   
                                           12    g       h                                 6   
                                           13    i       j       k                          3   
                                           14    i       j       k       a                 3   
                                           15    sss     www                          1   
                                           16                            eee              1  

R_Win
Calcite | Level 5

But art but while coming to the output above i have not seen the household id 2 and 5 are mising...

Then i have changed the macro intial household id to 1then i got this output..

  %let stopthis=1;

  /*Assign initial household numbers*/

the ouput i got is ...

   Obs    pan1    add1    pan2    pan3    household

                                            1    aaa     bbb     ccc     ddd         1   
                                            2    qqq     rrr     www     aaa         1   
                                            3    rrr     ppp     mmm     lll         1   
                                            4    uuu     zzz     ffff    ppp         1   
                                            5    p       l       m       n           2   
                                            6    m       aaa                         2   
                                            7    a       b       c                   3   
                                            8    jjjj    eee     rrr     ooo         1   
                                            9                                        4   
                                           10    e       f                           5   
                                           11    a       e                           3   
                                           12    g       h                           6   
                                           13    i       j       k                   7   
                                           14    i       j       k       a           7   
                                           15    sss     www                         1   
                                           16                            eee         1   

art297
Opal | Level 21

My algorithm, by design, skips a household number if it was incorrectly assigned during an earlier iteration. Thus, in the case of your data, since 2 was incorrectly assigned in the first iteratioin, 2 is never used in the final result.

Art

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 124 replies
  • 3693 views
  • 4 likes
  • 7 in conversation