DATA Step, Macro, Functions and more

Association between observations

Reply
New Contributor
Posts: 2

Association between observations

[ Edited ]

Hi community !

I'm stuck on a problem of association. My data looks like :

clientMasterclient_1Masterclient_2Masterclient_3
aa  
bab 
c  e
d b 
e  e

 

And I would like to add a column which make the relation between my observations. So it would be something like that :

 

clientMasterclient_1Masterclient_2Masterclient_3MasterClient
aa  a
bab a
c  ee
d b a
e  ee

 

 

First I began by taking the first column filled but there is a mistake.
The observation "d" would have an "e" in MasterClient instead of an "a" like the others of his group.

 

Do you have an idea for solving that problem?

 

An idea I'm exploring is to fill the blanks when it's possible. Like that I could be able to use my first method.

 

You can answer in french or english. Thank you !

PROC Star
Posts: 1,209

Re: Association between observations

Posted in reply to kawabongga

I don't understand the logic here.

 

Why is MasterClient='a' in the fourth observation?

New Contributor
Posts: 2

Re: Association between observations

It's an "a" because the first observation is related to the 2nd and the 2nd to the fourth. I want the same id for all of them.

Super User
Posts: 23,224

Re: Association between observations

Posted in reply to kawabongga

It's a Type II dimension with a very bad starting data structure. You need recursion or a tree search and need to reform your data to make this work. 

 

Is this how your data is stored originally? Type II dimensions are typically in a long format.

https://en.wikipedia.org/wiki/Slowly_changing_dimension

 

If you resturcture your data to have a start and end, this macro will create the 'groups' for you so that you can uniquely identify each person. 

 

https://gist.github.com/statgeek/14e3aa2a9f718f551cd98134e9ceed30

 

 

 

 

Super User
Posts: 10,681

Re: Association between observations

Posted in reply to kawabongga

First of all , you need to know the top Master .

Here I made a table to contain it.


data ancient;
 set have;
 if _start=_end;
run;

After that the following code could give you a start.

 


 


data temp;
infile cards expandtabs ;
input client $ Masterclient_1 $ Masterclient_2 $ Masterclient_3 $;
cards;
a	a . .	 	 
b	a	b	. 
c	 .	 .	e
d	 .	b	 .
e	 .	. 	e
;
run;
data have;
 set temp;
 array x{*} $ Masterclient: ;
 _end=client;
 do i=1 to dim(x);
   if not missing(x{i}) then do;_start=x{i};output; end;
 end;
 keep _start _end;
run;




data ancient;
 set have;
 if _start=_end;
run;



data want(keep=path);
if _n_ eq 1 then do;
length path _path  $ 400 ;
if 0 then set have;
declare hash ha(hashexp:20,dataset:'have(where=(_start is not missing and _end is not missing))',multidata:'y');
ha.definekey('_start');
ha.definedata('_end');
ha.definedone();

declare hash pa(ordered:'y');
declare hiter hi_path('pa');
pa.definekey('n');
pa.definedata('n','path');
pa.definedone();

end;


set ancient;
count=1;n=1;_n=1;
path=catx('|',_start,_end);
   
pa.add();
do while(hi_path.next()=0);
 if n ne 1 then pa.remove(key:_n);_n=n;
 _path=path;  
 _start=scan(path,-1,'|');
 rc=ha.find();if rc ne 0 then output;
 do while(rc=0);
  if not findw(path,strip(_end),'|') then do;
   if length(path)+length(_end)+1 gt lengthc(path) then do;
    putlog 'ERROR: The length of path and _path are set too short';
    stop;
   end;
   
   count+1;n=count;
   path=catx('|',path,_end);
   pa.add();
   path=_path;
 end;
  rc=ha.find_next();
end;
end;
pa.clear();
run;
data want;
 set want;
 start=scan(path,1,'|');
 end=scan(path,-1,'|');
 count=countw(path,'|')-1;
run;
Ask a Question
Discussion stats
  • 4 replies
  • 153 views
  • 0 likes
  • 4 in conversation