Solved: Concatenate variables in column depending on other variable and keep u...

kiranv_ · Posted 07-27-2018 01:32 PM

Hi

I have below dataset

data have;
input id val1 $ val2 $;
datalines;
1 123 CDE
1 345 CEF
1 756 CEF
1 458 HHH
1 789 JJJ
1 999 KLM
1 777 KLM
;

depending on val2 I need to concatenated val1 and get unique combination, output should be like something as shown below. I have multiple ids, for simplicity sake, I am just showing one id.

Edit1: In this table val2 column have duplicates for CEF and KLM values, rule is the concatenation of val1 should have only single combination of CEF and KLM. Below is little elaborate discussion on this

first row shown in want dataset has 123,345(CEF),458,789,999(KLM) is having only one row with respect val2 column of CEF and

KLM (first instances of CEF and KLM).

second row has 123,756(second instance of CEF),458,789,999( first instance of KLM)

third row has 123,345 (first instance of CEF),458,789,777(second instance of KLM)

fourth row has 123,756 (second instance of CEF),,458,789,777 (second instance of KLM)

data want;
Input Val $50.;
datalines;
123,345,458,789,999
123,756,458,789,999
123,345,458,789,777
123,756,458,789,777
;

I have tried first. and last. but I am unable to make unique combinations. Please let me know if you have any suggestions or ideas for this problem an also let me know if something is unclear.

hashman · Posted 07-28-2018 06:36 PM

@kiranv_,

A curious puzzle. The wrinkle is the order in which you want to collect the values from the dupes. Normally, one would fix the first dupe value of the first dupe and collect the rest, then fix the second and collect the rest, and so on. The difference between this "usual" order and your order can be illustrated as follows:

Usual order             Your order         
-----------             ----------         
data _null_ ;           data _null_ ;      
  do i = 1 to 2 ;         do j = 1 to 2 ;  
    do j = 1 to 2 ;         do i = 1 to 2 ;
      put i= j= ;             put i= j= ;  
    end ;                   end ;          
  end ;                   end ;            
run ;                   run ;              
-------                 -------            
i=1 j=1                 i=1 j=1            
i=1 j=2                 i=2 j=1            
i=2 j=1                 i=1 j=2            
i=2 j=2                 i=2 j=2

Hence, if nested loops are to be used for the task (as done below), they need to be inverted similarly to the above. You can run my code directly against your sample input (and will get the result you want), but for the sake of wider generality I've prepared my own, so that a case with double and triple dupes could be tested - otherwise one can't be sure that the algorithm is valid.

data have ;               
  input id val1 $ val2 $ ;
  cards ;                 
1 111 AAA                 
1 222 BBB                 
1 333 BBB                 
1 444 CCC                 
1 555 DDD                 
1 666 EEE                 
1 777 EEE                 
1 888 EEE                 
1 999 FFF                 
2 000 AAA                 
2 111 AAA                 
2 222 BBB                 
2 333 CCC                 
2 444 CCC                 
2 555 DDD                 
2 666 EEE                 
2 777 FFF                 
2 888 FFF                 
2 999 GGG                 
;                         
run ;

Since we'll need to assemble nested loops, it's convenient to determine the max nesting level (and the size of the needed arrays, which is the same) beforehand, rather than hard code:

proc sql noprint ;                                                                  
  select cats (max (D)) into:D from (select count (id)    as D from have group id) ;
  select cats (max (L)) into:L from (select length (val1) as L from have) ;         
quit ;

Now we need to assemble the nested loop based on these findings:

%macro loop (dim, arrv, arrp, arrs, jvar, csv) ;
  %local i ;                                    
  array _x [&dim] ;                             
  %do i = &dim %to 1 %by -1 ;                   
    do _x&i = &arrp [1,&i] to &arrp [2,&i] ;    
  %end ;                                        
  do _i = 1 to &jvar ;                          
    &arrs [_i] = &arrv [_x[_i]] ;               
  end ;                                         
  &csv = catx (",", of &arrs[*]) ;              
  output ;                                      
  %do i = &dim %to 1 %by -1 ;                   
    end ;                                       
  %end ;                                        
%mend ;

After this prelim work is done, the rest is relatively easy:

data want (keep = id val) ;              
  array v [   &d] $ &L _temporary_ ;     
  array s [   &d] $ &L _temporary_ ;     
  array p [2, &d]      _temporary_ ;     
  do _n_ = 1 by 1 until (last.id) ;      
    set have ;                           
    by id val2 ;                         
    v [_n_] = val1 ;                     
    if first.val2 then do ;              
      _j = sum (_j, 1) ;                 
      p [1, _j] = _n_ ;                  
    end ;                                
    if last.val2 then p [2, _j] = _n_ ;  
  end ;                                  
  do _n_ = _j + 1 to &d ;                
    p [1, _n_] = 0 ;                     
    p [2, _n_] = 0 ;                     
  end ;                                  
  length val $ %eval (&L * &D + &D - 1) ;
  call missing (of s[*]) ;               
  %loop (&d, v, p, s, _j, val)           
run ;

The idea is based on keeping track of the endpoints of each consecutive group of VAL2 values. Above, this is done using the 2-dimensional array P. If the value VAL2 in position X is unique, the endpoints P[1,X]=P[2,X], so the corresponding loop level will iterate but once; otherwise, they are not equal and the range in between determines the respective number of iterations. Thus, the logic is the same whether it be a dupe or a singleton. It is assumed that the input is sorted by ID, VAL2. The rest is just a matter or rather simple diligent coding:

data want (keep = id val) ;              
  array v [   &d] $ &L _temporary_ ;     
  array s [   &d] $ &L _temporary_ ;     
  array p [2, &d]      _temporary_ ;     
  do _n_ = 1 by 1 until (last.id) ;      
    set have ;                           
    by id val2 ;                         
    v [_n_] = val1 ;                     
    if first.val2 then do ;              
      _j = sum (_j, 1) ;                 
      p [1, _j] = _n_ ;                  
    end ;                                
    if last.val2 then p [2, _j] = _n_ ;  
  end ;                                  
  do _n_ = _j + 1 to &d ;                
    p [1, _n_] = 0 ;                     
    p [2, _n_] = 0 ;                     
  end ;                                  
  length val $ %eval (&L * &D + &D - 1) ;
  call missing (of s[*]) ;               
  %loop (&d, v, p, s, _j, val)           
run ;

To give you a better idea what kind of code is actually run, here's what the macro would assemble if the longest ID by-group on file had 4 records:

  array _x [4] ;               
  do _x4 = p [1,4] to p [2,4] ;
  do _x3 = p [1,3] to p [2,3] ;
  do _x2 = p [1,2] to p [2,2] ;
  do _x1 = p [1,1] to p [2,1] ;
    do _i = 1 to _j ;          
      s [_i] = v [_x[_i]] ;    
    end ;                      
    val = catx (",", of s[*]) ;
  output ;                     
  end ;                        
  end ;                        
  end ;                        
  end ;

HTH

Paul D.

View solution in original post

ballardw · Posted 07-27-2018 01:38 PM

You need to describe more about the "rules" involved in making your combinations. Your subject line says "depending on other variable". How does the process depend on other variable (which I have to assume is val2 in your example data since you did not state which).

Also, does the output order have to exactly match?

You only show one value if ID. Do you have more than one value for Id? If so, what role does Id play in building the combinations.

Also your second and fourth line of "want" data are identical. Was line 4 supposed to be

123,756,458,789,777

kiranv_ · Posted 07-27-2018 02:07 PM

@ballardw you are right, My line 4 was wrong and I have edited. I have also elaborated on rules for concatenation. Thanks for your suggestions

novinosrin · Posted 07-27-2018 04:29 PM

Hi @kiranv_

Not happy with my own performance on the code, but you can use until the champs provide you efficient ones. And i am sure being a Proc star you can just make it better too

data have;
input id val1 $ val2 $;
datalines;
1 123 CDE
1 345 CEF
1 756 CEF
1 458 HHH
1 789 JJJ
1 999 KLM
1 777 KLM
;

proc transpose data=have out=w;
by id val2;
var val1;
run;

proc transpose data=w(keep= id col:) out=w2;
by id ;
var col:;
run;

data wanted;
length temp $8;
call missing(temp);
if _n_ =1 then do;
dcl hash h();
 h.definekey ("i") ;
 h.definedata ('temp') ;
 h.definedone () ;
 end;
set w2 ;
by id;
array t(*) col:;
array tem(*) $ j1-j5;
retain tem;
if first.id then do;
do n=1 to dim(tem);
tem(n)=t(n);
end;
output;
h.clear();
end;
else 
do;
	do i=1 to dim(t);
		if not missing(t(i)) then do;
		rc= h.add(key:i,data:tem(i));tem(i)=t(i);
		output;
		rc=h.find();
		tem(i)=temp;
							end;
	end;
	do i=1 to dim(t);
		if not missing(t(i)) then tem(i)=t(i);
	end;
	output;
end;keep j:;
run;

kiranv_ · Posted 07-30-2018 10:58 AM

Thanks @novinosrin, that was an elegant answer

hashman · Posted 07-28-2018 06:36 PM

@kiranv_,

A curious puzzle. The wrinkle is the order in which you want to collect the values from the dupes. Normally, one would fix the first dupe value of the first dupe and collect the rest, then fix the second and collect the rest, and so on. The difference between this "usual" order and your order can be illustrated as follows:

Usual order             Your order         
-----------             ----------         
data _null_ ;           data _null_ ;      
  do i = 1 to 2 ;         do j = 1 to 2 ;  
    do j = 1 to 2 ;         do i = 1 to 2 ;
      put i= j= ;             put i= j= ;  
    end ;                   end ;          
  end ;                   end ;            
run ;                   run ;              
-------                 -------            
i=1 j=1                 i=1 j=1            
i=1 j=2                 i=2 j=1            
i=2 j=1                 i=1 j=2            
i=2 j=2                 i=2 j=2

Hence, if nested loops are to be used for the task (as done below), they need to be inverted similarly to the above. You can run my code directly against your sample input (and will get the result you want), but for the sake of wider generality I've prepared my own, so that a case with double and triple dupes could be tested - otherwise one can't be sure that the algorithm is valid.

data have ;               
  input id val1 $ val2 $ ;
  cards ;                 
1 111 AAA                 
1 222 BBB                 
1 333 BBB                 
1 444 CCC                 
1 555 DDD                 
1 666 EEE                 
1 777 EEE                 
1 888 EEE                 
1 999 FFF                 
2 000 AAA                 
2 111 AAA                 
2 222 BBB                 
2 333 CCC                 
2 444 CCC                 
2 555 DDD                 
2 666 EEE                 
2 777 FFF                 
2 888 FFF                 
2 999 GGG                 
;                         
run ;

Since we'll need to assemble nested loops, it's convenient to determine the max nesting level (and the size of the needed arrays, which is the same) beforehand, rather than hard code:

proc sql noprint ;                                                                  
  select cats (max (D)) into:D from (select count (id)    as D from have group id) ;
  select cats (max (L)) into:L from (select length (val1) as L from have) ;         
quit ;

Now we need to assemble the nested loop based on these findings:

%macro loop (dim, arrv, arrp, arrs, jvar, csv) ;
  %local i ;                                    
  array _x [&dim] ;                             
  %do i = &dim %to 1 %by -1 ;                   
    do _x&i = &arrp [1,&i] to &arrp [2,&i] ;    
  %end ;                                        
  do _i = 1 to &jvar ;                          
    &arrs [_i] = &arrv [_x[_i]] ;               
  end ;                                         
  &csv = catx (",", of &arrs[*]) ;              
  output ;                                      
  %do i = &dim %to 1 %by -1 ;                   
    end ;                                       
  %end ;                                        
%mend ;

After this prelim work is done, the rest is relatively easy:

data want (keep = id val) ;              
  array v [   &d] $ &L _temporary_ ;     
  array s [   &d] $ &L _temporary_ ;     
  array p [2, &d]      _temporary_ ;     
  do _n_ = 1 by 1 until (last.id) ;      
    set have ;                           
    by id val2 ;                         
    v [_n_] = val1 ;                     
    if first.val2 then do ;              
      _j = sum (_j, 1) ;                 
      p [1, _j] = _n_ ;                  
    end ;                                
    if last.val2 then p [2, _j] = _n_ ;  
  end ;                                  
  do _n_ = _j + 1 to &d ;                
    p [1, _n_] = 0 ;                     
    p [2, _n_] = 0 ;                     
  end ;                                  
  length val $ %eval (&L * &D + &D - 1) ;
  call missing (of s[*]) ;               
  %loop (&d, v, p, s, _j, val)           
run ;

The idea is based on keeping track of the endpoints of each consecutive group of VAL2 values. Above, this is done using the 2-dimensional array P. If the value VAL2 in position X is unique, the endpoints P[1,X]=P[2,X], so the corresponding loop level will iterate but once; otherwise, they are not equal and the range in between determines the respective number of iterations. Thus, the logic is the same whether it be a dupe or a singleton. It is assumed that the input is sorted by ID, VAL2. The rest is just a matter or rather simple diligent coding:

data want (keep = id val) ;              
  array v [   &d] $ &L _temporary_ ;     
  array s [   &d] $ &L _temporary_ ;     
  array p [2, &d]      _temporary_ ;     
  do _n_ = 1 by 1 until (last.id) ;      
    set have ;                           
    by id val2 ;                         
    v [_n_] = val1 ;                     
    if first.val2 then do ;              
      _j = sum (_j, 1) ;                 
      p [1, _j] = _n_ ;                  
    end ;                                
    if last.val2 then p [2, _j] = _n_ ;  
  end ;                                  
  do _n_ = _j + 1 to &d ;                
    p [1, _n_] = 0 ;                     
    p [2, _n_] = 0 ;                     
  end ;                                  
  length val $ %eval (&L * &D + &D - 1) ;
  call missing (of s[*]) ;               
  %loop (&d, v, p, s, _j, val)           
run ;

To give you a better idea what kind of code is actually run, here's what the macro would assemble if the longest ID by-group on file had 4 records:

  array _x [4] ;               
  do _x4 = p [1,4] to p [2,4] ;
  do _x3 = p [1,3] to p [2,3] ;
  do _x2 = p [1,2] to p [2,2] ;
  do _x1 = p [1,1] to p [2,1] ;
    do _i = 1 to _j ;          
      s [_i] = v [_x[_i]] ;    
    end ;                      
    val = catx (",", of s[*]) ;
  output ;                     
  end ;                        
  end ;                        
  end ;                        
  end ;

HTH

Paul D.

kiranv_ · Posted 07-30-2018 11:00 AM

An excellent solution, with a detailed explanation. thank you so much Paul

novinosrin · Posted 07-30-2018 11:10 AM

@kiranv_ Thank you for acknowledging. The well known incarnation of the lord @hashman dorfman who dwarfs others blesses us us with sas miracles whose en-devours I often peek into his peekc/long and other papers. I missed to inform you that I got hold of his book by university safari subscription which my university charged me along with the tuition fee at student rate. So, I can't thank my uni enough although if that wasn't the case, i would pay Paul directly.

I remember you were saying you want to buy that too. Please do so if you haven't. The book has been released in the open market now and DonH sent that notification on the same thread I posted about the book. Just to let you know mate!

PS.

If you already knew, sorry about that lol

kiranv_ · Posted 07-30-2018 11:40 AM

you are absolutely right and I am buying it today.

Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

Re: Concatenate variables in column depending on other variable and keep unique combination

SAS Innovate 2025: Call for Content

Classroom Training Available!