BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
kiranv_
Rhodochrosite | Level 12

Hi

 

I have below dataset

data have;
input id val1 $ val2 $;
datalines;
1 123 CDE
1 345 CEF
1 756 CEF
1 458 HHH
1 789 JJJ
1 999 KLM
1 777 KLM
;

depending on val2 I need to concatenated val1 and get unique combination, output should be  like something as shown below. I have multiple ids, for simplicity sake, I am just showing one id.

 

Edit1:  In this table  val2 column have duplicates for CEF and KLM values,  rule is the concatenation of val1 should  have only single combination of  CEF and KLM. Below is little elaborate discussion on this

 

   first row shown in want dataset has  123,345(CEF),458,789,999(KLM) is having only one row  with respect val2 column of CEF and  

   KLM (first instances of CEF and KLM). 

  second row has 123,756(second instance of CEF),458,789,999( first instance of KLM)

   third row has 123,345 (first instance of CEF),458,789,777(second instance of KLM)

fourth row has 123,756 (second instance of CEF),,458,789,777 (second instance of KLM)

 

 

data want;
Input Val $50.;
datalines;
123,345,458,789,999
123,756,458,789,999
123,345,458,789,777
123,756,458,789,777
;

 I have tried first. and last. but I am unable to make unique combinations. Please let me know if you have any suggestions or ideas for this problem an also let me know if something is unclear.

 

1 ACCEPTED SOLUTION

Accepted Solutions
hashman
Ammonite | Level 13

@kiranv_

 

A curious puzzle. The wrinkle is the order in which you want to collect the values from the dupes. Normally, one would fix the first dupe value of the first dupe and collect the rest, then fix the second and collect the rest, and so on. The difference between this "usual" order and your order can be illustrated as follows:

Usual order             Your order         
-----------             ----------         
data _null_ ;           data _null_ ;      
  do i = 1 to 2 ;         do j = 1 to 2 ;  
    do j = 1 to 2 ;         do i = 1 to 2 ;
      put i= j= ;             put i= j= ;  
    end ;                   end ;          
  end ;                   end ;            
run ;                   run ;              
-------                 -------            
i=1 j=1                 i=1 j=1            
i=1 j=2                 i=2 j=1            
i=2 j=1                 i=1 j=2            
i=2 j=2                 i=2 j=2            

Hence, if nested loops are to be used for the task (as done below), they need to be inverted similarly to the above. You can run my code directly against your sample input (and will get the result you want), but for the sake of wider generality I've prepared my own, so that a case with double and triple dupes could be tested - otherwise one can't be sure that the algorithm is valid.

data have ;               
  input id val1 $ val2 $ ;
  cards ;                 
1 111 AAA                 
1 222 BBB                 
1 333 BBB                 
1 444 CCC                 
1 555 DDD                 
1 666 EEE                 
1 777 EEE                 
1 888 EEE                 
1 999 FFF                 
2 000 AAA                 
2 111 AAA                 
2 222 BBB                 
2 333 CCC                 
2 444 CCC                 
2 555 DDD                 
2 666 EEE                 
2 777 FFF                 
2 888 FFF                 
2 999 GGG                 
;                         
run ;                     

Since we'll need to assemble nested loops, it's convenient to determine the max nesting level (and the size of the needed arrays, which is the same) beforehand, rather than hard code:

proc sql noprint ;                                                                  
  select cats (max (D)) into:D from (select count (id)    as D from have group id) ;
  select cats (max (L)) into:L from (select length (val1) as L from have) ;         
quit ;                                                                              

Now we need to assemble the nested loop based on these findings:

%macro loop (dim, arrv, arrp, arrs, jvar, csv) ;
  %local i ;                                    
  array _x [&dim] ;                             
  %do i = &dim %to 1 %by -1 ;                   
    do _x&i = &arrp [1,&i] to &arrp [2,&i] ;    
  %end ;                                        
  do _i = 1 to &jvar ;                          
    &arrs [_i] = &arrv [_x[_i]] ;               
  end ;                                         
  &csv = catx (",", of &arrs[*]) ;              
  output ;                                      
  %do i = &dim %to 1 %by -1 ;                   
    end ;                                       
  %end ;                                        
%mend ;                                         

After this prelim work is done, the rest is relatively easy:

data want (keep = id val) ;              
  array v [   &d] $ &L _temporary_ ;     
  array s [   &d] $ &L _temporary_ ;     
  array p [2, &d]      _temporary_ ;     
  do _n_ = 1 by 1 until (last.id) ;      
    set have ;                           
    by id val2 ;                         
    v [_n_] = val1 ;                     
    if first.val2 then do ;              
      _j = sum (_j, 1) ;                 
      p [1, _j] = _n_ ;                  
    end ;                                
    if last.val2 then p [2, _j] = _n_ ;  
  end ;                                  
  do _n_ = _j + 1 to &d ;                
    p [1, _n_] = 0 ;                     
    p [2, _n_] = 0 ;                     
  end ;                                  
  length val $ %eval (&L * &D + &D - 1) ;
  call missing (of s[*]) ;               
  %loop (&d, v, p, s, _j, val)           
run ;                                    

The idea is based on keeping track of the endpoints of each consecutive group of VAL2 values. Above, this is done using the 2-dimensional array P. If the value VAL2 in position X is unique, the endpoints P[1,X]=P[2,X], so the corresponding loop level will iterate but once; otherwise, they are not equal and the range in between determines the respective number of iterations. Thus, the logic is the same whether it be a dupe or a singleton. It is assumed that the input is sorted by ID, VAL2. The rest is just a matter or rather simple diligent coding:

data want (keep = id val) ;              
  array v [   &d] $ &L _temporary_ ;     
  array s [   &d] $ &L _temporary_ ;     
  array p [2, &d]      _temporary_ ;     
  do _n_ = 1 by 1 until (last.id) ;      
    set have ;                           
    by id val2 ;                         
    v [_n_] = val1 ;                     
    if first.val2 then do ;              
      _j = sum (_j, 1) ;                 
      p [1, _j] = _n_ ;                  
    end ;                                
    if last.val2 then p [2, _j] = _n_ ;  
  end ;                                  
  do _n_ = _j + 1 to &d ;                
    p [1, _n_] = 0 ;                     
    p [2, _n_] = 0 ;                     
  end ;                                  
  length val $ %eval (&L * &D + &D - 1) ;
  call missing (of s[*]) ;               
  %loop (&d, v, p, s, _j, val)           
run ;                                    

To give you a better idea what kind of code is actually run, here's what the macro would assemble if the longest ID by-group on file had 4 records:

  array _x [4] ;               
  do _x4 = p [1,4] to p [2,4] ;
  do _x3 = p [1,3] to p [2,3] ;
  do _x2 = p [1,2] to p [2,2] ;
  do _x1 = p [1,1] to p [2,1] ;
    do _i = 1 to _j ;          
      s [_i] = v [_x[_i]] ;    
    end ;                      
    val = catx (",", of s[*]) ;
  output ;                     
  end ;                        
  end ;                        
  end ;                        
  end ;                        

HTH

 

Paul D.

      

 

   

View solution in original post

8 REPLIES 8
ballardw
Super User

You need to describe more about the "rules" involved in making your combinations. Your subject line says "depending on other variable". How does the process depend on other variable (which I have to assume is val2 in your example data since you did not state which).

Also, does the output order have to exactly match?

You only show one value if ID. Do you have more than one value for Id? If so, what role does Id play in building the combinations.

 

Also your second and fourth line of "want" data are identical. Was line 4 supposed to be

 

123,756,458,789,777

kiranv_
Rhodochrosite | Level 12

@ballardw you are right, My line 4 was wrong and I have edited. I have also elaborated on rules for concatenation. Thanks for your suggestions

novinosrin
Tourmaline | Level 20

Hi @kiranv_

 

Not happy with my own performance on the code, but you can use until the champs provide you efficient ones. And i am sure being a Proc star you can just make it better too

 

data have;
input id val1 $ val2 $;
datalines;
1 123 CDE
1 345 CEF
1 756 CEF
1 458 HHH
1 789 JJJ
1 999 KLM
1 777 KLM
;

proc transpose data=have out=w;
by id val2;
var val1;
run;

proc transpose data=w(keep= id col:) out=w2;
by id ;
var col:;
run;

data wanted;
length temp $8;
call missing(temp);
if _n_ =1 then do;
dcl hash h();
 h.definekey ("i") ;
 h.definedata ('temp') ;
 h.definedone () ;
 end;
set w2 ;
by id;
array t(*) col:;
array tem(*) $ j1-j5;
retain tem;
if first.id then do;
do n=1 to dim(tem);
tem(n)=t(n);
end;
output;
h.clear();
end;
else 
do;
	do i=1 to dim(t);
		if not missing(t(i)) then do;
		rc= h.add(key:i,data:tem(i));tem(i)=t(i);
		output;
		rc=h.find();
		tem(i)=temp;
							end;
	end;
	do i=1 to dim(t);
		if not missing(t(i)) then tem(i)=t(i);
	end;
	output;
end;keep j:;
run;

 

hashman
Ammonite | Level 13

@kiranv_

 

A curious puzzle. The wrinkle is the order in which you want to collect the values from the dupes. Normally, one would fix the first dupe value of the first dupe and collect the rest, then fix the second and collect the rest, and so on. The difference between this "usual" order and your order can be illustrated as follows:

Usual order             Your order         
-----------             ----------         
data _null_ ;           data _null_ ;      
  do i = 1 to 2 ;         do j = 1 to 2 ;  
    do j = 1 to 2 ;         do i = 1 to 2 ;
      put i= j= ;             put i= j= ;  
    end ;                   end ;          
  end ;                   end ;            
run ;                   run ;              
-------                 -------            
i=1 j=1                 i=1 j=1            
i=1 j=2                 i=2 j=1            
i=2 j=1                 i=1 j=2            
i=2 j=2                 i=2 j=2            

Hence, if nested loops are to be used for the task (as done below), they need to be inverted similarly to the above. You can run my code directly against your sample input (and will get the result you want), but for the sake of wider generality I've prepared my own, so that a case with double and triple dupes could be tested - otherwise one can't be sure that the algorithm is valid.

data have ;               
  input id val1 $ val2 $ ;
  cards ;                 
1 111 AAA                 
1 222 BBB                 
1 333 BBB                 
1 444 CCC                 
1 555 DDD                 
1 666 EEE                 
1 777 EEE                 
1 888 EEE                 
1 999 FFF                 
2 000 AAA                 
2 111 AAA                 
2 222 BBB                 
2 333 CCC                 
2 444 CCC                 
2 555 DDD                 
2 666 EEE                 
2 777 FFF                 
2 888 FFF                 
2 999 GGG                 
;                         
run ;                     

Since we'll need to assemble nested loops, it's convenient to determine the max nesting level (and the size of the needed arrays, which is the same) beforehand, rather than hard code:

proc sql noprint ;                                                                  
  select cats (max (D)) into:D from (select count (id)    as D from have group id) ;
  select cats (max (L)) into:L from (select length (val1) as L from have) ;         
quit ;                                                                              

Now we need to assemble the nested loop based on these findings:

%macro loop (dim, arrv, arrp, arrs, jvar, csv) ;
  %local i ;                                    
  array _x [&dim] ;                             
  %do i = &dim %to 1 %by -1 ;                   
    do _x&i = &arrp [1,&i] to &arrp [2,&i] ;    
  %end ;                                        
  do _i = 1 to &jvar ;                          
    &arrs [_i] = &arrv [_x[_i]] ;               
  end ;                                         
  &csv = catx (",", of &arrs[*]) ;              
  output ;                                      
  %do i = &dim %to 1 %by -1 ;                   
    end ;                                       
  %end ;                                        
%mend ;                                         

After this prelim work is done, the rest is relatively easy:

data want (keep = id val) ;              
  array v [   &d] $ &L _temporary_ ;     
  array s [   &d] $ &L _temporary_ ;     
  array p [2, &d]      _temporary_ ;     
  do _n_ = 1 by 1 until (last.id) ;      
    set have ;                           
    by id val2 ;                         
    v [_n_] = val1 ;                     
    if first.val2 then do ;              
      _j = sum (_j, 1) ;                 
      p [1, _j] = _n_ ;                  
    end ;                                
    if last.val2 then p [2, _j] = _n_ ;  
  end ;                                  
  do _n_ = _j + 1 to &d ;                
    p [1, _n_] = 0 ;                     
    p [2, _n_] = 0 ;                     
  end ;                                  
  length val $ %eval (&L * &D + &D - 1) ;
  call missing (of s[*]) ;               
  %loop (&d, v, p, s, _j, val)           
run ;                                    

The idea is based on keeping track of the endpoints of each consecutive group of VAL2 values. Above, this is done using the 2-dimensional array P. If the value VAL2 in position X is unique, the endpoints P[1,X]=P[2,X], so the corresponding loop level will iterate but once; otherwise, they are not equal and the range in between determines the respective number of iterations. Thus, the logic is the same whether it be a dupe or a singleton. It is assumed that the input is sorted by ID, VAL2. The rest is just a matter or rather simple diligent coding:

data want (keep = id val) ;              
  array v [   &d] $ &L _temporary_ ;     
  array s [   &d] $ &L _temporary_ ;     
  array p [2, &d]      _temporary_ ;     
  do _n_ = 1 by 1 until (last.id) ;      
    set have ;                           
    by id val2 ;                         
    v [_n_] = val1 ;                     
    if first.val2 then do ;              
      _j = sum (_j, 1) ;                 
      p [1, _j] = _n_ ;                  
    end ;                                
    if last.val2 then p [2, _j] = _n_ ;  
  end ;                                  
  do _n_ = _j + 1 to &d ;                
    p [1, _n_] = 0 ;                     
    p [2, _n_] = 0 ;                     
  end ;                                  
  length val $ %eval (&L * &D + &D - 1) ;
  call missing (of s[*]) ;               
  %loop (&d, v, p, s, _j, val)           
run ;                                    

To give you a better idea what kind of code is actually run, here's what the macro would assemble if the longest ID by-group on file had 4 records:

  array _x [4] ;               
  do _x4 = p [1,4] to p [2,4] ;
  do _x3 = p [1,3] to p [2,3] ;
  do _x2 = p [1,2] to p [2,2] ;
  do _x1 = p [1,1] to p [2,1] ;
    do _i = 1 to _j ;          
      s [_i] = v [_x[_i]] ;    
    end ;                      
    val = catx (",", of s[*]) ;
  output ;                     
  end ;                        
  end ;                        
  end ;                        
  end ;                        

HTH

 

Paul D.

      

 

   

kiranv_
Rhodochrosite | Level 12

An excellent solution, with a detailed explanation. thank you so much Paul

novinosrin
Tourmaline | Level 20

@kiranv_   Thank you for acknowledging. The well known incarnation of the lord @hashman dorfman who dwarfs others blesses us us with sas miracles whose en-devours  I often peek into his peekc/long and other papers. I missed to inform you that I got hold of his book by university safari subscription which my university charged me along with the tuition fee at student rate. So, I can't thank my uni enough although if that wasn't the case, i would pay Paul directly. 

 

I remember you were saying you want to buy that too. Please do so if you haven't. The book has been released in the open market now and  DonH sent that notification on the same thread I posted about the book.  Just to let you know mate!

 

PS.

If you already knew, sorry about that lol

kiranv_
Rhodochrosite | Level 12

you are absolutely right and I am buying it today.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1111 views
  • 5 likes
  • 4 in conversation