Re: Find the name of the owner's favorite monkey

supp · Posted 09-20-2019 09:29 AM

If we have some data on pets and their owners such as:

data have;
	infile datalines truncover;
		input owner $ pet_type1 $ pet_name_1 $ pet_type2 $ pet_name_2 $
			      pet_type3 $ pet_name_3 $ pet_type4 $ pet_name_4 $
			      pet_type5 $ pet_name_5 $ pet_type6 $ pet_name_6 $
			      pet_type7 $ pet_name_7 $ ;
datalines;
kimberly dog sadie cat manson cat nellie monkey salty
landry dog manda
julie dog emmy dog sanchez
spencer cat armstrong dog cotton monkey jenkins monkey figgy monkey alex
penelope fish baldwin
welsh monkey constance dog mulder dog figgy monkey ac
alina cat chloe cat marls dog keen dog yasin cat avery cat harold monkey albert
roy cat eden
rachael fish gill
decker dog isabel monkey colossal
amira monkey curious
ramsey monkey george monkey cosmic dog jay cat kira cat ross cat kumar
;
run;

One owner can have up to 7 pets, all listed on one row. The pets have been ranked in order of the owners favorite. The structure is pet_type1, pet_name1 (most favorite) to pet_type7, pet_name7 (least favorite). The desired output is listing all owners in the original data with just the name of their favorite monkey. If they have multiple monkeys (pet # 1 and pet #3), we want to return the name from the lowest pet #. In the example data we would want to return:

kimberly salty
landry 
julie
spencer jenkins
penelope
welsh constance
alina albert
roy
rachael
decker colossal
amira curious
ramsey george

I am sure I can do this with a bunch of if statements. Is there a better approach to this? Maybe there is already a function that can handle this?

Thanks in advance!

gamotte · Posted 09-20-2019 09:40 AM

Hello,

data want;
    set have;
    keep owner fav_monkey;
    array pets pet:;
    fst_monkey_idx=whichc("monkey", of pets(*));
    if fst_monkey_idx then fav_monkey=pets(fst_monkey_idx+1);
run;

PaigeMiller · Posted 09-20-2019 09:41 AM


data want;
    set have;
    length favorite_monkey $ 16;
    array pet_type pet_type:;
    array pet_name pet_name:;
    do i=1 to dim(pet_type);
        if pet_type(i)='monkey' then do;
            favorite_monkey=pet_name(i);
            leave;
        end;
    end;
    keep owner favorite_monkey;
run;

--
Paige Miller

novinosrin · Posted 09-20-2019 09:58 AM

Sir, The linear method of yours interesting runs a lot faster than whichc(of course not taking away a fraction of full credit to @gamotte idea) on my citizens machine for some reason. I have no clue why. I would have thought whichc to be faster. Or is it just my machine? Any thoughts?

PaigeMiller · Posted 09-20-2019 11:16 AM

I don't know if this accounts for much of the speed difference, but WHICHC checks both pet_type and pet_name for monkey, whereas my two array method only checks pet_type for monkey.

--
Paige Miller

PeterClemmensen · Posted 09-20-2019 09:45 AM

data want(keep=owner Favoritemonkey);
    set have;
    array p{*} $ pet_type1--pet_name_7;
    idx=whichc("monkey", of p(*));
    Favoritemonkey=ifc(idx=0, '', p[idx+1]);
run;

The DATA to DATA Step Macro
Blog: SASnrd

supp · Posted 09-20-2019 09:51 AM

thanks @PeterClemmensen. Can you explain the double dash '--' for declaring pet_type as array elements? Could we just use a single dash? Is there a difference?

PaigeMiller · Posted 09-20-2019 09:56 AM

@supp wrote:
thanks @PeterClemmensen. Can you explain the double dash '--' for declaring pet_type as array elements? Could we just use a single dash? Is there a difference?

The code is not declaring pet-type as an array. It is declaring pet_type and pet_name into the same array.

The double-dash is used because it takes all variables from pet_type1 through pet_name7 and includes all of those in the array.

--
Paige Miller

supp · Posted 09-20-2019 10:02 AM

Thanks for the clarification! I missed that it was including both type and name into the array. I need to slow down.

Panagiotis · Posted 09-20-2019 09:57 AM

I haven't validated it for all scenarios, but this works for this specific example. You will have to test it on others.

The key is using a loop and arrays to restructure the data, then begin testing for a monkey. First time you see a monkey end the restructure loop and output the row. If I run the entire loop and don't find a monkey then just output the last row which is the owners name but no value for pet.

data test;
	set have;
	array Pet_Type[*] pet_type:; /*Set values of all pet types in an array*/
        array Pet_Name[*] pet_name:; /*Set values of all pet names in an array*/
	FavoriteNum=0; /*Start a numbered list for favorite value of pet. Don't need this, just used it to see*/
	do i=1 to dim(Pet_Type); /*Run a loop and restructure the data based on the number of values in the array. Then test if it's a monkey. If it's a monkey output the row and end the loop since it's the first*/
		FavoriteNum+1; /*Increase the favorite overall number of the pet*/
		PetName=Pet_Name[i];
		PetType=Pet_Type[i];
		if upcase(PetType)="MONKEY" then do;
			output;
			leave; /*Leave loop*/
		end;
		else if dim(Pet_Type)=FavoriteNum and PetType ne "MONKEY" then output; /*If it makes it to the last row of the owner and no monkey is found, then output the row with nothing*/
 	end;
	drop i pet_:;
run;

supp · Posted 09-20-2019 11:46 AM

I applied a couple of these solutions to the real world scenario. The data is similar to what I posted and a little over 700,000 rows. If we call @gamotte's approach as approach #1 and @PaigeMiller's approach as approach #2.

Where approach #1 sets all variables into a large array and uses whichc function to find and return the array element and value of interest.

and

Where approach # 2 defines multiple smaller arrays and a do loop to find element of interest in one array and return value of interest from second array.

Both solutions are working and returning the same results (at least at first glance). Both solutions are extremely fast and for practical purposes there is no real difference in performance. @PaigeMiller 's approach does seem to be slightly faster on average. I am getting user CPU times around .2 seconds. @gamotte 's solution is getting around .3 seconds user CPU time. This doesn't seem to be significant enough to matter for a user.

I think I like using the smaller, more defined array's in @PaigeMiller 's approach as I find it easier to visualize what is happening in the code. If we add another type of variable to the data, like pet_breed, it would increase the number of elements into one large array. In the real world scenario I started getting confused with which variable was which element in the array. Any thoughts or feedback on this would be appreciated.

Along these lines is there a way to show or display what variable is what array element? I found this code "put (pets[*]) (=/);" Which was helpful, is there anything more explicit that would show array element 1 = variable name?

supp · Posted 09-20-2019 11:58 AM

I applied a couple of these solutions to the real world scenario. The data is similar to what I posted and a little over 700,000 rows. If we call @gamotte's approach as approach #1 and @PaigeMiller's approach as approach #2.

Where approach #1 sets all variables into a large array and uses whichc function to find and return the array element and value of interest.

and

Where approach # 2 defines multiple smaller arrays and a do loop to find element of interest in one array and return value of interest from second array.

Both solutions are working and returning the same results (at least at first glance). Both solutions are extremely fast and for practical purposes there is no real difference in performance. PaigeMiller's approach does seem to be slightly faster on average. I am getting user CPU times around .2 seconds. gamotte's solution is getting around .3 seconds user CPU time. This doesn't seem to be significant enough to matter for a user.

I think I like using the smaller, more defined array's in PaigeMiller 's approach as I find it easier to visualize what is happening in the code. If we add another type of variable to the data, like pet_breed, it would increase the number of elements into one large array. In the real world scenario I started getting confused with which variable was which element in the array. Any thoughts or feedback on this would be appreciated.

Along these lines is there a way to show or display what variable is what array element? I found this code "put (pets[*]) (=/);" Which was helpful, is there anything more explicit that would show array element 1 = variable name?

PaigeMiller · Posted 09-20-2019 01:58 PM

If I have said it once, I have probably said it at least 3.7 times now

Don't use WHICHC to find monkeys.

--
Paige Miller

supp · Posted 09-20-2019 02:05 PM

Ha! Sorry for having to make you say it .7 times more. Maybe this should be added to @Kurt_Bremser's Maxims of Maximally Efficient SAS Programmers !?

hashman · Posted 09-21-2019 01:00 AM

@supp:

You input spec $ defaults to $8 and is insufficient to accommodate names like "armstrong" or "constance", which thus get truncated. Hence, below I made it $10 and sized the concatenated variable _C accordingly, plus likewise made the length of output variable NAME $10.

data have ;                                                                    
  infile cards truncover ;                                                     
  input (owner                                                                 
         pet_type1 pet_name_1                                                  
         pet_type2 pet_name_2                                                  
         pet_type3 pet_name_3                                                  
         pet_type4 pet_name_4                                                  
         pet_type5 pet_name_5                                                  
         pet_type6 pet_name_6                                                  
         pet_type7 pet_name_7)                                                 
        (:$10.) ;                                                              
  cards ;                                                                      
kimberly dog sadie cat manson cat nellie monkey salty                          
landry dog manda                                                               
julie dog emmy dog sanchez                                                     
spencer cat armstrong dog cotton monkey jenkins monkey figgy monkey alex       
penelope fish baldwin                                                          
welsh monkey constance dog mulder dog figgy monkey ac                          
alina cat chloe cat marls dog keen dog yasin cat avery cat harold monkey albert
roy cat eden                                                                   
rachael fish gill                                                              
decker dog isabel monkey colossal                                              
amira monkey curious                                                           
ramsey monkey george monkey cosmic dog jay cat kira cat ross cat kumar         
;                                                                              
run ;                                                                          
                                                                               
data want (keep = owner name) ;                                                
  set have ;                                                                   
  _c = put (catx (" ", of pet_:), $76.) ;                                      
  _x = findw (_c, "monkey") ;                                                  
  if _x then name = put (scan (substr (_c, _x), 2), $10.) ;                    
run ;

If it were a competition for the least number of program statements, _C and _X could be plugged into the IF statement:

data want (keep = owner name) ;                                                                                                         
  set have ;                                                                                                                            
  if findw (catx (" ", of pet_:), "monkey") then name = put (scan (substr (catx (" ", of pet_:), findw (catx (" ", of pet_:), "monkey") 
), 2), $10.) ;                                                                                                                          
run ;

But for anything other than this fictitious "competition", it would make no sense from the standpoints of clarity and unnecessarily repeated identical function calls.

Kind regards

Paul D.

Register Today!

SAS Training: Just a Click Away