If we have some data on pets and their owners such as:
data have;
infile datalines truncover;
input owner $ pet_type1 $ pet_name_1 $ pet_type2 $ pet_name_2 $
pet_type3 $ pet_name_3 $ pet_type4 $ pet_name_4 $
pet_type5 $ pet_name_5 $ pet_type6 $ pet_name_6 $
pet_type7 $ pet_name_7 $ ;
datalines;
kimberly dog sadie cat manson cat nellie monkey salty
landry dog manda
julie dog emmy dog sanchez
spencer cat armstrong dog cotton monkey jenkins monkey figgy monkey alex
penelope fish baldwin
welsh monkey constance dog mulder dog figgy monkey ac
alina cat chloe cat marls dog keen dog yasin cat avery cat harold monkey albert
roy cat eden
rachael fish gill
decker dog isabel monkey colossal
amira monkey curious
ramsey monkey george monkey cosmic dog jay cat kira cat ross cat kumar
;
run;
One owner can have up to 7 pets, all listed on one row. The pets have been ranked in order of the owners favorite. The structure is pet_type1, pet_name1 (most favorite) to pet_type7, pet_name7 (least favorite). The desired output is listing all owners in the original data with just the name of their favorite monkey. If they have multiple monkeys (pet # 1 and pet #3), we want to return the name from the lowest pet #. In the example data we would want to return:
kimberly salty
landry
julie
spencer jenkins
penelope
welsh constance
alina albert
roy
rachael
decker colossal
amira curious
ramsey george
I am sure I can do this with a bunch of if statements. Is there a better approach to this? Maybe there is already a function that can handle this?
Thanks in advance!
Hello,
data want;
set have;
keep owner fav_monkey;
array pets pet:;
fst_monkey_idx=whichc("monkey", of pets(*));
if fst_monkey_idx then fav_monkey=pets(fst_monkey_idx+1);
run;
data want;
set have;
length favorite_monkey $ 16;
array pet_type pet_type:;
array pet_name pet_name:;
do i=1 to dim(pet_type);
if pet_type(i)='monkey' then do;
favorite_monkey=pet_name(i);
leave;
end;
end;
keep owner favorite_monkey;
run;
Sir, The linear method of yours interesting runs a lot faster than whichc(of course not taking away a fraction of full credit to @gamotte idea) on my citizens machine for some reason. I have no clue why. I would have thought whichc to be faster. Or is it just my machine? Any thoughts?
I don't know if this accounts for much of the speed difference, but WHICHC checks both pet_type and pet_name for monkey, whereas my two array method only checks pet_type for monkey.
data want(keep=owner Favoritemonkey);
set have;
array p{*} $ pet_type1--pet_name_7;
idx=whichc("monkey", of p(*));
Favoritemonkey=ifc(idx=0, '', p[idx+1]);
run;
@supp wrote:
thanks @PeterClemmensen. Can you explain the double dash '--' for declaring pet_type as array elements? Could we just use a single dash? Is there a difference?
The code is not declaring pet-type as an array. It is declaring pet_type and pet_name into the same array.
The double-dash is used because it takes all variables from pet_type1 through pet_name7 and includes all of those in the array.
I haven't validated it for all scenarios, but this works for this specific example. You will have to test it on others.
The key is using a loop and arrays to restructure the data, then begin testing for a monkey. First time you see a monkey end the restructure loop and output the row. If I run the entire loop and don't find a monkey then just output the last row which is the owners name but no value for pet.
data test;
set have;
array Pet_Type[*] pet_type:; /*Set values of all pet types in an array*/
array Pet_Name[*] pet_name:; /*Set values of all pet names in an array*/
FavoriteNum=0; /*Start a numbered list for favorite value of pet. Don't need this, just used it to see*/
do i=1 to dim(Pet_Type); /*Run a loop and restructure the data based on the number of values in the array. Then test if it's a monkey. If it's a monkey output the row and end the loop since it's the first*/
FavoriteNum+1; /*Increase the favorite overall number of the pet*/
PetName=Pet_Name[i];
PetType=Pet_Type[i];
if upcase(PetType)="MONKEY" then do;
output;
leave; /*Leave loop*/
end;
else if dim(Pet_Type)=FavoriteNum and PetType ne "MONKEY" then output; /*If it makes it to the last row of the owner and no monkey is found, then output the row with nothing*/
end;
drop i pet_:;
run;
I applied a couple of these solutions to the real world scenario. The data is similar to what I posted and a little over 700,000 rows. If we call @gamotte's approach as approach #1 and @PaigeMiller's approach as approach #2.
Where approach #1 sets all variables into a large array and uses whichc function to find and return the array element and value of interest.
and
Where approach # 2 defines multiple smaller arrays and a do loop to find element of interest in one array and return value of interest from second array.
Both solutions are working and returning the same results (at least at first glance). Both solutions are extremely fast and for practical purposes there is no real difference in performance. @PaigeMiller 's approach does seem to be slightly faster on average. I am getting user CPU times around .2 seconds. @gamotte 's solution is getting around .3 seconds user CPU time. This doesn't seem to be significant enough to matter for a user.
I think I like using the smaller, more defined array's in @PaigeMiller 's approach as I find it easier to visualize what is happening in the code. If we add another type of variable to the data, like pet_breed, it would increase the number of elements into one large array. In the real world scenario I started getting confused with which variable was which element in the array. Any thoughts or feedback on this would be appreciated.
Along these lines is there a way to show or display what variable is what array element? I found this code "put (pets[*]) (=/);" Which was helpful, is there anything more explicit that would show array element 1 = variable name?
I applied a couple of these solutions to the real world scenario. The data is similar to what I posted and a little over 700,000 rows. If we call @gamotte's approach as approach #1 and @PaigeMiller's approach as approach #2.
Where approach #1 sets all variables into a large array and uses whichc function to find and return the array element and value of interest.
and
Where approach # 2 defines multiple smaller arrays and a do loop to find element of interest in one array and return value of interest from second array.
Both solutions are working and returning the same results (at least at first glance). Both solutions are extremely fast and for practical purposes there is no real difference in performance. PaigeMiller's approach does seem to be slightly faster on average. I am getting user CPU times around .2 seconds. gamotte's solution is getting around .3 seconds user CPU time. This doesn't seem to be significant enough to matter for a user.
I think I like using the smaller, more defined array's in PaigeMiller 's approach as I find it easier to visualize what is happening in the code. If we add another type of variable to the data, like pet_breed, it would increase the number of elements into one large array. In the real world scenario I started getting confused with which variable was which element in the array. Any thoughts or feedback on this would be appreciated.
Along these lines is there a way to show or display what variable is what array element? I found this code "put (pets[*]) (=/);" Which was helpful, is there anything more explicit that would show array element 1 = variable name?
If I have said it once, I have probably said it at least 3.7 times now
Don't use WHICHC to find monkeys.
Ha! Sorry for having to make you say it .7 times more. Maybe this should be added to @Kurt_Bremser's Maxims of Maximally Efficient SAS Programmers !?
You input spec $ defaults to $8 and is insufficient to accommodate names like "armstrong" or "constance", which thus get truncated. Hence, below I made it $10 and sized the concatenated variable _C accordingly, plus likewise made the length of output variable NAME $10.
data have ;
infile cards truncover ;
input (owner
pet_type1 pet_name_1
pet_type2 pet_name_2
pet_type3 pet_name_3
pet_type4 pet_name_4
pet_type5 pet_name_5
pet_type6 pet_name_6
pet_type7 pet_name_7)
(:$10.) ;
cards ;
kimberly dog sadie cat manson cat nellie monkey salty
landry dog manda
julie dog emmy dog sanchez
spencer cat armstrong dog cotton monkey jenkins monkey figgy monkey alex
penelope fish baldwin
welsh monkey constance dog mulder dog figgy monkey ac
alina cat chloe cat marls dog keen dog yasin cat avery cat harold monkey albert
roy cat eden
rachael fish gill
decker dog isabel monkey colossal
amira monkey curious
ramsey monkey george monkey cosmic dog jay cat kira cat ross cat kumar
;
run ;
data want (keep = owner name) ;
set have ;
_c = put (catx (" ", of pet_:), $76.) ;
_x = findw (_c, "monkey") ;
if _x then name = put (scan (substr (_c, _x), 2), $10.) ;
run ;
If it were a competition for the least number of program statements, _C and _X could be plugged into the IF statement:
data want (keep = owner name) ;
set have ;
if findw (catx (" ", of pet_:), "monkey") then name = put (scan (substr (catx (" ", of pet_:), findw (catx (" ", of pet_:), "monkey")
), 2), $10.) ;
run ;
But for anything other than this fictitious "competition", it would make no sense from the standpoints of clarity and unnecessarily repeated identical function calls.
Kind regards
Paul D.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.