I have to roll-up a lot of locations but I'm able to look for keywords to classify them. I can accomplish this with multiple IF statements but would prefer to use IN and a wildcard or FINDW operator. I'm wondering if I can combine these. Here is the original method:
DATA want;
SET have;
if location = "STATE1A, CITY1" then new = "Group 1";
if location = "STATE1A, CITY2" then new = "Group 1";
if location = "STATE2B, CITY3" then new = "Group 1";
if location = "STATE2B, CITY4" then new = "Group 1";
if location = "STATE3A, CITY1" then new = "Group 2";
if location = "STATE3A, CITY2" then new = "Group 2";
if location = "STATE4B, CITY3" then new = "Group 2";
if location = "STATE4B, CITY4" then new = "Group 2";
RUN;
You can see that I'm only concerned about STATE1, STATE2, etc, regardless of the city. I need to search each observation for the correct State name (there can be 5-6 different ones), and I'm not concerned about the city name. The state name is varying lengths in the string. If I wanted to look for just one state name I would do the below, but I need to include a few states within the IN() command.
IF FINDW(lowcase(location),"state1") > 0 then new = Group a;
I would like to try:
IF location IN(%STATE1A%, %STATE2B%) then new = "Group 1";
IF location IN(%STATE3A%, %STATE4B%) then new = "Group 2";
I'm unsure if I should be using the pattern matching with %, or the FINDW command within the IN()?
One questions why state and city have been comingled this way to begin with but you may consider:
/* If scan(location,1) in ("State1A" "STATE2B") then .. */ data example; input location $ 1-20; if scan(location,1,',') in ("STATE1A" "STATE2B") then new = "Group 1"; else if scan(location,1) in ("STATE3A" "STATE4B") then new = "Group 2"; datalines; STATE1A, CITY1 STATE2B, CITY2 STATE3A, CITY1 STATE4B, CITY2 ;
If you must use if then.
You could consider moving the logic to a custom format.
Proc format libarary=work; value $loc_state "STATE1A", "STATE2B"="Group 1" "STATE3A", "STATE4B"="Group 2" ; run; data example2; input location $ 1-20; new = put( scan(location,1,','),$loc_state.); datalines; STATE1A, CITY1 STATE2B, CITY2 STATE3A, CITY1 STATE4B, CITY2 ;
This has an advantage that if you get new "state" values you only have to change the format, no additional if then else coding.
If the addition is to an existing group it is only adding one value to the current list, or create a new group.
I would also strongly recommend why you are creating a value like "Group 1". If you have enough values, i.e. "Group 10" then display and sort order will be out of numeric order. It might be better to create an actual numeric value and a name or label like "Group" instead.
With a custom informat that is also easy, as well as incorporating a warning that an unexpected value has been encountered:
Proc format libarary=work; invalue loc_state (upcase) "STATE1A", "STATE2B"=1 "STATE3A", "STATE4B"=2 ' '=. other= _error_ ; run; data example3; input location $ 1-20; group = input(scan(location,1,','),loc_state.); datalines; STATE1A, CITY1 STATE2B, CITY2 STATE3A, CITY1 STATE4B, CITY2 garbage, CITY2 ;
The invalue option (UPCASE) indicates that the value is to be converted to upper case for the value assignment. Which may help if your data actually contains values like "State1A" state1a" "sTate1A" where case may vary.
One questions why state and city have been comingled this way to begin with but you may consider:
/* If scan(location,1) in ("State1A" "STATE2B") then .. */ data example; input location $ 1-20; if scan(location,1,',') in ("STATE1A" "STATE2B") then new = "Group 1"; else if scan(location,1) in ("STATE3A" "STATE4B") then new = "Group 2"; datalines; STATE1A, CITY1 STATE2B, CITY2 STATE3A, CITY1 STATE4B, CITY2 ;
If you must use if then.
You could consider moving the logic to a custom format.
Proc format libarary=work; value $loc_state "STATE1A", "STATE2B"="Group 1" "STATE3A", "STATE4B"="Group 2" ; run; data example2; input location $ 1-20; new = put( scan(location,1,','),$loc_state.); datalines; STATE1A, CITY1 STATE2B, CITY2 STATE3A, CITY1 STATE4B, CITY2 ;
This has an advantage that if you get new "state" values you only have to change the format, no additional if then else coding.
If the addition is to an existing group it is only adding one value to the current list, or create a new group.
I would also strongly recommend why you are creating a value like "Group 1". If you have enough values, i.e. "Group 10" then display and sort order will be out of numeric order. It might be better to create an actual numeric value and a name or label like "Group" instead.
With a custom informat that is also easy, as well as incorporating a warning that an unexpected value has been encountered:
Proc format libarary=work; invalue loc_state (upcase) "STATE1A", "STATE2B"=1 "STATE3A", "STATE4B"=2 ' '=. other= _error_ ; run; data example3; input location $ 1-20; group = input(scan(location,1,','),loc_state.); datalines; STATE1A, CITY1 STATE2B, CITY2 STATE3A, CITY1 STATE4B, CITY2 garbage, CITY2 ;
The invalue option (UPCASE) indicates that the value is to be converted to upper case for the value assignment. Which may help if your data actually contains values like "State1A" state1a" "sTate1A" where case may vary.
Please try
DATA want;
SET have;
if location in ( "STATE1A, CITY1", "STATE1A, CITY2" "STATE2B, CITY3" "STATE2B, CITY4" ) then new = "Group 1";
if location in ( "STATE3A, CITY1" "STATE3A, CITY2" "STATE4B, CITY3" "STATE4B, CITY4" ) then new = "Group 2";
RUN;
DATA want;
SET have;
if findw( location, "STATE1A" ) or findw(location, "STATE2B") then new = "Group 1";
if findw(location, "STATE3A" ) or findw(location, "STATE4B") then new = "Group 2";
RUN;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.