Hello,
I have 3 character data fields: field_1; field_2; field_3 --> all character fields (format $5.) with both letters and numbers. I have been able to extract records using the if field_1= '20B' (as an example).
I am trying to create a categorical variable based on specific conditions:
Category 1 = extract records where any 2 of the 3 fields must contain/start with '20B' and '300'
Category 2 extract records where any 2 of the friends must contain/start with '20B' and '400' OR (contain/start with '20B' in any 1 of the 3 fields, so long that the other 2 fields don't equal '300' or '400' [or the other 2 fields are empty])
Category 3, 4, 5.. etc. will be based on conditions based on field_1 only:
Category 3 = Field_1 >= '150' & <= '170'
Category 4= Field_1 >= '500' & <= '600'
I tried doing this using arrays and a few other approaches, but it's not working for me. Not sure how to go about approaching this. Any help would be appreciated! Thanks in advance!
See if this gets you started. This creates a number of helper 1/0 (true/false) variables that make the assignment easier.
data have; input field1 $ field2 $ field3$; /* count of starts with 20B or 300*/ C20b_300 = sum(field1=:'20B',field2=:'20B',field3=:'20B', field1=:'300',field2=:'300',field3=:'300') ge 2 ; /* count of starts with 20B or 400*/ C20b_400 = sum(field1=:'20B',field2=:'20B',field3=:'20B', field1=:'400',field2=:'400',field3=:'400') ge 2 ; one_20b_only=(sum(field1=:'20B',field2=:'20B',field3=:'20B')=1) and (sum(field1=:'300',field2=:'300',field3=:'300', field1=:'400',field2=:'400',field3=:'400')=0) ; if C20b_300 then Category= 1; else if C20b_400 or one_20b_only then Category=2; datalines; 20B 20B 20B 20B 100 100 20B . . 20B 300 ABC 20B 400 PDA 17 . . 17 20B 20B 163 . . ;
Some explanation:
First the =: is the SAS comparison for "starts with" so Field1 =: '20B' is true when the value starts with '20B', case sensitive.
Second is the bit that SAS returns 1 for True and 0 for False. So if you sum a bunch of 1/0 coded values then the total is the number of true conditions. This is partially why the question about the hypothetical '20B300' and CONTAINS as the code for checking contains would report both true and would require more complex logic to only count one "true" for such a value. Not that hard with experience but uglier.
I am treating "any 2" as including "all 3 begin with 20B or 300. If that is not the case and you want exactly 2 then the comparison would =2 instead of GE 2. (I tend to use the GE or LE because the various >= <= and <> (especially the last) are easier to type.
We then use the comparisons to totals to see which is true. So then we can use them for the value assignment.
This way you can see the logic results as variables to see which one(s) may be returning unexpected results and check all your logic. After you are satisfied then you DROP the unneeded variables.
Actually if I get these true false variables assigned correctly it is possible to write a single long,and very confusing to the uninitiated, statement to assign numeric category values using a combination of addition and multiplication. We'll leave that for later.
You do want to pay attention to boundary values like the "any 2" vs "all" and be prepared to test or adjust logic.
Best to provide some example data in the form of a data step such as:
data have; input field1 $ field2 $ field3$; datalines; abc pdq xyz F20 20Bzz 04b ;
With the above none of the value should hold more than 8 characters, Separate the data by spaces. A similar data set could provide the calculated categories.
You should provide at least one case that matches each of your rules. Since you say start or contains "20B" that is 2 conditions to include. You should also provide one or more that does not match any of the expected categories (should not be assigned).
You also have tell us if you can have a field value that contains both '20B' and '300' such as '20BXY300' and if that gets counted once or twice in any rule.
Danger Will Robinson!!
Comparisons like Field_1 >= '150' & <= '170' (or that will actually run: '150' <= Field_1 <= '170' or Field_1 >= '150' & Field_1 <= '170' ) are not a good idea with character values. '17' would be true for that comparison. Character values are compared character by character left to right until one of them runs out. so '17' is in effect equal to '170'. If you want to use the continuous numeric values that are sort of implied then explicitly create a numeric variable from Field_1 and use comparisons with 150 to 170 numeric.
Some code to run and look in the log:
data example; input field1 $ ; if notdigit(strip(field1)) then; else numfield1 = input(field1,8.); if '150' le field1 le '170' then put field1= ' is in interval'; if 150 le numfield1 le 170 then put numfield1= 'is in interval'; datalines; abc pdq xyz F20 20Bzz 04b 163 . . 17 . . ;
The NOTDIGIT function will return the position of the first non-digit character or 0 if none are found. Since these variables have a length of 8 by default then we use strip because the function will examine the whole value and treat the trailing blanks as "notdigit" values. SAS will use any number other than 0 and . as "false" so we can write logic that way. Note that I include a character comparison for 17 to show what happens with the basic values.
What you are describing is a hierarchy. That means you test one condition and assign if true and if not then do the next comparison(s).
Pseudo-code for that looks like:
If <condition(s)> is true then Category1; else if <other condition(s)> is true then Category2; else if <third condition> is true then Category3; else if <4th condition> is true then Category4; else if <5th condition> is true then Category5;
Thank you for your response ballardw!
No, I don't have a field that field value that contains both '20B' and '300' such as '20BXY300' . Extracting records that START with '20B' or '300' is actually what I want (not contains). I should have been clearer on that end - my apologies.
Could you please provide an example of a formulated script for the conditions? I think the condition that I struggle most with conceptualising the script would be:
Category 2 extract records where any 2 of the friends must contain/start with '20B' and '400' OR (contain/start with '20B' in any 1 of the 3 fields, so long that the other 2 fields don't equal '300' or '400' [or the other 2 fields are empty])
See if this gets you started. This creates a number of helper 1/0 (true/false) variables that make the assignment easier.
data have; input field1 $ field2 $ field3$; /* count of starts with 20B or 300*/ C20b_300 = sum(field1=:'20B',field2=:'20B',field3=:'20B', field1=:'300',field2=:'300',field3=:'300') ge 2 ; /* count of starts with 20B or 400*/ C20b_400 = sum(field1=:'20B',field2=:'20B',field3=:'20B', field1=:'400',field2=:'400',field3=:'400') ge 2 ; one_20b_only=(sum(field1=:'20B',field2=:'20B',field3=:'20B')=1) and (sum(field1=:'300',field2=:'300',field3=:'300', field1=:'400',field2=:'400',field3=:'400')=0) ; if C20b_300 then Category= 1; else if C20b_400 or one_20b_only then Category=2; datalines; 20B 20B 20B 20B 100 100 20B . . 20B 300 ABC 20B 400 PDA 17 . . 17 20B 20B 163 . . ;
Some explanation:
First the =: is the SAS comparison for "starts with" so Field1 =: '20B' is true when the value starts with '20B', case sensitive.
Second is the bit that SAS returns 1 for True and 0 for False. So if you sum a bunch of 1/0 coded values then the total is the number of true conditions. This is partially why the question about the hypothetical '20B300' and CONTAINS as the code for checking contains would report both true and would require more complex logic to only count one "true" for such a value. Not that hard with experience but uglier.
I am treating "any 2" as including "all 3 begin with 20B or 300. If that is not the case and you want exactly 2 then the comparison would =2 instead of GE 2. (I tend to use the GE or LE because the various >= <= and <> (especially the last) are easier to type.
We then use the comparisons to totals to see which is true. So then we can use them for the value assignment.
This way you can see the logic results as variables to see which one(s) may be returning unexpected results and check all your logic. After you are satisfied then you DROP the unneeded variables.
Actually if I get these true false variables assigned correctly it is possible to write a single long,and very confusing to the uninitiated, statement to assign numeric category values using a combination of addition and multiplication. We'll leave that for later.
You do want to pay attention to boundary values like the "any 2" vs "all" and be prepared to test or adjust logic.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.