I have a character column as follow:
col1 |
product1 ..(words in btw).. stage1..... |
product1...(words in btw)..stage2...... |
product2.(words in btw)...stage1.... |
product2 .(words in btw)....stage2.... |
those dots between product names and stage names represent other words which doesn't matter to me. The product name and stage name can be anywhere in the string.
My end goal is to create a new column as follow:
col1 | new_col |
product1 ..(words in btw)...stage1..... | product1_stage1 |
product1..(words in btw)...stage2...... | product1_stage2 |
product2..(words in btw)..stage1.... | product2_stage1 |
product2 ..(words in btw)...stage2.... | product2_stage2 |
So I want to write a code that looks at col1 and search for words "productX" and "stageZ" and if finds both then put new_col would be filled as "productX_stageZ", and the same for other products and stages.
Any help is appreciated!
@Al_senior wrote:
Product names are different, like what you mentioned as wombat, koala, kangaroo, eucalyptus4. And the product name most of the time follows by a number but not always. It is the same story for stage. Stages have different names.
And yeas, there are spaces in the string.
The data is a SAS Table.
There is not specific character that I can use to parse the string.
What I tried so far was creating a new column for each product as:
indicator = index(col1, "productX")
if indicator gt 0 then product_col = "productX"
which obviously is an awful solution if there many product names.
Still not saying where the actual values of product or stage comes from. Are they in another data set? Macro variable? or what? How many of each are there? Are there restrictions that some stages only appear with some products?
And a truly nasty possible headache causer: do any of the products or stages appear more than once in a given value?
Can you show us actual data? If the data is confidential somehow, then you can substitute wombat, koala, kangaroo, eucalyptus, etc. for the parts that are confidential. Otherwise, I don't think we have a clear idea of what it is that you're looking for, and it's going to be really difficult for us to help.
Jim
My data is confidential so I go with an example of car manufacturing which could be similar to the following table:
Col1 |
Camry red Manual |
Corolla Blue Automatic |
Camry Blue (Japan) - Manual |
Prius (America) Toyota Automatic |
And I want to get a table as bellow:
Col1 | new_col |
Camry red Manual | Camry Manual |
Corolla Blue Automatic | Corolla Automatic |
Civic Blue (Japan) - Manual | Civic Manual |
Prius (America) - Toyota Automatic | Prius Automatic |
I wrote the following code
data want;
set have;
car= index(Col1, "Automatic");
if car gt 0 then type= "Automatic";
else type="Manual";
run;
So now I think if I can create a column named CAR with the following logic (please pay attention it is not the code, it is just the logic)
if "Camry" is in Col1
then CAR = "Camry"
else if "Corolla" is in Col1
then CAR="Corolla"
else if "Prius" is in Col1
then CAR="Prius"
else if "Civic" in Col1
then CAR="Civic"
so if someone can help me with this if statement I guess that could be the solution for my problem.
And finally when I have column "CAR" and column "Type" I can combine them as the new column that I am looking for.
Hi:
With data as you show above, the program doesn't need an IF statement.
However, if the data are more complex, then the solution will be more complex.
Cynthia
OK, I think I might have something for you. Below is some SAS code. Below the code are the results. This code will parse out the values you're looking for no matter the word order and no matter placement. In other words if the type of car (automatic vs. manual) comes before the model of the car (Prius, Camry, etc.), it would still be picked up.
This code requires that you define a table of the cars (or whatever it really is) that you're looking for as well as the possible types and then does a double array search of each line of data. Note that if a definition is omitted, e.g. the Volt, "unknown" will be returned.
IMPORTANT: This code does not allow embedded blanks. If you wanted, for example "Accord LX", you would need to code "Accord_LX" (or "Accord-LX") in your both data and in your table definitions.
See what you think.
Jim
DATA Have;
INFILE DATALINES TRUNCOVER;
INPUT
Data_To_Examine $CHAR128.;
;
DATALINES;
Camry red Manual
Corolla Blue Automatic
Yugo (Yugoslavia) Manual
Tiguan Germany Automatic
Camry Blue (Japan) - Manual
Prius (America) Toyota Automatic
Volt (US) Automatic
Blue Automatic Honda Civic 4-Door
Gray Manual Accord LX 2012
;
RUN;
DATA Cars_Table;
INFILE DATALINES TRUNCOVER;
INPUT
Car $CHAR32.;
;
DATALINES;
Camry
Corolla
Prius
Accord
Civic
;
RUN;
PROC SQL NOPRINT;
SELECT COUNT(Car)
INTO : Car_Count TRIMMED
FROM Cars_Table;
QUIT;
%PUT NOTE: &=Car_Count;
PROC TRANSPOSE DATA=Cars_Table
OUT =Cars_Array (DROP=_:)
Prefix=Car_
;
Var Car;
RUN;
DATA Types_Table;
INFILE DATALINES TRUNCOVER;
INPUT
Type $CHAR32.;
;
DATALINES;
Automatic
Manual
;
RUN;
PROC SQL NOPRINT;
SELECT COUNT(Type)
INTO : Type_Count TRIMMED
FROM Types_Table;
QUIT;
%PUT NOTE: &=Type_Count;
PROC TRANSPOSE DATA=Types_Table
OUT =Types_Array (DROP=_:)
Prefix=Type_
;
Var Type;
RUN;
DATA Want;
DROP _:;
DROP Car_1 - Car_&Car_Count;
DROP Type_1 - Type_&Type_Count;
IF _N_ = 1 THEN
DO;
SET Cars_Array;
SET Types_Array;
END;
ARRAY Cars {&Car_Count} $32 Car_1 - Car_&Car_Count;
ARRAY Types {&Type_Count} $32 Type_1 - Type_&Type_Count;
SET Have;
LENGTH Car_Var $32;
LENGTH Type_Var $32;
Car_Var = 'Unknown';
Type_Var = 'Unknown';
DO _i = 1 TO &Car_Count;
IF INDEX(UPCASE(Data_To_Examine), UPCASE(STRIP(Cars{_i}))) THEN
DO;
Car_Var = Cars{_i};
DO _j = 1 TO &Type_Count;
IF INDEX(UPCASE(Data_To_Examine), UPCASE(STRIP(Types{_j}))) THEN
DO;
Type_Var = Types{_j};
END;
END;
END;
END;
RUN;
Results:
IF you are going to type in literal strings to look for then a temporary array is one way to do this.
DATA Have; INFILE DATALINES TRUNCOVER; INPUT Col1 $CHAR128.; ; DATALINES; Camry red Manual Corolla Blue Automatic Yugo (Yugoslavia) Manual Tiguan Germany Automatic Camry Blue (Japan) - Manual Prius (America) Toyota Automatic Volt (US) Automatic Blue Automatic Honda Civic 4-Door Gray Manual Accord LX 2012 ; RUN; data example; set have; length car $ 20; /* left out Volt to show how that the name must be in the temporary array to be found and added a value not in the data to show that have a value to search for not present does not cause problems */ array p(8) $ 20 _temporary_ ("Camry","Corolla","Yugo","Tiguan","Prius","Civic","Accord","Not_in_data"); do i=1 to dim(p); if find(col1, trim(p[i]),'i')>0 then do; Car=p[i]; leave; /* this assumes only expect/want to find one car value in col1*/ end; end; drop i; run;
Note: SAS does have an IN operator but it is the equivalent of a bunch of : if var=value1 or var=value2 or var=value3 ....
Not the way your hopefully pseudo code was intended to work. Presence of a string in another is the work for Index, Indexw, Find, of Findw generally.
Exercise for the interested reader is to make a second temporary array for the "stage" equivalent text.
I have a dataset as follow:
col1 |
product1 ...(other words)...... |
product1....(other words).... |
product2....(other words).... |
..... |
product120 ...(other words)... |
And I want to create a new column as follow:
col1 | new_col |
product1 .....(other words).... | product1 |
product1....(other words).... | product2 |
product2....(other words).... | product3 |
..... | ... |
product120 ...(other words)... | product120 |
What I'd like to do is to write a conditional statement that if the word "productX" was in the string then it puts "productX" in the new column. The word product could be anywhere in the string.
Duplicate post moved to this thread.
Strong suggestion: Create TWO new variables. One for the product, other for the stage.
I don't know what you may envision doing but multiple values in a single variable almost invariably adds lots of extra work.
Consider that you want to to a summary count based on Product values. If the value is Product1_Stage1 or Product1_Stage2 that means you have to do something else to get just the product, otherwise every product will be duplicated with each level of stage. So save some headaches and split them up now. Or provide a real good use case for how you intend to use the combined value.
So, just where do the actual Product names that you want reside? Your statement " The product name and stage name can be anywhere in the string." means that we have no clue what you are looking for and can't parse what you show as it is basically meaningless. The actual "product" could very well be in the (words in btw) as far as we know. So how do you expect to tell a program what specific products you are looking for? There is a change if the verbiage actually is "Stage" that something could be done with that but that is rife with potential other words like "stage manager" "stage lights" or similar.
BTW your example data strongly implies that Product is always at the beginning of the variable and that does not match your "anywhere in the string" description.
Typical "find" of specific words in a string involve, strangely enough, a function named FINDW (for Find word), to identify the location and then use a function such as SCAN or SUBSTR to extract the value.
This is the best that I have, but I think it really depends on your data. If you're not familiar with regular expressions (I am a bit of a beginner/intermediate), I strongly consider learning about them to deal with weird parsing issues.
data have;
input col1 $50.;
datalines;
product1..... stage1.....
product1.....stage2......
product2....stage1....
product2 .....stage2....
;
run;
data want;
set have;
words = prxchange("s/[^(product\d{1}|stage\d{1})]//i", -1, col1);
join_location = prxmatch("/\d{1}\w{1}/", words);
new_col = catx("_", substr(words, 1, join_location), substr(words, join_location + 1, length(words)));
run;
col1 words join_location new_col product1..... stage1..... product1stage1 8 product1_stage1 product1.....stage2...... product1stage2 8 product1_stage2 product2....stage1.... product2stage1 8 product2_stage1 product2 .....stage2.... product2stage2 8 product2_stage2
I don't have enough of your data to know for sure if that's what you need or will solve your issue, but it seems to match your have example.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.