Re: find a couple of words in a string

Al_senior · Posted 08-06-2021 09:52 AM

I have a character column as follow:

col1

product1 ..(words in btw).. stage1.....

product1...(words in btw)..stage2......

product2.(words in btw)...stage1....

product2 .(words in btw)....stage2....

those dots between product names and stage names represent other words which doesn't matter to me. The product name and stage name can be anywhere in the string.

My end goal is to create a new column as follow:

col1	new_col
product1 ..(words in btw)...stage1.....	product1_stage1
product1..(words in btw)...stage2......	product1_stage2
product2..(words in btw)..stage1....	product2_stage1
product2 ..(words in btw)...stage2....	product2_stage2

So I want to write a code that looks at col1 and search for words "productX" and "stageZ" and if finds both then put new_col would be filled as "productX_stageZ", and the same for other products and stages.

Any help is appreciated!

Cynthia_sas · Posted 08-06-2021 10:43 AM

Hi:
This is hard to visualize given what you've posted. You say you only want "product1" and "stage1" but your end results shows "product2" and "stage2". Are the strings always consistent, IE, product1, product2, product3, product4 or are the products different names like wombat1, koala2, kangaroo3, eucalyptus4??? Same question for the "stage" string -- are the characters "stage" always in the string you're searching for. Providing data with just dots is misleading. You say that the product name and stage name can be anywhere in the string. Are there spaces in the string? Any delimiters that you can use for parsing. What code have you tried? How is the original data coming to you? Is it already in a SAS data set?

Cynthia

Al_senior · Posted 08-06-2021 10:46 AM

Product1 was just an example. I edited my post and generalized it to ProductX and StageZ.

Cynthia_sas · Posted 08-06-2021 10:54 AM

Hi:
You didn't answer my question. Do the products ALWAYS contain the letters P-R-O-D-U-C-T followed by either a number or a letter? Does the string for stages ALWAYS contain the letters S-T-A-G-E followed by either a number or a letter? Can the PRODUCT or the STAGE be followed by more than one number or letter:
Product11
Product222
ProductABC
StageX1
Stage2234
Stage-54

Do you have real data the shows the "words in btw"? What is the maximum length of the original COL1 string? What is the maximum length of the value of PRODUCT or STAGE in the string?? 8 characters,15 characters? Is it always Product1, product1 (without spaces in the string) or is it Product 1, and Product X?

What code have you tried?

Cynthia
I

Al_senior · Posted 08-06-2021 11:07 AM

Product names are different, like what you mentioned as wombat, koala, kangaroo, eucalyptus4. And the product name most of the time follows by a number but not always. It is the same story for stage. Stages have different names.
And yeas, there are spaces in the string.
The data is a SAS Table.
There is not specific character that I can use to parse the string.
What I tried so far was creating a new column for each product as:

indicator = index(col1, "productX")
if indicator gt 0 then product_col = "productX"

which obviously is an awful solution if there many product names.

ballardw · Posted 08-06-2021 11:25 AM

@Al_senior wrote:
Product names are different, like what you mentioned as wombat, koala, kangaroo, eucalyptus4. And the product name most of the time follows by a number but not always. It is the same story for stage. Stages have different names.
And yeas, there are spaces in the string.
The data is a SAS Table.
There is not specific character that I can use to parse the string.
What I tried so far was creating a new column for each product as:

indicator = index(col1, "productX")
if indicator gt 0 then product_col = "productX"

which obviously is an awful solution if there many product names.

Still not saying where the actual values of product or stage comes from. Are they in another data set? Macro variable? or what? How many of each are there? Are there restrictions that some stages only appear with some products?

And a truly nasty possible headache causer: do any of the products or stages appear more than once in a given value?

jimbarbour · Posted 08-06-2021 11:53 AM

Can you show us actual data? If the data is confidential somehow, then you can substitute wombat, koala, kangaroo, eucalyptus, etc. for the parts that are confidential. Otherwise, I don't think we have a clear idea of what it is that you're looking for, and it's going to be really difficult for us to help.

Jim

Al_senior · Posted 08-06-2021 01:15 PM

My data is confidential so I go with an example of car manufacturing which could be similar to the following table:

Col1

Camry red Manual

Corolla Blue Automatic

Camry Blue (Japan) - Manual

Prius (America) Toyota Automatic

And I want to get a table as bellow:

Col1	new_col
Camry red Manual	Camry Manual
Corolla Blue Automatic	Corolla Automatic
Civic Blue (Japan) - Manual	Civic Manual
Prius (America) - Toyota Automatic	Prius Automatic

I wrote the following code

data want;
set have;
car= index(Col1, "Automatic");
if car gt 0 then type= "Automatic";
else type="Manual";
run;

So now I think if I can create a column named CAR with the following logic (please pay attention it is not the code, it is just the logic)

if "Camry" is in Col1

then CAR = "Camry"

else if "Corolla" is in Col1

then CAR="Corolla"

else if "Prius" is in Col1

then CAR="Prius"

else if "Civic" in Col1

then CAR="Civic"

so if someone can help me with this if statement I guess that could be the solution for my problem.

And finally when I have column "CAR" and column "Type" I can combine them as the new column that I am looking for.

Cynthia_sas · Posted 08-06-2021 01:30 PM

Hi:

With data as you show above, the program doesn't need an IF statement.

However, if the data are more complex, then the solution will be more complex.

Cynthia

jimbarbour · Posted 08-06-2021 02:29 PM

OK, I think I might have something for you. Below is some SAS code. Below the code are the results. This code will parse out the values you're looking for no matter the word order and no matter placement. In other words if the type of car (automatic vs. manual) comes before the model of the car (Prius, Camry, etc.), it would still be picked up.

This code requires that you define a table of the cars (or whatever it really is) that you're looking for as well as the possible types and then does a double array search of each line of data. Note that if a definition is omitted, e.g. the Volt, "unknown" will be returned.

IMPORTANT: This code does not allow embedded blanks. If you wanted, for example "Accord LX", you would need to code "Accord_LX" (or "Accord-LX") in your both data and in your table definitions.

See what you think.

Jim

DATA	Have;
	INFILE	DATALINES	TRUNCOVER;
	INPUT
		Data_To_Examine	$CHAR128.;
		;
DATALINES;
Camry red Manual
Corolla Blue Automatic
Yugo (Yugoslavia) Manual
Tiguan Germany Automatic
Camry Blue (Japan) - Manual
Prius (America) Toyota Automatic
Volt (US) Automatic
Blue Automatic Honda Civic 4-Door
Gray Manual Accord LX 2012
;
RUN;

DATA	Cars_Table;
	INFILE	DATALINES	TRUNCOVER;
	INPUT
		Car	$CHAR32.;
		;
DATALINES;
Camry
Corolla
Prius
Accord
Civic
;
RUN;

PROC	SQL	NOPRINT;
	SELECT	COUNT(Car)
		INTO	:	Car_Count	TRIMMED
			FROM	Cars_Table;
QUIT;
%PUT	NOTE:  &=Car_Count;

PROC	TRANSPOSE	DATA=Cars_Table
					OUT	=Cars_Array	(DROP=_:)
					Prefix=Car_
					;
	Var	Car;
RUN;

DATA	Types_Table;
	INFILE	DATALINES	TRUNCOVER;
	INPUT
		Type	$CHAR32.;
		;
DATALINES;
Automatic
Manual
;	
RUN;

PROC	SQL	NOPRINT;
	SELECT	COUNT(Type)
		INTO	:	Type_Count	TRIMMED
			FROM	Types_Table;
QUIT;
%PUT	NOTE:  &=Type_Count;

PROC	TRANSPOSE	DATA=Types_Table
					OUT	=Types_Array	(DROP=_:)
					Prefix=Type_
					;
	Var	Type;
RUN;

DATA	Want;
	DROP	_:;
	DROP	Car_1	-	Car_&Car_Count;
	DROP	Type_1	-	Type_&Type_Count;

	IF	_N_								=	1										THEN
		DO;
			SET	Cars_Array;
			SET	Types_Array;
		END;
	ARRAY	Cars	{&Car_Count}	$32	Car_1	-	Car_&Car_Count;
	ARRAY	Types	{&Type_Count}	$32	Type_1	-	Type_&Type_Count;

	SET		Have;

	LENGTH	Car_Var		$32;
	LENGTH	Type_Var	$32;

	Car_Var								=	'Unknown';
	Type_Var							=	'Unknown';

	DO	_i								=	1	TO	&Car_Count;
		IF	INDEX(UPCASE(Data_To_Examine), UPCASE(STRIP(Cars{_i})))					THEN
			DO;
				Car_Var					=	Cars{_i};
				DO	_j					=	1	TO	&Type_Count;
					IF	INDEX(UPCASE(Data_To_Examine), UPCASE(STRIP(Types{_j})))	THEN
						DO;
							Type_Var	=	Types{_j};
						END;
				END;
			END;
	END;
RUN;

Results:

ballardw · Posted 08-06-2021 03:12 PM

IF you are going to type in literal strings to look for then a temporary array is one way to do this.

DATA	Have;
	INFILE	DATALINES	TRUNCOVER;
	INPUT
		Col1 $CHAR128.;
		;
DATALINES;
Camry red Manual
Corolla Blue Automatic
Yugo (Yugoslavia) Manual
Tiguan Germany Automatic
Camry Blue (Japan) - Manual
Prius (America) Toyota Automatic
Volt (US) Automatic
Blue Automatic Honda Civic 4-Door
Gray Manual Accord LX 2012
;
RUN;

data example;
   set have;
   length car $ 20;
   /* left out Volt to show how that the name must be in the temporary array to be found
       and added a value not in the data to show that have a value to search for not present
       does not cause problems
   */
   array p(8) $ 20 _temporary_ ("Camry","Corolla","Yugo","Tiguan","Prius","Civic","Accord","Not_in_data");
   do i=1 to dim(p);
     if find(col1, trim(p[i]),'i')>0 then do;
        Car=p[i];
        leave; /* this assumes only expect/want to find one car value in col1*/
     end;
   end;
   drop i;
run;

Note: SAS does have an IN operator but it is the equivalent of a bunch of : if var=value1 or var=value2 or var=value3 ....

Not the way your hopefully pseudo code was intended to work. Presence of a string in another is the work for Index, Indexw, Find, of Findw generally.

Exercise for the interested reader is to make a second temporary array for the "stage" equivalent text.

Al_senior · Posted 08-06-2021 10:43 AM

I have a dataset as follow:

col1

product1 ...(other words)......

product1....(other words)....

product2....(other words)....

.....

product120 ...(other words)...

And I want to create a new column as follow:

col1	new_col
product1 .....(other words)....	product1
product1....(other words)....	product2
product2....(other words)....	product3
.....	...
product120 ...(other words)...	product120

What I'd like to do is to write a conditional statement that if the word "productX" was in the string then it puts "productX" in the new column. The word product could be anywhere in the string.

ballardw · Posted 08-06-2021 11:49 AM

Duplicate post moved to this thread.

ballardw · Posted 08-06-2021 11:03 AM

Strong suggestion: Create TWO new variables. One for the product, other for the stage.

I don't know what you may envision doing but multiple values in a single variable almost invariably adds lots of extra work.

Consider that you want to to a summary count based on Product values. If the value is Product1_Stage1 or Product1_Stage2 that means you have to do something else to get just the product, otherwise every product will be duplicated with each level of stage. So save some headaches and split them up now. Or provide a real good use case for how you intend to use the combined value.

So, just where do the actual Product names that you want reside? Your statement " The product name and stage name can be anywhere in the string." means that we have no clue what you are looking for and can't parse what you show as it is basically meaningless. The actual "product" could very well be in the (words in btw) as far as we know. So how do you expect to tell a program what specific products you are looking for? There is a change if the verbiage actually is "Stage" that something could be done with that but that is rife with potential other words like "stage manager" "stage lights" or similar.

BTW your example data strongly implies that Product is always at the beginning of the variable and that does not match your "anywhere in the string" description.

Typical "find" of specific words in a string involve, strangely enough, a function named FINDW (for Find word), to identify the location and then use a function such as SCAN or SUBSTR to extract the value.

maguiremq · Posted 08-06-2021 11:30 AM

This is the best that I have, but I think it really depends on your data. If you're not familiar with regular expressions (I am a bit of a beginner/intermediate), I strongly consider learning about them to deal with weird parsing issues.

data have;
input col1 $50.;
datalines;
product1..... stage1.....
product1.....stage2......
product2....stage1....
product2 .....stage2....
;
run;

data want;
	set have;
		words = prxchange("s/[^(product\d{1}|stage\d{1})]//i", -1, col1);
		join_location = prxmatch("/\d{1}\w{1}/", words);
		new_col = catx("_", substr(words, 1, join_location), substr(words, join_location + 1, length(words)));
run;

col1 	words 	join_location 	new_col
product1..... stage1..... 	product1stage1 	8 	product1_stage1
product1.....stage2...... 	product1stage2 	8 	product1_stage2
product2....stage1.... 	product2stage1 	8 	product2_stage1
product2 .....stage2.... 	product2stage2 	8 	product2_stage2

I don't have enough of your data to know for sure if that's what you need or will solve your issue, but it seems to match your have example.

find a couple of words in a string