In a dataset with more than a million rows, I have a character field called SERVICE that has a length of 1000 bytes. The value of SERVICE looks like: ABC123,XYZ456,BRK000,ENG789,ABC963,ABC778 Basically, 3 characters followed by 3 digits combined. That is the pattern. There are multiple values like that separated by comma. I've provided a small example above but the total length can go up to 1000 at the most. I want to create a new variable called COUNT_ABC which contains a value that represents the number of occurences of the word ABC in the field SERVICE. For the above example, the value of COUNT_ABC should be 3. The challenge I'm facing here is to accomplish this in PROC SQL because this is an addition to an existing modeling code that is creating about 800 variables using lots of calculations and aggregations. I cannot use DATA step. If the length of the field is less than 200 bytes, I would use the following logic: PROC SQL;
CREATE TABLE FINAL AS
SELECT (LENGTH(SERVICE) - LENGTH(TRANSTRN(SERVICE,"ABC",TRIMN(" "))))/3 AS COUNT_ABC
FROM INPUT;
QUIT; Since TRANSTRN is a SAS character function that truncates the output to 200 bytes, this method won't work in my case where the length can go up to 1000 bytes. I'm anxious to solve this in PROC SQL and without the use of user/custom defined functions. Maybe there is way to solve this using a PRX function? I'm not an expert in this though. The following logic removes all ABCs. SELECT PRXCHANGE('s/(?:ABC)/$1/i',-1,SERVICE) AS ONLY_ABC So the resulting text is like: 123,XYZ456,BRK000,ENG789,963,778 If I could find a way to do the opposite(that is to keep only the ABCs and strip off everything else including numbers and commas) Like: ABCABCABC Then I can use the length function to get the count because the maximum length in that case will only be like around 60. Thanks
... View more