Wow - that is impressive, as i have just mentioned to DanielSantos both approaches work great except for the char limit on the other solution, hence why i believe yours may be the answer ( no sight of macros anywhere 🙂 )
before i accept the solution , i wonder of there is limitations to the extent of my query.
i.e if we change the second keyword to later on in the text it would be an invalid answer, for example the below.
I am looking for a yellow Yacht, but if the yacht keyword was later on in the code we would still pick this up as found2 > found1 still, but if my desire was specifically Yellow Yacht, this would infact be an invalid result. ( i realise you have answered the solution im just wondering how far my query could go with horrid data)
data colour_text;
input text $char50.;
datalines;
i want a BLUE BALL please
did you see a blue car go by
i have owned a yellow SHIP but not a yacht before
id never buy a red ball
we want a turqoise car
she said it was a yellow plane toy
he drove in his green Porsche
i didnt like his black heart
he had a grey ball
i saw a white plane
;
run;
to put this in context - The issue is: The real data is actually seperating actual cities from streets ,ie i want Paris when its a city in the text, and another dataset to identify when its a street
21 rouge street Paris - correct result for city dataset
21 rouge street Paris invoice to 15 kruger street - would go street dataset as street>cityname (found2>found1)
now, one answer would be to seperate the data after one of the cities is found ,i.e city found move additional string into next field or ignore, but that would miss the street string etc (yacht car ball)
In summary , is it possible to identify only those variables where the street is immediately after the city name (or in our instance the type is right after the colour) only.
I imagine this would involve a calculation like the index value of found2 (26) is <= 6 of end of found1 (19) i.e Paris (s is 19) Street (t is 26)
given the number of possibilities this was why i was inclined to go with reg expressions, but as i have learnt your query is more viable.
Am i asking too much?
Hopefully i have not lost you with that, and i appreciate the assistance regardless, sensational solutions
As @Tom pointed out, the correct approach to circumvent the problem would be to remove the direct references to the macro var and replace them with the symget function (the former is resolved at compile time, the latest at runtime).
This will suffice:
data want;
set Colour_text;
if prxmatch("/[a-z]*("!!strip(symget('NAMES'))!!";)( car| ball| yacht)/i",text);
run;
This will solve the problem of the yellow yacht/(not yellow) yacht and you'll be good with a prxmatch expression up to 32K in length... But, I would not use this kind of approach with a lot of data as it's probably not the best performance wise (regular expresions are quite CPU intensive)... Give it a try to see if it works well for you.
More on symget: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202938.htm
Hope it helps.
Daniel Santos @ www.cgd.pt
Yes, thats a bit more in depth. I would start with do some data cleaning on the data you have. Identify all the rules you need to apply. Split the string up into street/city parts. Why for instance does city before street matter? Its quite hard to try to visualise from whats provided. Likelihood is regex or text miner would be better at parsing string data. Set out all your data cleaning activities, implement them into a nicely structured dataset, then start processing that data. At the moment we are just going round in cricles.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.