Hello,
I have a dataset similar to the following that contains a text variable (a single word or phrase). Each string is in either English or French.
Is there a way to flag the English words?
data list;
input name $20.;
datalines;
Côté
Boucher
Fournier
Cats
how to register
morning
Thibeault
Martin
Vaudron
Girard
Hello
;
run;
Thank you!
This may not be possible with single words out of context, but you could try incorporating Python. Take a look at: https://www.probytes.net/blog/python-language-detection/
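For what it's worth, here is a minimal sketch of what a Python check could look like using only the standard library. It is not the method from the linked article: it simply treats a value as English when it has no accented letters and every word appears in a word list. The `ENGLISH_WORDS` set and the function names are made up for illustration; a real run would need a proper dictionary file.

```python
import unicodedata

# Hypothetical stand-in for a real English word list (assumption, not from the thread).
ENGLISH_WORDS = {"cats", "hello", "how", "morning", "register", "to"}

def has_accent(text):
    """Return True if any character carries a combining mark after NFD decomposition."""
    return any(unicodedata.combining(ch) for ch in unicodedata.normalize("NFD", text))

def looks_english(text):
    """Flag text as English only if it has no accents and every word is in the list."""
    words = text.lower().split()
    return bool(words) and not has_accent(text) and all(w in ENGLISH_WORDS for w in words)

names = ["Côté", "Boucher", "Fournier", "Cats", "how to register", "morning",
         "Thibeault", "Martin", "Vaudron", "Girard", "Hello"]

# Map each value to its flag, keeping the original order.
flags = {name: looks_english(name) for name in names}
for name, flag in flags.items():
    print(f"{name}: {flag}")
```

With this toy list only Cats, how to register, morning and Hello come back flagged. Anything missing from the word list is treated as not English, so the result is only as good as the dictionary behind it.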
Art, CEO, AnalystFinder.com
data list;
input name $20.;
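/* flag=1 when the value, reduced to its alphabetic characters, contains a letter outside a-z (for example, an accented letter) */
flag=prxmatch('/[^a-z]/i',compress(name,,'ka'))>0;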
datalines;
Côté
Boucher
Fournier
Cats
how to register
morning
Thibeault
Martin
Vaudron
Girard
Hello
;
run;
My French is pretty rusty, but I do remember that a moderate number of nouns are the same in both French and English.
So without an article (the / a, or le / la / les / un / une) or a similar clue, those are going to be very problematic.
Some adjectives, "grand" for example, are going to be worse.
I would hesitate to assign any name to a specific language: French and English have been interacting for so long that names go both ways (and spellings get butchered).
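To make the point about articles concrete, here is a rough Python sketch that only guesses a language when a phrase starts with an article, and otherwise gives up. The article lists and the function name `guess_from_article` are assumptions for illustration, and even this clue is shaky ("a" is also a French verb form).

```python
# Assumed article lists; deliberately small and not exhaustive.
FRENCH_ARTICLES = {"le", "la", "les", "un", "une"}
ENGLISH_ARTICLES = {"the", "a", "an"}

def guess_from_article(phrase):
    """Return 'fr' or 'en' based on a leading article, else 'unknown'."""
    words = phrase.lower().split()
    # A single word has no article to lean on, so no guess is made.
    if len(words) < 2:
        return "unknown"
    if words[0] in FRENCH_ARTICLES:
        return "fr"
    if words[0] in ENGLISH_ARTICLES:
        return "en"
    return "unknown"

for phrase in ["la table", "the table", "table", "Martin"]:
    print(f"{phrase}: {guess_from_article(phrase)}")
```

A bare noun like "table" or a name like "Martin" comes back as unknown, which is exactly the situation in the original dataset.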
Hi @parmis ,
I know this answer comes 2 years late :), but I felt that you might still derive some benefit from it, knowledge at the least. In January of this year, SAS released a language identification action as part of its Viya platform. Here are details on how it works:
regards,
Sundaresh