01-23-2017 05:21 AM
I have a large dataset with dosage information as text strings. I am coverting these into numerical information.
Same dosages can be written in many different ways. Is there any way to convert this information to numeric is a "easy" way?
Example of dosages:
1 tablet 1 times daily (should correspond to the numeric value of 1)
1 TABLET. 1 TIMES DAILY (should correspond to the numeric value of 1)
1 TABL. 1 TIME DAILY (should correspond to the numeric value of 1)
1 tabl daily (should correspond to the numeric value of 1)
1 TABLET DAILY (should correspond to the numeric value of 1)
Can you use perl expressions? (No experience on this)
Thank for any help!
01-23-2017 05:34 AM
Why is dose information in a text field, that is madness. I would immeidately send it back to source and ask them to fix it. Sure you could try doing some string parsing on the data, but the sheer number of permutations makes it very difficult. Not to mention that some "doses" will require medical input as to what is what. In your example, why does 1 tablet daily = 1? Does it only last one day? If not then dose is not 1 but cumulation of number of days taken.
This is a good example of why, at database creation, every care should be taken to avoid free text fields.
If it was me and I had to do this, I would produce a listing of distinct values, send it to the medic/DM, and ask them to provide decodes for each of the values. Then read that file in and code based on responses.
01-23-2017 05:47 AM
Thank you RW9!
I agree that it is of course a bad idea having this information as string. But unfortunately that is how it is stored now. I have manually evaluated all dosages and transformed the into numeric. I am asking now since I would like to evaluate the perfomance of string parsing (or something else) compared to my manual evaluation. In my example 1 tablet daily was corresponding to the daily numeric dose of 1 but it says nothing about the duration of treatment.
01-23-2017 07:38 AM
I guess you could write code to try to standardize the text (i.e. covert "tabl" to "tablet", and "per day" to "daily"). Then parse the standardized code. But I think you should consiider using the results of your manual effort to construct a lookup table, which you can use on new data. That would leave only never-seen-before text to parse (and ultimately add to the lookup table).
01-23-2017 08:56 AM - edited 01-23-2017 08:57 AM
@bollibompa - you gave only one set of variations - set for tablets per day.
What other sets of text contents are available ?
I would work through next steps:
1) use function lowcase()
2) use function tranw() to replace variant words into standard keywords
3) use function findw() or index() to search for the keywords
and when found search for numeric amount