I want to extract information from a list of about 5000 observations. They are all text and are currently stored in Excel sheets.
Here are a few examples:
1. 20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K
2. PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K
3. PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K
4. CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop
I want to extract the concentrations of individual chemicals from these lists as separate variables, such as 20% PEG4000, 200 mM Sodiumchloride, and 50 mM MES (first three examples), as well as 45% saturated ammoniumsulfate and 50MM Potassiumphosphate (fourth example, which I find more challenging).
Could anyone provide some insights?
Thanks in advance
Hari
Hari,
Here is just one small piece of the puzzle. In any data of this sort, expect to find variations in spelling for the same compound. You'll find many spellings for Ammonium Sulfate, for example. If you want to categorize them, you'll need to be able to translate from what's in the data to a standard term. This often involves setting up a format to translate all known spellings, running it against the variables, and seeing what hasn't yet been translated by the format. Define the format using upper case only, and try to pass the upper-cased version of your variables through the format.
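As a rough illustration of that approach, here is a minimal sketch. The format name $chem, the spellings listed, the data set name, and the '*UNMAPPED*' label are all made up for the example; the real list would have to be built up from what actually appears in your data.

```sas
/* Hypothetical format: every known spelling (upper case) maps to one standard term */
proc format;
  value $chem
    'AMMONIUMSULFATE', 'AMMONIUM SULFATE', 'AMMONIUM SULPHATE' = 'AMMONIUM SULFATE'
    'SODIUMCHLORIDE', 'SODIUM CHLORIDE', 'NACL'                = 'SODIUM CHLORIDE'
    other                                                      = '*UNMAPPED*';
run;

/* Pass the upper-cased value through the format */
data mapped;
  length raw standard $40;
  input raw $40.;
  standard = put(upcase(strip(raw)), $chem.);
  datalines;
Ammoniumsulfate
ammonium sulphate
Sodiumchloride
Li Chloride
;

/* List the spellings the format does not cover yet */
proc freq data=mapped;
  where standard = '*UNMAPPED*';
  tables raw;
run;
```

Each time the PROC FREQ step shows new spellings, add them to the format and rerun until nothing is left unmapped.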
Good luck.
Given your 4 examples, what do you hope to have for each as a result?
What output do you want? You haven't given us an example.
Ksharp
Well, what I want is:
I need the percentages and concentrations of each of the chemical constituents.
Thanks
Hari
Given your 4 examples, please show the exact values you would expect to extract to represent the chemical constituents. The analysis part is easy. However, to write code to extract values, one has to know what is supposed to be extracted.
Are you looking to accomplish something like the following?
data have (drop=i x temp);
  length temp contents $80;
  input;
  record+1;  /* running record number */
  i=1;
  /* walk through the pieces delimited by commas or hyphens */
  do while (scan(_infile_,i,",-") ne "");
    contents=upcase(scan(_infile_,i,",-"));
    /* a piece holding two sentences: output the first, keep the second */
    x=index(contents,". THE");
    if x gt 0 then do;
      temp=substr(contents,x+2);
      contents=strip(substr(contents,1,x-1));
      output;
      contents=temp;
    end;
    /* a piece joining two items with AND: output the first, keep the second */
    x=index(contents," AND ");
    if x gt 0 then do;
      temp=substr(contents,x+5);
      contents=strip(substr(contents,1,x-1));
      output;
      contents=temp;
    end;
    /* long pieces: drop the leading narrative before the first digit */
    if length(contents) gt 30 then do;
      x=anydigit(contents);
      if x gt 0 then contents=substr(contents,x);
    end;
    contents=strip(contents);
    /* remove a trailing period */
    if substr(contents,length(contents),1) eq "."
      then contents=substr(contents,1,length(contents)-1);
    output;
    i+1;
  end;
cards;
20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K
PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K
PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K
CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop
;
proc freq data=have;
  tables contents;
run;
Is this what you are looking for?
data want(keep=found);
  input;
  ExpressionID = prxparse('/((\d|\.)+\s*(mm|mM|Mm|MM)\s*\w+)|((\d|\.)+\s*%\s*(\w|\s)+)/o');
  start = 1;
  stop = length(_infile_);
  call prxnext(ExpressionID, start, stop, _infile_, position, length);
  do while (position > 0);
    found = substr(_infile_, position, length);
    output;
    call prxnext(ExpressionID, start, stop, _infile_, position, length);
  end;
cards;
20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K
PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K
PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K
CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop
;
run;
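If the amount, unit, and chemical name are needed as separate variables rather than one string, the same PRXNEXT loop could be combined with capture groups and PRXPOSN. This is only an untested sketch under some assumptions: the data set name parts, the variables amount, unit, and chemical, and the pattern itself are hypothetical, and the pattern assumes each name is followed by a comma, a period, or the word AND. The two input lines are shortened versions of the examples in this thread.

```sas
data parts(keep=found amount unit chemical);
  length found chemical $60 unit $2;
  input;
  /* groups: 1 = number, 2 = unit (mM or %), 3 = name up to a comma, period, or AND */
  re = prxparse('/(\d+(?:\.\d+)?)\s*(mM|%)\s*([A-Za-z][\w ]*?)(?=\s*(?:,|\.|$|\sAND\s))/io');
  start = 1;
  stop = length(_infile_);
  call prxnext(re, start, stop, _infile_, pos, len);
  do while (pos > 0);
    found    = substr(_infile_, pos, len);
    /* PRXPOSN returns the capture groups of the most recent match */
    amount   = input(prxposn(re, 1, _infile_), best32.);
    unit     = prxposn(re, 2, _infile_);
    chemical = upcase(strip(prxposn(re, 3, _infile_)));
    output;
    call prxnext(re, start, stop, _infile_, pos, len);
  end;
  datalines;
20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0
50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0
;
run;
```

Entries with no stated amount, such as PEG3350 or LiCl in the second and third examples, would not match this pattern and would need separate handling.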
Ksharp