hnam
Calcite | Level 5

I want to extract information from a list of about 5000 observations. They are all free text and are currently stored in Excel sheets.

Here are a few examples:

1. 20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K

2. PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K

3. PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K

4. CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop

I want to extract the concentrations of the individual chemicals as separate variables: for the first three examples, values like 20% PEG4000, 200 mM Sodiumchloride, and 50 mM MES; for the fourth example, 45% saturated ammoniumsulfate and 50MM Potassiumphosphate (I find this one more challenging).

Could anyone provide some insights?

Thanks in advance

Hari

7 REPLIES
Astounding
PROC Star

Hari,

Here is just one small piece of the puzzle.  In any data of this sort, expect to find variations in spelling for the same compound.  You'll find many spellings for Ammonium Sulfate, for example.  If you want to categorize them, you'll need to be able to translate from what's in the data to a standard term.  This often involves setting up a format to translate all known spellings, running it against the variables, and seeing what hasn't yet been translated by the format.  Define the format using upper case only, and try to pass the upper-cased version of your variables through the format.
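
A minimal sketch of that idea follows. The format name $chem, the spellings listed, and the '**UNMAPPED**' label are illustrative assumptions, not values taken from the real data; the actual list would have to be built up from whatever spellings turn up.

proc format;
  /* all known spellings, upper case only, mapped to one standard term */
  value $chem
    'AMMONIUMSULFATE', 'AMMONIUM SULFATE', 'AMMONIUM SULPHATE' = 'AMMONIUM SULFATE'
    'SODIUMCHLORIDE', 'SODIUM CHLORIDE', 'NACL'                = 'SODIUM CHLORIDE'
    other                                                      = '**UNMAPPED**';
run;

data check;
  length raw standard $40;
  input raw $char40.;
  /* pass the upper-cased value through the format */
  standard = put(upcase(strip(raw)), $chem.);
  datalines;
Ammoniumsulfate
ammonium sulphate
Sodiumchloride
LiCl
;

/* list the spellings the format has not translated yet */
proc freq data=check;
  where standard = '**UNMAPPED**';
  tables raw;
run;

Anything that shows up in the PROC FREQ output is a spelling to add to the format before the next pass.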

Good luck.

art297
Opal | Level 21

Given your 4 examples, what do you hope to have for each as a result?

Ksharp
Super User

What output do you want? You haven't given us an example.

Ksharp

hnam
Calcite | Level 5

Well what I want is:

I need the percentages and concentrations of each of the chemical constituents.

Thanks

Hari

art297
Opal | Level 21

Given your 4 examples, please show the exact values you would expect to extract to represent the chemical constituents.  The analysis part is easy.  However, to write code to extract values, one has to know what is supposed to be extracted.

art297
Opal | Level 21

Are you looking to accomplish something like the following?

data have (drop=i x temp);
  length temp contents $80;
  input;
  record+1;   /* sequential number of the input record */
  i=1;
  /* walk through the pieces delimited by commas or hyphens */
  do while (scan(_infile_,i,",-") ne "");
    contents=upcase(scan(_infile_,i,",-"));
    /* split at a sentence break such as "PH 7.0. THE WELL SOLUTION ..." */
    x=index(contents,". THE");
    if x gt 0 then do;
      temp=substr(contents,x+2);
      contents=strip(substr(contents,1,x-1));
      output;
      contents=temp;
    end;
    /* split "5MM DTT AND 30% ..." into two components */
    x=index(contents," AND ");
    if x gt 0 then do;
      temp=substr(contents,x+5);
      contents=strip(substr(contents,1,x-1));
      output;
      contents=temp;
    end;
    /* for long narrative pieces, keep only the text from the first digit on */
    if length(contents) gt 30 then do;
      x=anydigit(contents);
      if x gt 0 then contents=substr(contents,x);   /* guard: no digit found */
    end;
    contents=strip(contents);
    /* drop a trailing period */
    if substr(contents,length(contents),1) eq "."
      then contents=substr(contents,1,length(contents)-1);
    output;
    i+1;
  end;
  cards;
20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K
PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K
PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K
CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop
;

proc freq data=have;
  tables contents;
run;

Ksharp
Super User

Is this what you are looking for?

data want(keep=found);
input;
ExpressionID = prxparse('/((\d|\.)+\s*(mm|mM|Mm|MM)\s*\w+)|((\d|\.)+\s*%\s*(\w|\s)+)/o');
start = 1;
stop = length(_infile_);
call prxnext(ExpressionID, start, stop, _infile_, position, length);
do while (position > 0);
found = substr(_infile_, position, length);
output;
call prxnext(ExpressionID, start, stop, _infile_, position, length);
end;
cards;
20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K
PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K
PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K
CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop
; run;
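
If the amount, unit, and compound are needed as separate variables, as the original question asks, one possible extension is to add capture groups and read them with PRXPOSN. This is only a sketch under assumptions: the variable names amount, unit, and compound are hypothetical, and the pattern assumes a compound name is a single word, optionally preceded by SATURATED, so a name written with a space (for example "Sodium chloride") would be cut to its first word.

data parts(keep=record amount unit compound);
  length unit $2 compound $40;
  input;
  record + 1;
  /* group 1 = number, group 2 = unit (mM or %), group 3 = compound name */
  re = prxparse('/(\d+(?:\.\d+)?)\s*(mM|%)\s*((?:SATURATED\s+)?\w+)/io');
  start = 1;
  stop = length(_infile_);
  call prxnext(re, start, stop, _infile_, position, len);
  do while (position > 0);
    amount   = input(prxposn(re, 1, _infile_), best32.);
    unit     = upcase(prxposn(re, 2, _infile_));
    compound = upcase(strip(prxposn(re, 3, _infile_)));
    output;
    call prxnext(re, start, stop, _infile_, position, len);
  end;
cards;
20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K
CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop
;
run;

Each match becomes one row, so a later step (PROC TRANSPOSE, for instance) could turn the rows for a record into one variable per compound. Entries with no stated amount, such as "PEG3350, LiCl, Tris", produce no rows with this pattern.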




Ksharp
