Help using Base SAS procedures

Data Extraction

Occasional Contributor
Posts: 7

I want to extract information from a list of about 5,000 observations. They are all text and are currently stored in Excel sheets.

Here are a few examples:

1. 20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K

2. PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K

3. PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K

4. CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop

I want to extract the concentrations of the individual chemicals as separate variables: for example 20% PEG4000, 200 mM Sodiumchloride, and 50 mM MES from the first three examples, and 45% saturated ammoniumsulfate and 50MM Potassiumphosphate from the fourth (I find this one more challenging).

Could anyone provide some insights?

Thanks in advance

Hari

Super User
Posts: 5,085

Hari,

Here is just one small piece of the puzzle.  In any data of this sort, expect to find variations in spelling for the same compound.  You'll find many spellings for Ammonium Sulfate, for example.  If you want to categorize them, you'll need to be able to translate from what's in the data to a standard term.  This often involves setting up a format to translate all known spellings, running it against the variables, and seeing what hasn't yet been translated by the format.  Define the format using upper case only, and try to pass the upper-cased version of your variables through the format.
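
As a rough sketch of that idea: the format name `$chem`, the spellings listed, the data set name `mapped`, and the sample values below are all assumptions made up for illustration, not a complete dictionary. The label 'UNMAPPED' flags anything the format does not yet translate, so you can list those values and extend the format.

```sas
/* Hypothetical lookup: each known spelling (upper case) maps to one standard term */
proc format;
  value $chem
    'AMMONIUMSULFATE', 'AMMONIUM SULFATE', 'AMMONIUM SULPHATE' = 'AMMONIUM SULFATE'
    'SODIUMCHLORIDE', 'SODIUM CHLORIDE', 'NACL'                = 'SODIUM CHLORIDE'
    other                                                      = 'UNMAPPED';
run;

data mapped;
  length raw $40 standard $40;
  input raw $char40.;
  /* upper-case the value before passing it through the format */
  standard = put(upcase(strip(raw)), $chem.);
  datalines;
Ammoniumsulfate
ammonium sulphate
Sodiumchloride
LiCl
;
run;

/* spellings the format has not translated yet */
proc freq data=mapped;
  where standard = 'UNMAPPED';
  tables raw;
run;
```

Each pass through the PROC FREQ output suggests new spellings to add to the VALUE statement.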

Good luck.

PROC Star
Posts: 7,363

Given your 4 examples, what do you hope to have for each as a result?

Super User
Posts: 9,682

What output do you want? You haven't given us an example.

Ksharp

Occasional Contributor
Posts: 7

Well what I want is:

I need the percentages and concentrations of each of the chemical constituents.

Thanks

Hari

PROC Star
Posts: 7,363

Given your 4 examples, please show the exact values you would expect to extract to represent the chemical constituents.  The analysis part is easy.  However, to write code to extract values, one has to know what is supposed to be extracted.

PROC Star
Posts: 7,363

Are you looking to accomplish something like the following?

data have (drop=i x temp);
  length temp contents $80;
  input;
  record+1;  /* running record number */
  i=1;
  /* walk through the pieces delimited by commas or hyphens */
  do while (scan(_infile_,i,",-") ne "");
    contents=upcase(scan(_infile_,i,",-"));
    /* split at a sentence break such as "PH 7.0. THE WELL SOLUTION ..." */
    x=index(contents,". THE");
    if x gt 0 then do;
      temp=substr(contents,x+2);
      contents=strip(substr(contents,1,x-1));
      output;
      contents=temp;
    end;
    /* split "5MM DTT AND 30% ..." into two entries */
    x=index(contents," AND ");
    if x gt 0 then do;
      temp=substr(contents,x+5);
      contents=strip(substr(contents,1,x-1));
      output;
      contents=temp;
    end;
    /* for long free-text pieces, keep only the part from the first digit on */
    if length(contents) gt 30 then do;
      x=anydigit(contents);
      /* guard: ANYDIGIT returns 0 when the piece contains no digit */
      if x gt 0 then contents=substr(contents,x);
    end;
    contents=strip(contents);
    /* drop a trailing period */
    if substr(contents,length(contents),1) eq "."
      then contents=substr(contents,1,length(contents)-1);
    output;
    i+1;
  end;
  cards;
20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K
PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K
PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K
CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop
;

proc freq data=have;
  tables contents;
run;

Super User
Posts: 9,682

Is this what you are looking for?

data want(keep=found);
  input;
  /* match "<number> mM <word>" or "<number> % <words>"; the o option compiles the pattern once */
  ExpressionID = prxparse('/((\d|\.)+\s*(mm|mM|Mm|MM)\s*\w+)|((\d|\.)+\s*%\s*(\w|\s)+)/o');
  start = 1;
  stop = length(_infile_);
  /* find the first match, then keep searching until none is left */
  call prxnext(ExpressionID, start, stop, _infile_, position, length);
  do while (position > 0);
    found = substr(_infile_, position, length);
    output;
    call prxnext(ExpressionID, start, stop, _infile_, position, length);
  end;
  cards;
20% PEG4000, 200mM Sodiumchloride, 50mM MES, 0.3% dioxane, pH 6.0, VAPOR DIFFUSION, HANGING DROP, temperature 310K
PEG3350, LiCl, Tris, pH 9.0, VAPOR DIFFUSION, HANGING DROP, temperature 295.0K
PEG3350, MgOAc, Na Cacodylate, TCEP, pH 6.9, VAPOR DIFFUSION, SITTING DROP, temperature 280K
CRYSTALLISED FROM HANGING DROP WHICH ALSO CONTAINED 50MM Potassiumphosphate, 5MM DTT AND 30% SATURATED AMMONIUMSULFATE, PH 7.0. THE WELL SOLUTION WAS 45% SATURATED AMMONIUMSULFATE., vapor diffusion - hanging drop
;
run;
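
If each value in `found` then has to be split into separate variables, one possible second pass uses capture groups. This is only a sketch: the data set name `parts`, the variables `amount`, `unit`, and `chemical`, and the four sample rows are made up for illustration, and the pattern assumes the amount comes first, followed by mM or %.

```sas
data parts (drop=re);
  length found $60 unit $2 chemical $40;
  retain re;
  input found $char60.;
  /* compile once: amount, then mM or % (any case), then the chemical name */
  if _n_ = 1 then re = prxparse('/^\s*(\d+(?:\.\d+)?)\s*(mM|%)\s*(.+?)\s*$/i');
  if prxmatch(re, found) then do;
    amount   = input(prxposn(re, 1, found), best32.);
    unit     = prxposn(re, 2, found);
    if upcase(unit) = 'MM' then unit = 'mM';  /* normalise the case of the unit */
    chemical = upcase(prxposn(re, 3, found));
    output;
  end;
  datalines;
20% PEG4000
200mM Sodiumchloride
0.3% dioxane
45% SATURATED AMMONIUMSULFATE
;
run;

proc print data=parts;
run;
```

Rows that do not fit the assumed pattern are skipped rather than written out, so it is worth comparing row counts against the input.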




Ksharp
