Solved: Re: how to sum numbers from selected phrases?

Jacek2 · Posted 06-10-2022 02:56 AM

Hello All

I would like to sum up the numbers from the text below:

Kwota=237 + Kwota=30.87 + Kwota=292.74 + Kwota=292.74 + Kwota=300.87 that is:

Kwota=1154,22

In the example there are 5 phrases but their number may vary.

text:

[Forma sprzedazy=leasing, auto: Audi A 6, Kwota=237, Cena =123],[Forma sprzedazy=leasing, auto: Volvo, Kwota=30.87, Cena =10],[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],[Forma sprzedazy=leasing, auto: Porche, Kwota=300.87, Cena =152]

FreelanceReinh · Posted 06-10-2022 01:06 PM

Hello @Jacek2 and welcome to the SAS Support Communities!

This should get you started:

data have;
length text $1000;
text=cat('[Forma sprzedazy=leasing, auto: Audi A 6, Kwota=237, Cena =123],',
         '[Forma sprzedazy=leasing, auto: Volvo, Kwota=30.87, Cena =10],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: Porche, Kwota=300.87, Cena =152]');
output;
text=cat('[Forma sprzedazy=leasing, auto: Audi A 6, Kwota = 237, Cena =123],',
         '[Forma sprzedazy=leasing, auto: Volvo, Kwota 30.87, Cena =10],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota:292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: VW Golf  Kwota=292.74 Cena =134],',
         '[Forma sprzedazy=leasing, auto: Porsche, kwota=2,300.87, Cena =852],',
         '[Forma sprzedazy=leasing, auto: Horch, Kwota=60,000],',
         '[Forma sprzedazy=leasing, auto: N/A, Kwota=N/A, Cena =N/A],',
         '[Forma sprzedazy=leasing, auto: N/A, Kwota=, Cena =]');
output;
run;

data want(keep=text total);
set have;
if ~id then id+prxparse('/Kwota[=: ]*(\d+,?\d*\.?\d*)/i');
start=1;
stop=length(text);
link add;
do while(pos);
  link add;
end;
add:
  call prxnext(id, start, stop, text, pos, len);
  total=sum(total,input(prxposn(id, 1, text),?? comma32.));
return;
run;

I have added a second observation with eight "phrases" involving various deviations from the standard layout to demonstrate the robustness of the algorithm:

blanks around the equal sign after "Kwota"
blank or colon instead of equal sign after "Kwota"
missing commas around the "Kwota" portion of the phrase
lowercase "kwota"
thousands separator
missing "Cena" portion of the phrase
invalid numeric data ("N/A") after "Kwota="
no data after "Kwota="

The Perl regular expression '/Kwota[=: ]*(\d+,?\d*\.?\d*)/i' describes a pattern consisting of the word "Kwota" (case-insensitive), followed by any number of equal signs, colons or blanks, at least one digit and possibly: a comma, more digts, a decimal point and decimals. The CALL PRXNEXT routine performs the search repeatedly as long as the pattern is found (condition pos>0). The relevant part of the string TEXT matching the pattern (i.e., the digits with thousands separator and decimal point, if any) is extracted by the PRXPOSN function, converted to a numeric value (if possible) by the INPUT function and finally added to variable TOTAL by the SUM function.

Note that a number format like 1.234,56 in the data would require the COMMAX. informat instead of the COMMA. informat. A mixture of different formats (e.g., 123.45 in some places and 123,45 in others) is not supported by the suggested code. Numbers with two thousands separators (like 1,199,000) are not supported either.

View solution in original post

ballardw · Posted 06-10-2022 11:05 AM

Do you have a SAS data set?

If so show us the results of Proc Contents on the data set.

If you do not have a SAS data set the first thing will be to create one and should show use some of the source file so we can help create one.

Almost certainly you have to do something to separate the text into meaningful variables possibly such as Model, Kwota and Cena (what ever those are). Then the sum would be easy.

Jacek2 · Posted 06-13-2022 03:10 AM

Thank you for the answer, the problem has already been solved by colleagues below.

Alternatively:

data want;
  length kwota_total 8;
  set have;
  rxid = prxparse ('/kwota=(\d+\.?\d+)/i');
  start = 1;
  stop = length(text);
  do while (_n_);
    call prxnext (rxid, start, stop, text, position, length);
    if position < 1 then leave;
    kwota_total = sum(kwota_total, input (prxposn(rxid,1,text), best32.));
  end;
  drop rxid start stop position length;
run;

Tom · Posted 06-10-2022 12:55 PM

If the text is pretty simple it is not hard to parse it in SAS code.

Looks like you have values in brackets (square brackets if you speak British) separated by commas. Within that you have name value pairs separated by commas. Strangely some of the name value pairs are separated by = and others by ':'.

You could build a regular expression to parse, but for your simple example you can just use SCAN(), especially if all you want is the KWOTA (?quota?) values.

Let's start by creating a dataset from the information in your post so we have something to code with.

data have;
  text='[Forma sprzedazy=leasing, auto: Audi A 6, Kwota=237, Cena =123]'
    ||',[Forma sprzedazy=leasing, auto: Volvo, Kwota=30.87, Cena =10]'
    ||',[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134]'
    ||',[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134]'
    ||',[Forma sprzedazy=leasing, auto: Porche, Kwota=300.87, Cena =152]'
  ;
run;

Now we can loop over the name/value pairs in TEXT and when the name is KWOTA convert the value to a number and accumulate the amount.

data want;
  set have;
  dlm='[],=:';
  do word = 1 to countw(text,dlm) by 2;
    length name value $50 ;
    name=left(scan(text,word,dlm));
    value=left(scan(text,word+1,dlm));
    if name='Kwota' then kwota=sum(kwota,input(value,??32.));
  end;
  drop word name value dlm  ;
run;

Result

Obs     kwota

 1     1154.22

Jacek2 · Posted 06-13-2022 03:12 AM

Thank you Tom, it works perfectly!!

FreelanceReinh · Posted 06-10-2022 01:06 PM

Hello @Jacek2 and welcome to the SAS Support Communities!

This should get you started:

data have;
length text $1000;
text=cat('[Forma sprzedazy=leasing, auto: Audi A 6, Kwota=237, Cena =123],',
         '[Forma sprzedazy=leasing, auto: Volvo, Kwota=30.87, Cena =10],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: Porche, Kwota=300.87, Cena =152]');
output;
text=cat('[Forma sprzedazy=leasing, auto: Audi A 6, Kwota = 237, Cena =123],',
         '[Forma sprzedazy=leasing, auto: Volvo, Kwota 30.87, Cena =10],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota:292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: VW Golf  Kwota=292.74 Cena =134],',
         '[Forma sprzedazy=leasing, auto: Porsche, kwota=2,300.87, Cena =852],',
         '[Forma sprzedazy=leasing, auto: Horch, Kwota=60,000],',
         '[Forma sprzedazy=leasing, auto: N/A, Kwota=N/A, Cena =N/A],',
         '[Forma sprzedazy=leasing, auto: N/A, Kwota=, Cena =]');
output;
run;

data want(keep=text total);
set have;
if ~id then id+prxparse('/Kwota[=: ]*(\d+,?\d*\.?\d*)/i');
start=1;
stop=length(text);
link add;
do while(pos);
  link add;
end;
add:
  call prxnext(id, start, stop, text, pos, len);
  total=sum(total,input(prxposn(id, 1, text),?? comma32.));
return;
run;

I have added a second observation with eight "phrases" involving various deviations from the standard layout to demonstrate the robustness of the algorithm:

blanks around the equal sign after "Kwota"
blank or colon instead of equal sign after "Kwota"
missing commas around the "Kwota" portion of the phrase
lowercase "kwota"
thousands separator
missing "Cena" portion of the phrase
invalid numeric data ("N/A") after "Kwota="
no data after "Kwota="

The Perl regular expression '/Kwota[=: ]*(\d+,?\d*\.?\d*)/i' describes a pattern consisting of the word "Kwota" (case-insensitive), followed by any number of equal signs, colons or blanks, at least one digit and possibly: a comma, more digts, a decimal point and decimals. The CALL PRXNEXT routine performs the search repeatedly as long as the pattern is found (condition pos>0). The relevant part of the string TEXT matching the pattern (i.e., the digits with thousands separator and decimal point, if any) is extracted by the PRXPOSN function, converted to a numeric value (if possible) by the INPUT function and finally added to variable TOTAL by the SUM function.

Note that a number format like 1.234,56 in the data would require the COMMAX. informat instead of the COMMA. informat. A mixture of different formats (e.g., 123.45 in some places and 123,45 in others) is not supported by the suggested code. Numbers with two thousands separators (like 1,199,000) are not supported either.

Jacek2 · Posted 06-13-2022 03:16 AM

Thank you for your comprehensive answer, it works perfectly!!!

Registration is open

SAS Training: Just a Click Away