BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Jacek2
Obsidian | Level 7

Hello All

I would like to sum up the numbers from the text below:

Kwota=237 + Kwota=30.87 + Kwota=292.74 + Kwota=292.74 + Kwota=300.87 that is:

Kwota=1154,22

In the example there are 5 phrases but their number may vary.

text:

[Forma sprzedazy=leasing, auto: Audi A 6, Kwota=237, Cena =123],[Forma sprzedazy=leasing, auto: Volvo, Kwota=30.87, Cena =10],[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],[Forma sprzedazy=leasing, auto: Porche, Kwota=300.87, Cena =152]

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @Jacek2 and welcome to the SAS Support Communities!

 

This should get you started:

data have;
length text $1000;
text=cat('[Forma sprzedazy=leasing, auto: Audi A 6, Kwota=237, Cena =123],',
         '[Forma sprzedazy=leasing, auto: Volvo, Kwota=30.87, Cena =10],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: Porche, Kwota=300.87, Cena =152]');
output;
text=cat('[Forma sprzedazy=leasing, auto: Audi A 6, Kwota = 237, Cena =123],',
         '[Forma sprzedazy=leasing, auto: Volvo, Kwota 30.87, Cena =10],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota:292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: VW Golf  Kwota=292.74 Cena =134],',
         '[Forma sprzedazy=leasing, auto: Porsche, kwota=2,300.87, Cena =852],',
         '[Forma sprzedazy=leasing, auto: Horch, Kwota=60,000],',
         '[Forma sprzedazy=leasing, auto: N/A, Kwota=N/A, Cena =N/A],',
         '[Forma sprzedazy=leasing, auto: N/A, Kwota=, Cena =]');
output;
run;

data want(keep=text total);
set have;
if ~id then id+prxparse('/Kwota[=: ]*(\d+,?\d*\.?\d*)/i');
start=1;
stop=length(text);
link add;
do while(pos);
  link add;
end;
add:
  call prxnext(id, start, stop, text, pos, len);
  total=sum(total,input(prxposn(id, 1, text),?? comma32.));
return;
run;

I have added a second observation with eight "phrases" involving various deviations from the standard layout to demonstrate the robustness of the algorithm:

  • blanks around the equal sign after "Kwota"
  • blank or colon instead of equal sign after "Kwota"
  • missing commas around the "Kwota" portion of the phrase
  • lowercase "kwota"
  • thousands separator
  • missing "Cena" portion of the phrase
  • invalid numeric data ("N/A") after "Kwota=" 
  • no data after "Kwota="

The Perl regular expression '/Kwota[=: ]*(\d+,?\d*\.?\d*)/i' describes a pattern consisting of the word "Kwota" (case-insensitive), followed by any number of equal signs, colons or blanks, at least one digit and possibly: a comma, more digts, a decimal point and decimals. The CALL PRXNEXT routine performs the search repeatedly as long as the pattern is found (condition pos>0). The relevant part of the string TEXT matching the pattern (i.e., the digits with thousands separator and decimal point, if any) is extracted by the PRXPOSN function, converted to a  numeric value (if possible) by the INPUT function and finally added to variable TOTAL by the SUM function.

 

Note that a number format like 1.234,56 in the data would require the COMMAX. informat instead of the COMMA. informat. A mixture of different formats (e.g., 123.45 in some places and 123,45 in others) is not supported by the suggested code. Numbers with two thousands separators (like 1,199,000) are not supported either.

View solution in original post

6 REPLIES 6
ballardw
Super User

Do you have a SAS data set?

If so show us the results of Proc Contents on the data set.

If you do not have a SAS data set the first thing will be to create one and should show use some of the source file so we can help create one.

 

Almost certainly you have to do something to separate the text into meaningful variables possibly such as Model, Kwota and Cena (what ever those are). Then the sum would be easy.

Jacek2
Obsidian | Level 7

Thank you for the answer, the problem has already been solved by colleagues below.

Alternatively:

data want;
  length kwota_total 8;
  set have;
  rxid = prxparse ('/kwota=(\d+\.?\d+)/i');
  start = 1;
  stop = length(text);
  do while (_n_);
    call prxnext (rxid, start, stop, text, position, length);
    if position < 1 then leave;
    kwota_total = sum(kwota_total, input (prxposn(rxid,1,text), best32.));
  end;
  drop rxid start stop position length;
run;
Tom
Super User Tom
Super User

If the text is pretty simple it is not hard to parse it in SAS code.

Looks like you have values in brackets (square brackets if you speak British) separated by commas.  Within that you have name value pairs separated by commas.  Strangely some of the name value pairs are separated by = and others by ':'.

 

You could build a regular expression to parse, but for your simple example you can just use SCAN(), especially if all you want is the KWOTA (?quota?) values.

 

Let's start by creating a dataset from the information in your post so we have something to code with.

data have;
  text='[Forma sprzedazy=leasing, auto: Audi A 6, Kwota=237, Cena =123]'
    ||',[Forma sprzedazy=leasing, auto: Volvo, Kwota=30.87, Cena =10]'
    ||',[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134]'
    ||',[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134]'
    ||',[Forma sprzedazy=leasing, auto: Porche, Kwota=300.87, Cena =152]'
  ;
run;

Now we can loop over the name/value pairs in TEXT and when the name is KWOTA convert the value to a number and accumulate the amount.

data want;
  set have;
  dlm='[],=:';
  do word = 1 to countw(text,dlm) by 2;
    length name value $50 ;
    name=left(scan(text,word,dlm));
    value=left(scan(text,word+1,dlm));
    if name='Kwota' then kwota=sum(kwota,input(value,??32.));
  end;
  drop word name value dlm  ;
run;

Result

Obs     kwota

 1     1154.22

 

Jacek2
Obsidian | Level 7
Thank you Tom, it works perfectly!!
FreelanceReinh
Jade | Level 19

Hello @Jacek2 and welcome to the SAS Support Communities!

 

This should get you started:

data have;
length text $1000;
text=cat('[Forma sprzedazy=leasing, auto: Audi A 6, Kwota=237, Cena =123],',
         '[Forma sprzedazy=leasing, auto: Volvo, Kwota=30.87, Cena =10],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota=292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: Porche, Kwota=300.87, Cena =152]');
output;
text=cat('[Forma sprzedazy=leasing, auto: Audi A 6, Kwota = 237, Cena =123],',
         '[Forma sprzedazy=leasing, auto: Volvo, Kwota 30.87, Cena =10],',
         '[Forma sprzedazy=leasing, auto: VW Golf, Kwota:292.74, Cena =134],',
         '[Forma sprzedazy=leasing, auto: VW Golf  Kwota=292.74 Cena =134],',
         '[Forma sprzedazy=leasing, auto: Porsche, kwota=2,300.87, Cena =852],',
         '[Forma sprzedazy=leasing, auto: Horch, Kwota=60,000],',
         '[Forma sprzedazy=leasing, auto: N/A, Kwota=N/A, Cena =N/A],',
         '[Forma sprzedazy=leasing, auto: N/A, Kwota=, Cena =]');
output;
run;

data want(keep=text total);
set have;
if ~id then id+prxparse('/Kwota[=: ]*(\d+,?\d*\.?\d*)/i');
start=1;
stop=length(text);
link add;
do while(pos);
  link add;
end;
add:
  call prxnext(id, start, stop, text, pos, len);
  total=sum(total,input(prxposn(id, 1, text),?? comma32.));
return;
run;

I have added a second observation with eight "phrases" involving various deviations from the standard layout to demonstrate the robustness of the algorithm:

  • blanks around the equal sign after "Kwota"
  • blank or colon instead of equal sign after "Kwota"
  • missing commas around the "Kwota" portion of the phrase
  • lowercase "kwota"
  • thousands separator
  • missing "Cena" portion of the phrase
  • invalid numeric data ("N/A") after "Kwota=" 
  • no data after "Kwota="

The Perl regular expression '/Kwota[=: ]*(\d+,?\d*\.?\d*)/i' describes a pattern consisting of the word "Kwota" (case-insensitive), followed by any number of equal signs, colons or blanks, at least one digit and possibly: a comma, more digts, a decimal point and decimals. The CALL PRXNEXT routine performs the search repeatedly as long as the pattern is found (condition pos>0). The relevant part of the string TEXT matching the pattern (i.e., the digits with thousands separator and decimal point, if any) is extracted by the PRXPOSN function, converted to a  numeric value (if possible) by the INPUT function and finally added to variable TOTAL by the SUM function.

 

Note that a number format like 1.234,56 in the data would require the COMMAX. informat instead of the COMMA. informat. A mixture of different formats (e.g., 123.45 in some places and 123,45 in others) is not supported by the suggested code. Numbers with two thousands separators (like 1,199,000) are not supported either.

Jacek2
Obsidian | Level 7
Thank you for your comprehensive answer, it works perfectly!!!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 624 views
  • 3 likes
  • 4 in conversation