Re: How to find the positions of the same substring in a long string

owenwqp1 · Posted 11-29-2017 06:15 AM

Dear all, I want to substr substrings from very long strings into different variables, but for each substring,they appear several times. For example: "price=20.2,volume=1200;price=21,volume=1555;price=23,volume=1187;price=21,volume=1496" . I want to substr the first price to be variable price1, the second price to be price2, .....the 10th price to be price10. But sas functions like find, index can only return the position of the first "price", so anyone could tell me how to find the positions of other substring "price"? Thanks

RM6 · Posted 11-29-2017 06:39 AM

can you be more clear

Kurt_Bremser · Posted 11-29-2017 06:41 AM

So you have name/value pairs separated by commas, and those come in groups separated by semicolons.

Use countw() and scan() functions to dissect the string. Instead of creating a wide dataset with a number of variables where you need to know the number beforehand, create a long dataset where that number does not need to be known:

data want (keep=group price volume);
instring = 'price=20.2,volume=1200;price=21,volume=1555;price=23,volume=1187;price=21,volume=1496';
do group = 1 to countw(instring,';');
  substring = scan(instring,group,';');
  price = input(scan(scan(substring,1,','),2,'='),best.);
  volume = input(scan(scan(substring,2,','),2,'='),best.);
  output;
end;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

owenwqp1 · Posted 12-18-2017 02:28 AM

@Kurt_Bremser Thank you so much KurtBremser, my problem solved with your code! Cheers, Owen

RW9 · Posted 11-29-2017 06:41 AM

Post test data in the form of a datastep using the code window {i}!

As such untested:

data want (drop=p_i v_i wrd);
  set have;
  length wrd $200;
  array price {10};
  array value {10};
  p_i=1;
  v_i=1;
  do i=1 to countw(str,",");
    wrd=scan(str,i,",");
    if index(wrd,"price") > 0 then do;
      price{p_i}=input(scan(wrd,2,"="),best.);
      p_i=p_i+1;
    end;
    else do;
      value{v_i}=input(scan(wrd,2,"="),best.);
      v_i=v_i+1;
    end;
  end;
run;

Note I assumed no more than 10 of each, and all in one string delimited by comma, and individually delimited by =.

Is this coming from a text file, as you can read named variables direct from the text file, then just transpose them up - again examples are extremely important to get good answers!

ballardw · Posted 11-29-2017 10:51 AM

Is this an external file that you are attempting to read?

It may help to post a few lines of the file if so.

Reeza · Posted 11-29-2017 10:55 AM

@owenwqp1 wrote:
But sas functions like find, index can only return the position of the first "price", so anyone could tell me how to find the positions of other substring "price"? Thanks

That's incorrect. Note the third parameter to those functions is the Start Position. So once you've found it once, you can set the start position to the last found value and it searches from there on.

FINDW(string, word, character(s), modifier(s) <, start-position>)

FINDW(string, word, character(s), start-position <, modifier(s)>)

http://documentation.sas.com/?docsetId=lefunctionsref&docsetTarget=p16rdsa30vmm43n1ej4936nwa01t.htm&...

@owenwqp1 wrote:
Dear all, I want to substr substrings from very long strings into different variables, but for each substring,they appear several times. For example: "price=20.2,volume=1200;price=21,volume=1555;price=23,volume=1187;price=21,volume=1496" . I want to substr the first price to be variable price1, the second price to be price2, .....the 10th price to be price10. But sas functions like find, index can only return the position of the first "price", so anyone could tell me how to find the positions of other substring "price"? Thanks

owenwqp1 · Posted 12-06-2017 01:18 AM

Dear all, Thank you for your replies. Because I did not put my question very clearly, your answers could not completely solve my problem. My data is like the following: firmid date transactions 1001 1/27/2017 price=20.2,volume=1200;price=21,volume=1555;price=23,volume=1187;price=21,volume=1496 1002 1/28/2017 price=20.9,volume=1300;price=20.9,volume=1000;price=21,volume=2018" 1003 1/29/2017 price=21.9,volume=2200;price=22.5,volume=1700 number of transactions everyday is not the same, and the position of each transaction price and volume is not regular. I want to substring the first transction price to variable price1, volume to variable volume1.And the second and other transctions to variable price2 and varible volume2 and so on. Thanks a lot

owenwqp1 · Posted 12-06-2017 02:03 AM

Dear all, Thank you for your replies. Because I did not put my question very clearly, your answers could not completely solve my problem. I'm really sorry to make it worse in the last reply.My sample excel data is attached: number of transactions everyday is not the same, and the position of each transaction price and volume is not regular. I want to substring the first transction price to variable price1, volume to variable volume1.And the second and other transctions to variable price2 and varible volume2 and so on. By the way, anyone could tell me how to directly paste excel data in table form in my posts? Thanks a lot

RW9 · Posted 12-06-2017 04:17 AM

Please see my initial post:

Post test data in the form of a datastep in the post using a code window!

What does your data look like, are you doing a proc import from that Excel file? Is it all in one line? Did you try my code, as adjusting the number of arrays it should be quite close.

Example of how to provide test data:

data have;
  length longstr $2000;
  input no dt $ longstr $
datalines4;
1001 1/27/2017 price=20.2,volume=1200;price=21,volume=1555;price=23,volume=1187;price=21,volume=1496 
1002 1/28/2017 price=20.9,volume=1300;price=20.9,volume=1000;price=21,volume=2018
;;;;
run;

Basically you just need to loop over longstr scanning each section delimited by ;, then scan that part by , - and then assign each to array.

owenwqp1 · Posted 12-06-2017 10:35 AM

Dear all, Thank you for your replies. Because I did not put my question very clearly, your answers could not completely solve my problem. I'm really sorry to make it worse in my previous reply. My sample data is as follows: data sample; length transactions $ 1000; input firmid date transactions $ datalines4; 1001 1/27/2017 price=20.2,volume=1200;price=21,volume=1555;price=23,volume=1187;price=21,volume=1496 1002 1/28/2017 price=20.9,volume=1300;price=20.9,volume=1000;price=21,volume=2018 1003 1/29/2017 price=21.9,volume=2200;price=22.5,volume=1700 ......; run; The number of transactions everyday is not the same, and the position of each transaction price and volume is not regular. I want to substring the first transction price to variable price1, volume to variable volume1. And the second and other transctions to variable price2 and varible volume2 and so on. Thanks a lot

Kurt_Bremser · Posted 12-07-2017 03:09 AM

Based on your sample data (with some corrections to the code:

data have;
length firmid date 4 transactions $ 1000;
format date mmddyy10.;
input firmid date :mmddyy10. transactions;
datalines4;
1001 1/27/2017 price=20.2,volume=1200;price=21,volume=1555;price=23,volume=1187;price=21,volume=1496
1002 1/28/2017 price=20.9,volume=1300;price=20.9,volume=1000;price=21,volume=2018
1003 1/29/2017 price=21.9,volume=2200;price=22.5,volume=1700
;;;;
run;

data want (keep=firmid date group price volume);
set have;
do group = 1 to countw(transactions,';');
  substring = scan(transactions,group,';');
  price = input(scan(scan(substring,1,','),2,'='),best.);
  volume = input(scan(scan(substring,2,','),2,'='),best.);
  output;
end;
run;

Gives you a nice longitudinal dataset to work with.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

s_lassen · Posted 12-07-2017 05:15 AM

One possibility is to use PRX (Pearl Regular Expressions). It looks like you always have first the price, then the volume, separated by a comma:

Data want;
  set have;
  prxData=prxParse('/price=([0-9\.]+),volume=([0-9\.]+)/i');
  array prices(*) 8 price1-price10;
  array volumes(*) 8 volume1-volume10;
  start=1;
  do _N_=1 to 10;
    call prxNext(prxData,start,-1,transactions,pos,len);
    if len=0 then leave;
    prices(_N_)=input(prxposn(prxData,1,transactions),best32.);
    volumes(_N_)=input(prxposn(prxData,2,transactions),best32.);
    end;
  drop start pos len;
run;

A short explanation of the string in PRXPARSE: it first looks for "price=". The string that follows is put in a paranthesis, so that it is stored in a capture buffer, the contents of the buffer must be digits or decimal point. Then follows the text ",volume=" and a new capture buffer with a number. The final "i" tells the function not to care about upper/lower case.

The PRXNEXT call routine automatically updates the START variable, so that the next call will look for the next occurence of the pattern. When no more patterns are found, the POS variable (position of pattern) is set to 0 when no more occurrences are found.

The PRXPOSN function retrieves the content of the capture buffers.

Registration is open

SAS Training: Just a Click Away