Solved: Re: how to serach for specific text within a string and create a new v...

sas_student1 · Posted 01-23-2018 08:22 PM

Hello,

I was given a dataset with about 6000 rows, unfortunately it was given to me in a text and was not formatted properly. I cannot use the @ or format options to create separate variables/columns for the one long string that was shared.

What I need to do is to search for a specific set of characters for each row of the database and create a new variable that populates with the search.

In this dataset all the elements are in one column. The values in each row are in order.

That is, each row will start with 4 numbers a space and then 8 numbers. After that I know that there well be a provider type and there are a certain set of provider types that are listed only, here we see "Auto Enroll Provider", "original Provider", "Collapsed Provider" and "Genesis Provider" there are a few more but not listed in the example below.

After the provider type the name of the provider is listed, I am not sure the length of the name it can be 20 characters, 35 characters or even longer.

Then follows a three digit number

After the number there is always a 02-02 followed by a date that is in a mmddyy10 format (although note that since all this is in one column the date is not formatted)

Following the date will be a four digit number with a "-" and then a small description, not sure the length of the description.

Finally there is a string of characters that always starts with "1-RT BE"

This is what the sample looks like:

4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF

4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA

1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF

1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD

1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ

1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF

And this is what I am trying to get to:

Rate	ID	Provider	NAME	ProvID	type	Date	amount	Desc	RT
4278	01234567	Auto Enroll Provider	FAMILY HEALTHCARE	008	02-02	1/1/2018	131.23	0287-HSP OUT SR	1-RT BE 009-COSNF
4273	34876539	Original Provider	BURKE HEALTH CENTER	003	02-02	1/1/2018	187.01	0287-HSP OUT SR	1-RT BE 010-COSNA
1468	04876298	Collapsed Provider	PHOENIX RISES OF NEW JESRSEY INC	005	02-02	1/1/2018	17.67	0160-D&TCS	1-RT BE 009-COSNF
1540	10983746	Genesis Provider	COLLEGE PARK	016	02-02	1/1/2018	148.43	0160-D&TCS	1-RT BE 013-INVRATECD
1671	34098734	Auto Enroll Provider	THIS CNTY DOH	004	02-02	1/1/2018	138.7	0287-HSP OUT SR	1-RETR TO BE 040-INVREQ
1671	45398726	Auto Enroll Provider	THAT CLINIC	004	02-02	1/1/2018	138.7	0287-HSP OUT SR	1-RT BE 009-COSNF

Any suggestions would be great!

Thanks

art297 · Posted 01-24-2018 02:40 PM

Only one change was needed to @kiranv_'s code to parse provID:

data have;
  input line $ 1-115;
  datalines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF
4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA
1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  
1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD
1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 
1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
data want (drop=line);
  set have;
  length rate $4;
  rate =scan(line,1,'');
  ID= scan(line,2,'') ;
  provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
  name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
  provid = prxchange('s/(.+?)(\s)([0-9]{3})(\s)(.+)/$3/',-1,  line);
  type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
  date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
  amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
  desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
  rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;

Art, CEO, AnalystFinder.com

View solution in original post

Reeza · Posted 01-23-2018 09:04 PM

Possibly, but you’re going to need to provide better and more sample,data, ideally enough to illustrate several variations of what you’re likely to encounter. Otherwise you’ll get code for a single case that won’t work on your actual data, but technically answers your question.

@sas_student1 wrote:

Hello,

@I was given a dataset with about 6000 rows, unfortunately it was given to me in a text and was not formatted properly. I cannot use the @ or format options to create separate variables/columns for the one long string that was shared.

What I need to do is to search for a specific set of characters for each row of the database and create a new variable that populates with the serach.

For example, If I have a database with one coloumn that has something like this:

1111 23456789 Hospital-name-that-can-be-any-length 003 1/1/13 dollar-amount-that-can-be-any-length

I would like to create a table from the above that would have 6 coloumns that would contain the following:

rate : 1111

ID: 23456789

hospname: Hospital-name-that-can-be-any-length

hospID: 003

Date: 1/1/13

dollar: dollar-amount-that-can-be-any-length

I think that we can use the PRX function but am not too familiar.

Any suggestions would be great!

Hope the example above makes sense.

Thanks

sas_student1 · Posted 01-23-2018 10:40 PM

Thanks,

Here is a better example:

All the elements are in one column. The values in each row are in order.

That is, each row will start with 4 numbers a space and then 8 numbers. After that I know that there well be a provider type and there are a certain set of provider types that are listed only, here we see "Auto Enroll Provider", "original Provider", "Collapsed Provider" and "Genesis Provider" there are a few more but not listed in the example below.

After the provider type the name of the provider is listed, I am not sure the length of the name it can be 20 characters, 35 characters or even longer.

Then follows a three digit number

After the number there is always a 02-02 followed by a date that is in a mmddyy10 format (although note that since all this is in one column the date is not formatted)

Following the date will be a four digit number with a "-" and then a small description, not sure the length of the description.

Finally there is a string of characters that always starts with "1-RT BE"

This is what the sample looks like:

4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF

4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA

1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF

1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD

1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ

1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF

And this is what I am trying to get to:

Rate	ID	Provider	NAME	ProvID	type	Date	amount	Desc	RT
4278	01234567	Auto Enroll Provider	FAMILY HEALTHCARE	008	02-02	1/1/2018	131.23	0287-HSP OUT SR	1-RT BE 009-COSNF
4273	34876539	Original Provider	BURKE HEALTH CENTER	003	02-02	1/1/2018	187.01	0287-HSP OUT SR	1-RT BE 010-COSNA
1468	04876298	Collapsed Provider	PHOENIX RISES OF NEW JESRSEY INC	005	02-02	1/1/2018	17.67	0160-D&TCS	1-RT BE 009-COSNF
1540	10983746	Genesis Provider	COLLEGE PARK	016	02-02	1/1/2018	148.43	0160-D&TCS	1-RT BE 013-INVRATECD
1671	34098734	Auto Enroll Provider	THIS CNTY DOH	004	02-02	1/1/2018	138.7	0287-HSP OUT SR	1-RETR TO BE 040-INVREQ
1671	45398726	Auto Enroll Provider	THAT CLINIC	004	02-02	1/1/2018	138.7	0287-HSP OUT SR	1-RT BE 009-COSNF

Suggestions?

Shmuel · Posted 01-24-2018 12:36 AM

As you hav a space delimiter between variables use scan function to assign values to the variables.

You can also read the variables by INPUT statement without the @ to assign starting position, but just

by assigning $ for char type variables, and preassign max length for each char type variable.

sas_student1 · Posted 01-24-2018 01:01 AM

Thank you, it was pointed out to me that the example that I gave was not good as I would get a helpful response but not to my example and that was my mistake.

I have edited my original text with a better example.

You will note that I cannot use space delimiter, or @ because the dataset has provider names and descriptions that have space in them. Also the length for some of the elements are not known.

I hope my edited post helps clarify the type of dataset I have.

Thank you for your assistance.

Shmuel · Posted 01-24-2018 06:41 AM

You may need work by try and error.

Check next code and its results, the adapt length of variables if need:

data example;
  infile datalines truncover;
  input a_line $140;
data lines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF

4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA

1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  

1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD

1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 

1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
run;

data test;
 set example;
     length var1 provider name $50; /* assumed max any variable length is 50 characters */
	 length type $5 desc $20 rt $20;  /* addapt to max length */
	 drop var1 len1 ix;
	 
     var1 = scan(a_line,1);
 	 len1 = length(strip(var1));
     rate = inputn(var1,best., len1);
	 a_line = substr(a_line, len1+1);  /* rest of line */
  /*-------------------------------------------------------------*/	 
	 var1 = scan(a_line,1);
 	 len1 = length(strip(var1));
     id   = inputn(var1,best., len1);
	 a_line = substr(a_line, len1+1);  /* rest of line */
  /*-------------------------------------------------------------*/	 
     ix = index(a_line, 'Provider');
	 len1 = ix + 8;  /* add 'Provider' length */
	 Provider = substr(a_line,1,len1);
	 a_line = substr(a_line, len1+1);  /* rest of line */
  /*-------------------------------------------------------------*/	 
     provid =  input(scan(a_line,1),3.);
	 type = scan(a_line,2);
	 date = input(scan(a_line,3),ddmmyy8.);
	 amount = input(scan(a_line,4),best7.2;
  /*-------------------------------------------------------------*/	 
     len1 = 23; /* sum length of provider type date & amount & gaps - assumed fix length */
	 a_line = substr(a_line, len1+1);  /* rest of line */
  /*-------------------------------------------------------------*/	 
     desc = scan(a_line,1);
	 rt = scan(a_line,2);
run;

kiranv_ · Posted 01-24-2018 01:18 AM

i am no expert in regular expressions, this code can be definitely enhanced or there could be better code. I got the results required for your sample. as @Reeza explained more datapoint, the better the answer can be. Another important please set length of variable as you want(prxchange defaults the length to 200). I have fixed length for one variable

data have;
input line $ 1-115;
datalines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF
4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA
1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  
1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD
1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 
1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
data want;
set have;
length rate $4;
rate =scan(line,1,'');
ID= scan(line,2,'') ;
provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
*provider = prxchange('s/(.+?)(.+)/$2/', -1, line);
name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
provid = prxchange('s/(.+?)([0-9]{3})(.+)/$2/', -1, line);
type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;

sas_student1 · Posted 01-24-2018 12:45 PM

Thank you Sooo much!!! this did exactly what I needed!

And this worked on the 99.99% of the data.

The one that it did not work on (oddly) was on the code that created provid. It was supposed to capture the three digit after the name variable (i.e. 008, 003, 005 etc.) but interestingly it captured and displayed three digits from the rate variable? suggestions on how to fix that?

BTW. I checked and double checked and there were no delimiters (space, tab or any other items ) that I could possibly use, but the PRXchange worked!!!!

I knew there were people here who were savvy with SAS that even if given a dataset that at this time and age one should not be getting we had people who could figure it out!!!

You all are awesome!!!

data have;
input line $ 1-115;
datalines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF
4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA
1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  
1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD
1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 
1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
data want;
set have;
length rate $4;
rate =scan(line,1,'');
ID= scan(line,2,'') ;
provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
*provider = prxchange('s/(.+?)(.+)/$2/', -1, line);
name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
provid = prxchange('s/(.+?)([0-9]{3})(.+)/$2/', -1, line);
type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;

art297 · Posted 01-24-2018 02:40 PM

Only one change was needed to @kiranv_'s code to parse provID:

data have;
  input line $ 1-115;
  datalines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF
4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA
1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  
1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD
1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 
1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
data want (drop=line);
  set have;
  length rate $4;
  rate =scan(line,1,'');
  ID= scan(line,2,'') ;
  provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
  name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
  provid = prxchange('s/(.+?)(\s)([0-9]{3})(\s)(.+)/$3/',-1,  line);
  type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
  date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
  amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
  desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
  rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;

Art, CEO, AnalystFinder.com

sas_student1 · Posted 01-24-2018 03:00 PM

YES!!!!! Thank you! This Worked!!!

You are all the best!!

Thank you!

RW9 · Posted 01-24-2018 04:37 AM

Unfortunately you have been provided with a garbage file format. What this essentially means is your going to have to write a parser which process the string out in the various variables. Me I would reject the file out of hand or increase costs associated with processing it. Anyways to get you started, we know that the first four characters will be a number and goto variable 1 so:

data want;
  set have;
  rate=input(substr(thestring,1,4),best.);
/* this trims off the bit we just took */
  thestring=substr(thestring,5);
  id=input(substr(thestring,1,8),best.);
  thestring=substr(thestring,9);
...
run;

So that does the first two. From there on you just add your rules in one by one to get the various data parts out. Or you point out that it is 2018 and these unstructured mess files are just not acceptable. Just by putting a comma in between makes everyones life so much easier.

ballardw · Posted 01-24-2018 11:26 AM

Large economy sized hint: Post your example data into a code box and not the main message window. Open a code box with the forum {I} icon. There is a chance that your data is actually fixed column, which is very easy to read when identified, but this forum will reformat pasted data and possibly removed the duplicate blanks likely in your original data.

OR copy the text from a text editor and not any other type program such as Word as they may "helpfully" reformat text without telling you.

how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Re: how to serach for specific text within a string and create a new variable

Registration is open

SAS Training: Just a Click Away