Hello,
I was given a dataset with about 6000 rows, unfortunately it was given to me in a text and was not formatted properly. I cannot use the @ or format options to create separate variables/columns for the one long string that was shared.
What I need to do is to search for a specific set of characters for each row of the database and create a new variable that populates with the search.
In this dataset all the elements are in one column. The values in each row are in order.
That is, each row will start with 4 numbers a space and then 8 numbers. After that I know that there well be a provider type and there are a certain set of provider types that are listed only, here we see "Auto Enroll Provider", "original Provider", "Collapsed Provider" and "Genesis Provider" there are a few more but not listed in the example below.
After the provider type the name of the provider is listed, I am not sure the length of the name it can be 20 characters, 35 characters or even longer.
Then follows a three digit number
After the number there is always a 02-02 followed by a date that is in a mmddyy10 format (although note that since all this is in one column the date is not formatted)
Following the date will be a four digit number with a "-" and then a small description, not sure the length of the description.
Finally there is a string of characters that always starts with "1-RT BE"
This is what the sample looks like:
4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF
4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA
1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF
1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD
1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ
1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF
And this is what I am trying to get to:
Rate | ID | Provider | NAME | ProvID | type | Date | amount | Desc | RT |
4278 | 01234567 | Auto Enroll Provider | FAMILY HEALTHCARE | 008 | 02-02 | 1/1/2018 | 131.23 | 0287-HSP OUT SR | 1-RT BE 009-COSNF |
4273 | 34876539 | Original Provider | BURKE HEALTH CENTER | 003 | 02-02 | 1/1/2018 | 187.01 | 0287-HSP OUT SR | 1-RT BE 010-COSNA |
1468 | 04876298 | Collapsed Provider | PHOENIX RISES OF NEW JESRSEY INC | 005 | 02-02 | 1/1/2018 | 17.67 | 0160-D&TCS | 1-RT BE 009-COSNF |
1540 | 10983746 | Genesis Provider | COLLEGE PARK | 016 | 02-02 | 1/1/2018 | 148.43 | 0160-D&TCS | 1-RT BE 013-INVRATECD |
1671 | 34098734 | Auto Enroll Provider | THIS CNTY DOH | 004 | 02-02 | 1/1/2018 | 138.7 | 0287-HSP OUT SR | 1-RETR TO BE 040-INVREQ |
1671 | 45398726 | Auto Enroll Provider | THAT CLINIC | 004 | 02-02 | 1/1/2018 | 138.7 | 0287-HSP OUT SR | 1-RT BE 009-COSNF |
Any suggestions would be great!
Thanks
Only one change was needed to @kiranv_'s code to parse provID:
data have; input line $ 1-115; datalines; 4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF 4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA 1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF 1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD 1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ 1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF ; data want (drop=line); set have; length rate $4; rate =scan(line,1,''); ID= scan(line,2,'') ; provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line); name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line); provid = prxchange('s/(.+?)(\s)([0-9]{3})(\s)(.+)/$3/',-1, line); type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line); date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line); amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line); desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line); rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line); run;
Art, CEO, AnalystFinder.com
Possibly, but you’re going to need to provide better and more sample,data, ideally enough to illustrate several variations of what you’re likely to encounter. Otherwise you’ll get code for a single case that won’t work on your actual data, but technically answers your question.
@sas_student1 wrote:
Hello,
@I was given a dataset with about 6000 rows, unfortunately it was given to me in a text and was not formatted properly. I cannot use the @ or format options to create separate variables/columns for the one long string that was shared.
What I need to do is to search for a specific set of characters for each row of the database and create a new variable that populates with the serach.
For example, If I have a database with one coloumn that has something like this:
1111 23456789 Hospital-name-that-can-be-any-length 003 1/1/13 dollar-amount-that-can-be-any-length
I would like to create a table from the above that would have 6 coloumns that would contain the following:
rate : 1111
ID: 23456789
hospname: Hospital-name-that-can-be-any-length
hospID: 003
Date: 1/1/13
dollar: dollar-amount-that-can-be-any-length
I think that we can use the PRX function but am not too familiar.
Any suggestions would be great!
Hope the example above makes sense.
Thanks
Thanks,
Here is a better example:
All the elements are in one column. The values in each row are in order.
That is, each row will start with 4 numbers a space and then 8 numbers. After that I know that there well be a provider type and there are a certain set of provider types that are listed only, here we see "Auto Enroll Provider", "original Provider", "Collapsed Provider" and "Genesis Provider" there are a few more but not listed in the example below.
After the provider type the name of the provider is listed, I am not sure the length of the name it can be 20 characters, 35 characters or even longer.
Then follows a three digit number
After the number there is always a 02-02 followed by a date that is in a mmddyy10 format (although note that since all this is in one column the date is not formatted)
Following the date will be a four digit number with a "-" and then a small description, not sure the length of the description.
Finally there is a string of characters that always starts with "1-RT BE"
This is what the sample looks like:
4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF
4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA
1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF
1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD
1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ
1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF
And this is what I am trying to get to:
Rate | ID | Provider | NAME | ProvID | type | Date | amount | Desc | RT |
4278 | 01234567 | Auto Enroll Provider | FAMILY HEALTHCARE | 008 | 02-02 | 1/1/2018 | 131.23 | 0287-HSP OUT SR | 1-RT BE 009-COSNF |
4273 | 34876539 | Original Provider | BURKE HEALTH CENTER | 003 | 02-02 | 1/1/2018 | 187.01 | 0287-HSP OUT SR | 1-RT BE 010-COSNA |
1468 | 04876298 | Collapsed Provider | PHOENIX RISES OF NEW JESRSEY INC | 005 | 02-02 | 1/1/2018 | 17.67 | 0160-D&TCS | 1-RT BE 009-COSNF |
1540 | 10983746 | Genesis Provider | COLLEGE PARK | 016 | 02-02 | 1/1/2018 | 148.43 | 0160-D&TCS | 1-RT BE 013-INVRATECD |
1671 | 34098734 | Auto Enroll Provider | THIS CNTY DOH | 004 | 02-02 | 1/1/2018 | 138.7 | 0287-HSP OUT SR | 1-RETR TO BE 040-INVREQ |
1671 | 45398726 | Auto Enroll Provider | THAT CLINIC | 004 | 02-02 | 1/1/2018 | 138.7 | 0287-HSP OUT SR | 1-RT BE 009-COSNF |
Suggestions?
As you hav a space delimiter between variables use scan function to assign values to the variables.
You can also read the variables by INPUT statement without the @ to assign starting position, but just
by assigning $ for char type variables, and preassign max length for each char type variable.
Thank you, it was pointed out to me that the example that I gave was not good as I would get a helpful response but not to my example and that was my mistake.
I have edited my original text with a better example.
You will note that I cannot use space delimiter, or @ because the dataset has provider names and descriptions that have space in them. Also the length for some of the elements are not known.
I hope my edited post helps clarify the type of dataset I have.
Thank you for your assistance.
You may need work by try and error.
Check next code and its results, the adapt length of variables if need:
data example;
infile datalines truncover;
input a_line $140;
data lines;
4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF
4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA
1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF
1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD
1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ
1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF
;
run;
data test;
set example;
length var1 provider name $50; /* assumed max any variable length is 50 characters */
length type $5 desc $20 rt $20; /* addapt to max length */
drop var1 len1 ix;
var1 = scan(a_line,1);
len1 = length(strip(var1));
rate = inputn(var1,best., len1);
a_line = substr(a_line, len1+1); /* rest of line */
/*-------------------------------------------------------------*/
var1 = scan(a_line,1);
len1 = length(strip(var1));
id = inputn(var1,best., len1);
a_line = substr(a_line, len1+1); /* rest of line */
/*-------------------------------------------------------------*/
ix = index(a_line, 'Provider');
len1 = ix + 8; /* add 'Provider' length */
Provider = substr(a_line,1,len1);
a_line = substr(a_line, len1+1); /* rest of line */
/*-------------------------------------------------------------*/
provid = input(scan(a_line,1),3.);
type = scan(a_line,2);
date = input(scan(a_line,3),ddmmyy8.);
amount = input(scan(a_line,4),best7.2;
/*-------------------------------------------------------------*/
len1 = 23; /* sum length of provider type date & amount & gaps - assumed fix length */
a_line = substr(a_line, len1+1); /* rest of line */
/*-------------------------------------------------------------*/
desc = scan(a_line,1);
rt = scan(a_line,2);
run;
i am no expert in regular expressions, this code can be definitely enhanced or there could be better code. I got the results required for your sample. as @Reeza explained more datapoint, the better the answer can be. Another important please set length of variable as you want(prxchange defaults the length to 200). I have fixed length for one variable
data have;
input line $ 1-115;
datalines;
4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF
4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA
1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF
1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD
1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ
1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF
;
data want;
set have;
length rate $4;
rate =scan(line,1,'');
ID= scan(line,2,'') ;
provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
*provider = prxchange('s/(.+?)(.+)/$2/', -1, line);
name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
provid = prxchange('s/(.+?)([0-9]{3})(.+)/$2/', -1, line);
type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;
Thank you Sooo much!!! this did exactly what I needed!
And this worked on the 99.99% of the data.
The one that it did not work on (oddly) was on the code that created provid. It was supposed to capture the three digit after the name variable (i.e. 008, 003, 005 etc.) but interestingly it captured and displayed three digits from the rate variable? suggestions on how to fix that?
BTW. I checked and double checked and there were no delimiters (space, tab or any other items ) that I could possibly use, but the PRXchange worked!!!!
I knew there were people here who were savvy with SAS that even if given a dataset that at this time and age one should not be getting we had people who could figure it out!!!
You all are awesome!!!
data have;
input line $ 1-115;
datalines;
4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF
4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA
1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF
1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD
1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ
1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF
;
data want;
set have;
length rate $4;
rate =scan(line,1,'');
ID= scan(line,2,'') ;
provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
*provider = prxchange('s/(.+?)(.+)/$2/', -1, line);
name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
provid = prxchange('s/(.+?)([0-9]{3})(.+)/$2/', -1, line);
type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;
Only one change was needed to @kiranv_'s code to parse provID:
data have; input line $ 1-115; datalines; 4278 01234567 Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR 1-RT BE 009-COSNF 4273 34876539 Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR 1-RT BE 010-COSNA 1468 04876298 Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS 1-RT BE 009-COSNF 1540 10983746 Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS 1-RT BE 013-INVRATECD 1671 34098734 Auto Enroll Provider THIS CNTY DOH 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 040-INVREQ 1671 45398726 Auto Enroll Provider THAT CLINIC 004 02-02 01/01/2018 138.70 0287-HSP OUT SR 1-RT BE 009-COSNF ; data want (drop=line); set have; length rate $4; rate =scan(line,1,''); ID= scan(line,2,'') ; provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line); name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line); provid = prxchange('s/(.+?)(\s)([0-9]{3})(\s)(.+)/$3/',-1, line); type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line); date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line); amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line); desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line); rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line); run;
Art, CEO, AnalystFinder.com
YES!!!!! Thank you! This Worked!!!
You are all the best!!
Thank you!
Unfortunately you have been provided with a garbage file format. What this essentially means is your going to have to write a parser which process the string out in the various variables. Me I would reject the file out of hand or increase costs associated with processing it. Anyways to get you started, we know that the first four characters will be a number and goto variable 1 so:
data want; set have; rate=input(substr(thestring,1,4),best.); /* this trims off the bit we just took */ thestring=substr(thestring,5); id=input(substr(thestring,1,8),best.); thestring=substr(thestring,9); ... run;
So that does the first two. From there on you just add your rules in one by one to get the various data parts out. Or you point out that it is 2018 and these unstructured mess files are just not acceptable. Just by putting a comma in between makes everyones life so much easier.
Large economy sized hint: Post your example data into a code box and not the main message window. Open a code box with the forum {I} icon. There is a chance that your data is actually fixed column, which is very easy to read when identified, but this forum will reformat pasted data and possibly removed the duplicate blanks likely in your original data.
OR copy the text from a text editor and not any other type program such as Word as they may "helpfully" reformat text without telling you.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.