BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sas_student1
Quartz | Level 8

Hello,

 

I was given a dataset with about 6000 rows, unfortunately it was given to me in a text and was not formatted properly. I cannot use the @ or format options to create separate variables/columns for the one long string that was shared.

 

What I need to do is to search for a specific set of characters for each row of the database and create a new variable that populates with the search.

 

In this dataset all the elements are in one column. The values in each row are in order.

That is, each row will start with 4 numbers a space and then 8 numbers. After that I know that there well be a provider type and there are a certain set of provider types that are listed only, here we see "Auto Enroll Provider", "original Provider", "Collapsed Provider" and "Genesis Provider" there are a few more but not listed in the example below.

After the provider type the name of the provider is listed, I am not sure the length of the name it can be 20 characters, 35 characters or even longer.

Then follows a three digit number

After the number there is always a 02-02 followed by a date that is in a mmddyy10 format (although note that since all this is in one column the date is not formatted)

Following the date will be a four digit number with a "-" and then a small description, not sure the length of the description.

Finally there is a string of characters that always starts with "1-RT BE"

 

This is what the sample looks like:

4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF

4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA

1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  

1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD

1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 

1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF

 

And this is what I am trying to get to:

 

RateIDProviderNAMEProvIDtypeDateamountDescRT
427801234567Auto Enroll ProviderFAMILY HEALTHCARE00802-021/1/2018131.230287-HSP OUT SR1-RT BE 009-COSNF
427334876539Original ProviderBURKE HEALTH CENTER00302-021/1/2018187.010287-HSP OUT SR1-RT BE 010-COSNA
146804876298Collapsed ProviderPHOENIX RISES OF NEW JESRSEY INC00502-021/1/201817.670160-D&TCS1-RT BE 009-COSNF
154010983746Genesis ProviderCOLLEGE PARK01602-021/1/2018148.430160-D&TCS1-RT BE 013-INVRATECD
167134098734Auto Enroll ProviderTHIS CNTY DOH00402-021/1/2018138.70287-HSP OUT SR1-RETR TO BE 040-INVREQ
167145398726Auto Enroll ProviderTHAT CLINIC00402-021/1/2018138.70287-HSP OUT SR1-RT BE 009-COSNF

 

 

Any suggestions would be great!

 

 

Thanks

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

Only one change was needed to @kiranv_'s code to parse provID:

data have;
  input line $ 1-115;
  datalines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF
4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA
1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  
1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD
1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 
1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
data want (drop=line);
  set have;
  length rate $4;
  rate =scan(line,1,'');
  ID= scan(line,2,'') ;
  provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
  name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
  provid = prxchange('s/(.+?)(\s)([0-9]{3})(\s)(.+)/$3/',-1,  line);
  type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
  date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
  amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
  desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
  rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;

Art, CEO, AnalystFinder.com

 

View solution in original post

11 REPLIES 11
Reeza
Super User

Possibly, but you’re going to need to provide better and more sample,data, ideally enough to illustrate several variations of what you’re likely to encounter. Otherwise you’ll get code for a single case that won’t work on your actual data, but technically answers your question. 

 


@sas_student1 wrote:

Hello,

 

@I was given a dataset with about 6000 rows, unfortunately it was given to me in a text and was not formatted properly. I cannot use the @ or format options to create separate variables/columns for the one long string that was shared.

 

What I need to do is to search for a specific set of characters for each row of the database and create a new variable that populates with the serach.

 

For example, If I have a database with one coloumn that has something like this:

1111 23456789 Hospital-name-that-can-be-any-length 003 1/1/13 dollar-amount-that-can-be-any-length

 

I would like to create a table from the above that would have 6 coloumns that would contain the following:

rate : 1111

ID: 23456789 

hospname: Hospital-name-that-can-be-any-length

hospID: 003

Date: 1/1/13

dollar: dollar-amount-that-can-be-any-length 

 

I think that we can use the PRX function but am not too familiar.

Any suggestions would be great!

Hope the example above makes sense.

 

Thanks

 

 


 

sas_student1
Quartz | Level 8

Thanks,

 

Here is a better example:

All the elements are in one column. The values in each row are in order.

That is, each row will start with 4 numbers a space and then 8 numbers. After that I know that there well be a provider type and there are a certain set of provider types that are listed only, here we see "Auto Enroll Provider", "original Provider", "Collapsed Provider" and "Genesis Provider" there are a few more but not listed in the example below.

After the provider type the name of the provider is listed, I am not sure the length of the name it can be 20 characters, 35 characters or even longer.

Then follows a three digit number

After the number there is always a 02-02 followed by a date that is in a mmddyy10 format (although note that since all this is in one column the date is not formatted)

Following the date will be a four digit number with a "-" and then a small description, not sure the length of the description.

Finally there is a string of characters that always starts with "1-RT BE"

 

This is what the sample looks like:

4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF

4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA

1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  

1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD

1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 

1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF

 

And this is what I am trying to get to:

 

RateIDProviderNAMEProvIDtypeDateamountDescRT
427801234567Auto Enroll ProviderFAMILY HEALTHCARE00802-021/1/2018131.230287-HSP OUT SR1-RT BE 009-COSNF
427334876539Original ProviderBURKE HEALTH CENTER00302-021/1/2018187.010287-HSP OUT SR1-RT BE 010-COSNA
146804876298Collapsed ProviderPHOENIX RISES OF NEW JESRSEY INC00502-021/1/201817.670160-D&TCS1-RT BE 009-COSNF
154010983746Genesis ProviderCOLLEGE PARK01602-021/1/2018148.430160-D&TCS1-RT BE 013-INVRATECD
167134098734Auto Enroll ProviderTHIS CNTY DOH00402-021/1/2018138.70287-HSP OUT SR1-RETR TO BE 040-INVREQ
167145398726Auto Enroll ProviderTHAT CLINIC00402-021/1/2018138.70287-HSP OUT SR1-RT BE 009-COSNF

 

Suggestions?

Shmuel
Garnet | Level 18

As you hav a space delimiter between variables use scan function to assign values to the variables.

 

You can also read the variables by INPUT statement without the @ to assign starting position, but just 

by assigning $ for char type variables, and preassign max length for each char type variable.

sas_student1
Quartz | Level 8

Thank you, it was pointed out to me that the example that I gave was not good as I would get a helpful response but not to my example and that was my mistake.

I have edited my original text with a better example.

You will note that I cannot use space delimiter, or @ because the dataset has provider names and descriptions that have space in them. Also the length for some of the elements are not known.

 

I hope my edited post helps clarify the type of dataset I have.

 

Thank you for your assistance.

Shmuel
Garnet | Level 18

You may need work by try and error.

Check next code and its results, the adapt length of variables if need:

 

data example;
  infile datalines truncover;
  input a_line $140;
data lines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF

4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA

1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  

1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD

1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 

1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
run;

data test;
 set example;
     length var1 provider name $50; /* assumed max any variable length is 50 characters */
	 length type $5 desc $20 rt $20;  /* addapt to max length */
	 drop var1 len1 ix;
	 
     var1 = scan(a_line,1);
 	 len1 = length(strip(var1));
     rate = inputn(var1,best., len1);
	 a_line = substr(a_line, len1+1);  /* rest of line */
  /*-------------------------------------------------------------*/	 
	 var1 = scan(a_line,1);
 	 len1 = length(strip(var1));
     id   = inputn(var1,best., len1);
	 a_line = substr(a_line, len1+1);  /* rest of line */
  /*-------------------------------------------------------------*/	 
     ix = index(a_line, 'Provider');
	 len1 = ix + 8;  /* add 'Provider' length */
	 Provider = substr(a_line,1,len1);
	 a_line = substr(a_line, len1+1);  /* rest of line */
  /*-------------------------------------------------------------*/	 
     provid =  input(scan(a_line,1),3.);
	 type = scan(a_line,2);
	 date = input(scan(a_line,3),ddmmyy8.);
	 amount = input(scan(a_line,4),best7.2;
  /*-------------------------------------------------------------*/	 
     len1 = 23; /* sum length of provider type date & amount & gaps - assumed fix length */
	 a_line = substr(a_line, len1+1);  /* rest of line */
  /*-------------------------------------------------------------*/	 
     desc = scan(a_line,1);
	 rt = scan(a_line,2);
run;	  
kiranv_
Rhodochrosite | Level 12

i am no expert in regular expressions,  this code can be definitely enhanced or there could be better code. I got the results required for your sample. as @Reeza explained more datapoint, the better the answer can be. Another important please set length of variable as you want(prxchange defaults the length to 200). I have fixed length for one variable 

data have;
input line $ 1-115;
datalines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF
4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA
1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  
1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD
1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 
1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
data want;
set have;
length rate $4;
rate =scan(line,1,'');
ID= scan(line,2,'') ;
provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
*provider = prxchange('s/(.+?)(.+)/$2/', -1, line);
name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
provid = prxchange('s/(.+?)([0-9]{3})(.+)/$2/', -1, line);
type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;

 

sas_student1
Quartz | Level 8

Thank you Sooo much!!! this did exactly what I needed!

And this worked on the 99.99% of the data.

The one that it did not work on (oddly) was on the code that created provid. It was supposed to capture the three digit after the name variable (i.e. 008, 003, 005 etc.) but interestingly it captured and displayed three digits from the rate variable? suggestions on how to fix that?

 

BTW. I checked and double checked and there were no delimiters (space, tab or any other items ) that I could possibly use, but the PRXchange worked!!!!

 

I knew there were people here who were savvy with SAS that even if given a dataset that at this time and age one should not be getting we had people who could figure it out!!!

You all are awesome!!!

 

data have;
input line $ 1-115;
datalines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF
4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA
1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  
1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD
1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 
1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
data want;
set have;
length rate $4;
rate =scan(line,1,'');
ID= scan(line,2,'') ;
provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
*provider = prxchange('s/(.+?)(.+)/$2/', -1, line);
name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
provid = prxchange('s/(.+?)([0-9]{3})(.+)/$2/', -1, line);
type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;

 

 

art297
Opal | Level 21

Only one change was needed to @kiranv_'s code to parse provID:

data have;
  input line $ 1-115;
  datalines;
4278 01234567  Auto Enroll Provider FAMILY HEACARE 008 02-02 01/01/2018 131.23 0287-HSP OUT SR  1-RT BE 009-COSNF
4273 34876539  Original Provider BURKE CENTER 003 02-02 01/01/2018 187.01 0287-HSP OUT SR   1-RT BE 010-COSNA
1468 04876298  Collapsed Provider PHOENIX RISES OF NJ INC 005 02-02 01/01/2018 17.67 0160-D&TCS  1-RT BE 009-COSNF  
1540 10983746  Genesis Provider COLLEGE PARK 016 02-02 01/01/2018 148.43 0160-D&TCS  1-RT BE 013-INVRATECD
1671 34098734  Auto Enroll Provider THIS CNTY DOH  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 040-INVREQ 
1671 45398726  Auto Enroll Provider THAT CLINIC  004 02-02 01/01/2018 138.70 0287-HSP OUT SR  1-RT BE 009-COSNF
;
data want (drop=line);
  set have;
  length rate $4;
  rate =scan(line,1,'');
  ID= scan(line,2,'') ;
  provider = prxchange('s/(.+?)([A-Za-z\s]+?Provider)(.+)/$2/', -1, line);
  name= prxchange('s/(.+?[A-Za-z\s]+?Provider)([A-Z\s]+)(.+)/$2/', -1, line);
  provid = prxchange('s/(.+?)(\s)([0-9]{3})(\s)(.+)/$3/',-1,  line);
  type = prxchange('s/(.+?)(\d{2}\-\d{2})(.+)/$2/', -1, line);
  date = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4})(.+)/$2/', -1, line);
  amount = prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+)(.+)/$3/', -1, line);
  desc=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$4/', -1, line);
  rt=prxchange('s/(.+?)(\d{2}\/\d{2}\/\d{4}\s+)([0-9\.]+\s+)(.+?)(1\-.+)/$5/', -1, line);
run;

Art, CEO, AnalystFinder.com

 

sas_student1
Quartz | Level 8

YES!!!!! Thank you! This Worked!!!

 

You are all the best!!

 

Thank you!

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Unfortunately you have been provided with a garbage file format.  What this essentially means is your going to have to write a parser which process the string out in the various variables.  Me I would reject the file out of hand or increase costs associated with processing it.  Anyways to get you started, we know that the first four characters will be a number and goto variable 1 so:

data want;
  set have;
  rate=input(substr(thestring,1,4),best.);
/* this trims off the bit we just took */
  thestring=substr(thestring,5);
  id=input(substr(thestring,1,8),best.);
  thestring=substr(thestring,9);
...
run;

So that does the first two.  From there on you just add your rules in one by one to get the various data parts out.  Or you point out that it is 2018 and these unstructured mess files are just not acceptable.  Just by putting a comma in between makes everyones life so much easier.

ballardw
Super User

Large economy sized hint: Post your example data into a code box and not the main message window. Open a code box with the forum {I} icon. There is a chance that your data is actually fixed column, which is very easy to read when identified, but this forum will reformat pasted data and possibly removed the duplicate blanks likely in your original data.

OR copy the text from a text editor and not any other type program such as Word as they may "helpfully" reformat text without telling you.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1946 views
  • 1 like
  • 7 in conversation