Text mining and content categorization

Splitting a string from unstructured data into columns

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 12
Accepted Solution

Splitting a string from unstructured data into columns

Hi all, 

 

I am new to SAS (7.1 Enterprise Guide, 64 bit) and have some unstructed data that was originally stored in a database and looks like this

Presentation1.pptx

 

When I use enterpris guide and bring the data in from the library it shows up like this:

 

pic 2.png

 

When i expand the column it turns into this

 

pic3.png

 

My desired out put is

pic4.png

 

I wont go into what ive tried since i am a beginner and it was really basic code.  I have attaced a excel file with example data so everyone can work it on there machine if desired.  This is my first post so if Violated etiquette please let me know.  Always trying to get better.

 

Thanks


Accepted Solutions
Solution
‎01-05-2017 07:18 PM
PROC Star
Posts: 7,417

Re: Splitting a string from unstructured data into columns

If it turns out that you only want/need the results as you showed them in your example, adding a code node with the following should suffice (assuming, that is, your file is called work.have and you want to end up with a file called work.want):

 

data want;
set have;
array var(39) $80.;
_i=1;
do while (scan(comments,_i,'0A'x) ne '');
var(_i)=scan(comments,_i,'0A'x);
_i+1;
end;
run;

 

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
PROC Star
Posts: 7,417

Re: Splitting a string from unstructured data into columns

Your comment column contains all of the desired data fields, separated by Line Feeds ('0A'x or '0D'x depending upon your computer's operating system).

 

Instead of var1,var2, etc. I'd think you'd rather end up with variables like date, start_time, etc.

 

Which do you prefer?

 

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 12

Re: Splitting a string from unstructured data into columns

Yes you are correct that would be the ideal situation.  I was hoping to get any help and didnt want to ask for too much

Occasional Contributor
Posts: 12

Re: Splitting a string from unstructured data into columns

What also may be very much appreciated  is to have the output look likepic5.png

Solution
‎01-05-2017 07:18 PM
PROC Star
Posts: 7,417

Re: Splitting a string from unstructured data into columns

If it turns out that you only want/need the results as you showed them in your example, adding a code node with the following should suffice (assuming, that is, your file is called work.have and you want to end up with a file called work.want):

 

data want;
set have;
array var(39) $80.;
_i=1;
do while (scan(comments,_i,'0A'x) ne '');
var(_i)=scan(comments,_i,'0A'x);
_i+1;
end;
run;

 

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 12

Re: Splitting a string from unstructured data into columns

I used the code below and got one error - Array subscript out of range at line 28 column1

 

data work.comments;
set work.testdata;
array var(39) $80.;
_i=1;
do while (scan(comments,_i,'0A'x) ne '');
var(_i)=scan(comments,_i,'0A'x);
_i+1;
end;
run;


Log

                                                      The SAS System                            10:13 Thursday, January 5, 2017

1          ;*';*";*/;quit;run;
2          OPTIONS PAGENO=MIN;
3          %LET _CLIENTTASKLABEL='Program (6)';
4          %LET _CLIENTPROJECTPATH='O:\permanent\PTIM\DataScience\Advanced Analytics Team\Lee
4        ! Projects\TextStringExtractionTest1.egp';
5          %LET _CLIENTPROJECTNAME='TextStringExtractionTest1.egp';
6          %LET _SASPROGRAMFILE=;
7          
8          ODS _ALL_ CLOSE;
9          OPTIONS DEV=ACTIVEX;
10         GOPTIONS XPIXELS=0 YPIXELS=0;
11         FILENAME EGSR TEMP;
12         ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
13             STYLE=HtmlBlue
14             STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HtmlBlue.css")
15             NOGTITLE
16             NOGFOOTNOTE
17             GPATH=&sasworklocation
18             ENCODING=UTF8
19             options(rolap="on")
20         ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
21         
22         GOPTIONS ACCESSIBLE;
23         data work.comments;
24         set work.testdata;
25         array var(39)
NOTE: The array var has the same name as a SAS-supplied or user-defined function.  Parentheses following this name are treated as 
      array references and not function references.
25       !               $80.;
26         _i=1;
27         do while (scan(comments,_i,'0A'x) ne '');
28         var(_i)=scan(comments,_i,'0A'x);
29         _i+1;
30         end;
31         run;

ERROR: Array subscript out of range at line 28 column 1.
API_IPO_UWI=4201334737 MOC_UWI=420133473700 SUB_ASSET=PLEASANTON ASSET=EAGLE FORD WELL_NAME=74 RANCH SPECIAL UNIT EAST 2H
MOC_WELL_NAME=74 RANCH SPECIAL UNIT EAST 2H
COMMENTS=Stage Number		1Date			13-Sep-13	 	 	 	 	 	 Start Time		10:43	 	 	 End Time			13:28	 	 	 Pump Time		2:45	hr:min	 	
 Initial Wellhead Pressure	4,452	psi	 	 Max Treating Pressure	9,493	psi	 	 Avg Treating Pressure	8,686	psi	 	 Max Slurry Rat
e		54.8	bpm	 	 Avg Slurry Rate		50.1	bpm	 	 Avg HHP			10,662	hp	 	 15% HCl Acid		6,010	gal	143	bblFR-66 Water		48,790	gal	1,
162	bbl25# Waterfrac (19 cP)	11,866	gal	283	bbl25# Hybor G (19 cP)	350,691	gal	8,350	bblClean Volume		417,357	gal	9,937	bblS
lurry Volume		424,979	gal	10,119	bblPulsed Slurry Volume	21,215	gal	505	bblMax Proppant Conc.	3.07	lb/gal	 	 BH Max Proppant C
onc.	3.07	lb/gal	 	 30/50 Premium White	87,620	lb	876.20	sack20/40 Premium White	80,640	lb	806.40	sackSandwedge ABC Plus Mass 
Coated	87,620	lb	876.20	sackSandwedge ABC Max Mass Coated	80,640	lb	806.40	sackProppant Mass Pumped	168,260	lb	1,682.60	sackPr
oppant Mass Counted	167,185	lb	1,671.85	sackBreakdown Pressure	N/A	psi	 	 Rate @ Breakdown	N/A	bpm	 	 Ball on Seat	N/A	bbls	N/
A	galsAcid Pressure Drop		1,018	psi	 	 STP Before Acid		9301	psi	 	 STP After Acid		8283	psi	 	 Final ISIP			6,228	psi	 	 
Final FG			0.994	psi/ft	 	 5 Min. Shut-In		5,030	psi	240	psi/minPad Volume		124,099	gal	2,955	bblXL Pad Volume		112,233	gal	2,
672	bblClean Volume for PLF	259,992	gal	6,190	bblProppant Pumped	99	%	 	  STAGE_FRAC_TYPE=  INTERVAL_NUM=1
var1=Stage Number		1 var2=Date			13-Sep-13	 	 	  var3=	 	 	  var4=Start Time		10:43	 	 	  var5=End Time			13:28	 	 	 
var6=Pump Time		2:45	hr:min	 	  var7=Initial Wellhead Pressure	4,452	psi	 	  var8=Max Treating Pressure	9,493	psi	 	 
var9=Avg Treating Pressure	8,686	psi	 	  var10=Max Slurry Rate		54.8	bpm	 	  var11=Avg Slurry Rate		50.1	bpm	 	 
var12=Avg HHP			10,662	hp	 	  var13=15% HCl Acid		6,010	gal	143	bbl var14=FR-66 Water		48,790	gal	1,162	bbl
var15=25# Waterfrac (19 cP)	11,866	gal	283	bbl var16=25# Hybor G (19 cP)	350,691	gal	8,350	bbl
var17=Clean Volume		417,357	gal	9,937	bbl var18=Slurry Volume		424,979	gal	10,119	bbl
var19=Pulsed Slurry Volume	21,215	gal	505	bbl var20=Max Proppant Conc.	3.07	lb/gal	 	 
2                                                          The SAS System                            10:13 Thursday, January 5, 2017

var21=BH Max Proppant Conc.	3.07	lb/gal	 	  var22=30/50 Premium White	87,620	lb	876.20	sack
var23=20/40 Premium White	80,640	lb	806.40	sack var24=Sandwedge ABC Plus Mass Coated	87,620	lb	876.20	sack
var25=Sandwedge ABC Max Mass Coated	80,640	lb	806.40	sack var26=Proppant Mass Pumped	168,260	lb	1,682.60	sack
var27=Proppant Mass Counted	167,185	lb	1,671.85	sack var28=Breakdown Pressure	N/A	psi	 	  var29=Rate @ Breakdown	N/A	bpm	 	 
var30=Ball on Seat	N/A	bbls	N/A	gals var31=Acid Pressure Drop		1,018	psi	 	  var32=STP Before Acid		9301	psi	 	 
var33=STP After Acid		8283	psi	 	  var34=Final ISIP			6,228	psi	 	  var35=Final FG			0.994	psi/ft	 	 
var36=5 Min. Shut-In		5,030	psi	240	psi/min var37=Pad Volume		124,099	gal	2,955	bbl var38=XL Pad Volume		112,233	gal	2,672	bbl
var39=Clean Volume for PLF	259,992	gal	6,190	bbl _i=40 _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 1 observations read from the data set WORK.TESTDATA.
WARNING: The data set WORK.COMMENTS may be incomplete.  When this step was stopped there were 0 observations and 49 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      user cpu time       0.01 seconds
      system cpu time     0.01 seconds
      memory              1084.21k
      OS Memory           20904.00k
      Timestamp           01/05/2017 05:45:15 PM
      Step Count                        46  Switch Count  126
      Page Faults                       0
      Page Reclaims                     445
      Page Swaps                        0
      Voluntary Context Switches        508
      Involuntary Context Switches      2
      Block Input Operations            0
      Block Output Operations           280
      

32         
33         GOPTIONS NOACCESSIBLE;
34         %LET _CLIENTTASKLABEL=;
35         %LET _CLIENTPROJECTPATH=;
36         %LET _CLIENTPROJECTNAME=;
37         %LET _SASPROGRAMFILE=;
38         
39         ;*';*";*/;quit;run;
40         ODS _ALL_ CLOSE;
41         
42         
43         QUIT; RUN;
44         

 

PROC Star
Posts: 7,417

Re: Splitting a string from unstructured data into columns

Then your actual workbook contains more than 39 fields in the comment section. Increase the statement '

array var(39) $80.; 

 to something like:

 

array var(45) $80.;

 

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 12

Re: Splitting a string from unstructured data into columns

Art 297 that did the trick,  This is an awesome forum, thanks so much, i worked on this for over 8 hours today,  I cant believe This worked out so well.  Thanks again.  The quick responses are very much appreciated.

PROC Star
Posts: 7,417

Re: Splitting a string from unstructured data into columns

[ Edited ]

I don't have time to write the code for you but, if you want your data in the other format we discussed, here is a start to the code:

 

data want;
  set have;
  length _var $80;
  _i=1;
  do while (scan(comments,_i,'0A'x) ne '');
    _var=strip(translate(strip(scan(comments,_i,'0A'x)),'','09'x));
    if substr(_var,1,12) eq 'Stage Number' then Stage_Number=substr(_var,13);
    else if substr(_var,1,4) eq 'Date' then Date=substr(_var,5);
    else if substr(_var,1,10) eq 'Start Time' then Start_Time=substr(_var,11);
    else if substr(_var,1,8) eq 'End Time' then End_Time=substr(_var,9);
    else if substr(_var,1,9) eq 'Pump Time' then Pump_Time=substr(_var,10);
    else if substr(_var,1,25) eq 'Initial Wellhead Pressure' then Initial_Wellhead_Pressure=substr(_var,26);
    else if substr(_var,1,21) eq 'Max Treating Pressure' then Max_Treating_Pressure=substr(_var,22);
    else if substr(_var,1,21) eq 'Avg Treating Pressure' then Avg_Treating_Pressure=substr(_var,22);
    else if substr(_var,1,15) eq 'Max Slurry Rate' then Max_Slurry_Rate=substr(_var,16);
    else if substr(_var,1,15) eq 'Avg Slurry Rate' then Avg_Slurry_Rate=substr(_var,16);
    else if substr(_var,1,7) eq 'Avg HHP' then Avg_HHP=substr(_var,8);
    else if substr(_var,1,12) eq '15% HCl Acid' then Fifteen_Percent_HCl_Acid=substr(_var,13);
    else if substr(_var,1,11) eq 'FR-66 Water' then FR_66_Water=substr(_var,12);
    else if substr(_var,1,21) eq '25# Waterfrac (19 cP)' then _25_Pound_Waterfrac_19_cP=substr(_var,22);
    _i+1;
  end;
run;

 

HTH,

Art, CEO, AnalystFinder.com

 

Occasional Contributor
Posts: 12

Re: Splitting a string from unstructured data into columns

Art I continued on with your code and ran into an issue i could not resolve.  

else if substr(_var,1,12) eq '15% HCl Acid' then Acid_15_Perc=substr(_var,13);
else if substr(_var,1,7) eq '15% HCl' then Acid_15_Perc=substr(_var,8);
else if substr(_var,1,11) eq 'Acid (bbls)' then Acid_15_Perc=substr(_var,13);
else if substr(_var,1,10) eq 'Acid(bbls)' then Acid_15_Perc=substr(_var,11);
else if substr(_var,1,5) eq 'Acid:' then Acid_15_Perc=substr(_var,6);
else if substr(_var,1,4) eq 'Acid' then Acid_15_Perc=substr(_var,5);

The last line takes over for the two prevoius lines because they contain the word Acid in them.  Thus it cuts the data strings off at 5 for all 3.  What would you suggest.  Thanks

PROC Star
Posts: 7,417

Re: Splitting a string from unstructured data into columns

The statements are encountered from top to bottom and, within each record, stop as soon as a condition is met. Thus the lower conditions don't override conditions that preceded them.

 

My guess is that your spelling/typing is off on some of the entries.  e.g., are the characters in the text "HCl" or "HC1" or "HCL"?

 

Art, CEO, AnalystFinder.com

Occasional Contributor
Posts: 12

Re: Splitting a string from unstructured data into columns

Art it looks like it is grabbing the pressure drop line and overwriting the value desired in the 15% HCL Acid.

The reason i added Acid is because on some of the other entries the 15% HCL Acid is just listed as Acid

 

 

Pic 1.png

PROC Star
Posts: 7,417

Re: Splitting a string from unstructured data into columns

Post all of your code and your data file (or at least a subset of it .. including at least two sets of records).

 

Occasional Contributor
Posts: 12

Re: Splitting a string from unstructured data into columns

THis is the code i am using.  I just continued with what you posted earlier

 

 

data work.comments;
set WORK.TestData;

length _var $80;
_i=1;
do while (scan(comments,_i,'0A'x) ne '');
_var=strip(translate(strip(scan(comments,_i,'0A'x)),'','09'x));
if substr(_var,1,12) eq 'Stage Number' then Stage_Number=substr(_var,13);
else if substr(_var,1,10) eq 'Stage Name' then Stage_Number=substr(_var,12);
else if substr(_var,1,5) eq 'Stage' then Stage_Number=substr(_var,6);
else if substr(_var,1,19) eq 'Adams-Tipton 2H Stg' then Stage_Number=substr(_var,20);
else if substr(_var,1,19) eq 'Adams Tipton 3H Stg' then Stage_Number=substr(_var,20);

else if substr(_var,1,4) eq 'Date' then Date=substr(_var,5);

else if substr(_var,1,10) eq 'Start Time' then Start_Time=substr(_var,11);

else if substr(_var,1,8) eq 'End Time' then End_Time=substr(_var,9);

else if substr(_var,1,14) eq 'Open WHP (psi)' then Initial_Wellhead_Pressure=substr(_var,16);
else if substr(_var,1,8) eq 'Open WHP' then Initial_Wellhead_Pressure=substr(_var,10);
else if substr(_var,1,17) eq 'Wellhead Pressure' then Initial_Wellhead_Pressure=substr(_var,19);

else if substr(_var,1,9) eq 'Pump Time' then Pump_Time=substr(_var,10);

else if substr(_var,1,25) eq 'Initial Wellhead Pressure' then Initial_Wellhead_Pressure=substr(_var,26);

else if substr(_var,1,21) eq 'Max Treating Pressure' then Max_Treating_Pressure=substr(_var,22);
else if substr(_var,1,18) eq 'Max Pressure (psi)' then Max_Treating_Pressure=substr(_var,20);
else if substr(_var,1,12) eq 'Max Pressure' then Max_Treating_Pressure=substr(_var,14);
else if substr(_var,1,7) eq 'Max PSI' then Max_Treating_Pressure=substr(_var,9);

else if substr(_var,1,21) eq 'Avg Treating Pressure' then Avg_Treating_Pressure=substr(_var,22);
else if substr(_var,1,18) eq 'Avg Pressure (psi)' then Avg_Treating_Pressure=substr(_var,20);
else if substr(_var,1,12) eq 'Avg Pressure' then Avg_Treating_Pressure=substr(_var,14);
else if substr(_var,1,7) eq 'Avg PSI' then Avg_Treating_Pressure=substr(_var,9);

else if substr(_var,1,15) eq 'Max Slurry Rate' then Max_Slurry_Rate=substr(_var,16);
else if substr(_var,1,14) eq 'Max Rate (bpm)' then Max_Slurry_Rate=substr(_var,16);
else if substr(_var,1,8) eq 'Max Rate' then Max_Slurry_Rate=substr(_var,10);

else if substr(_var,1,15) eq 'Avg Slurry Rate' then Avg_Slurry_Rate=substr(_var,16);
else if substr(_var,1,14) eq 'Avg Rate (bpm)' then Avg_Slurry_Rate=substr(_var,16);
else if substr(_var,1,8) eq 'Avg Rate' then Avg_Slurry_Rate=substr(_var,10);

else if substr(_var,1,7) eq 'Avg HHP' then Avg_HHP=substr(_var,8);

else if substr(_var,1,12) eq '15% HCl Acid' then Acid_15_Perc=substr(_var,13);
else if substr(_var,1,7) eq '15% HCl' then Acid_15_Perc=substr(_var,8);
else if substr(_var,1,11) eq 'Acid (bbls)' then Acid_15_Perc=substr(_var,13);
else if substr(_var,1,10) eq 'Acid(bbls)' then Acid_15_Perc=substr(_var,11);
else if substr(_var,1,5) eq 'Acid:' then Acid_15_Perc=substr(_var,6);
*else if substr(_var,1,4) eq 'Acid' then Acid_15_Perc=substr(_var,5);

else if substr(_var,1,11) eq 'FR-66 Water' then FR_66_Water=substr(_var,12);

else if substr(_var,1,23) eq '20# Waterfrac G (14 cP)' then _20_Pound_Waterfrac_G_14_cP=substr(_var,24);

else if substr(_var,1,21) eq '25# Waterfrac (19 cP)' then _25_Pound_Waterfrac_19_cP=substr(_var,22);
else if substr(_var,1,15) eq '25# WaterFrac G' then _25_Pound_Waterfrac_19_cP=substr(_var,16);
*else if substr(_var,1,15) eq '25# Waterfrac G' then _25_Pound_Waterfrac_19_cP=substr(_var,16);
else if substr(_var,1,23) eq '25# Waterfrac G (19 cP)' then _25_Pound_Waterfrac_19_cP=substr(_var,24);


else if substr(_var,1,19) eq '25# Hybor G (19 cP)' then _25_Pound_Hybor_G=substr(_var,20);
else if substr(_var,1,11) eq '25# Hybor G' then _25_Pound_Hybor_G=substr(_var,12);

*else if substr(_var,1,12) eq 'Clean Volume' then Clean_Volume=substr(_var,13);

else if substr(_var,1,13) eq 'Slurry Volume' then Slurry_Volume=substr(_var,14);
else if substr(_var,1,10) eq 'Slurry Vol' then Slurry_Volume=substr(_var,11);

else if substr(_var,1,18) eq 'Max Proppant Conc.' then Max_Proppant_Conc=substr(_var,19);
else if substr(_var,1,26) eq 'Max Proppant Concentration' then Max_Proppant_Conc=substr(_var,28);
else if substr(_var,1,13) eq 'Max Prop Conc' then Max_Proppant_Conc=substr(_var,15);

else if substr(_var,1,21) eq 'BH Max Proppant Conc.' then BH_Max_Proppant_Conc=substr(_var,22);

else if substr(_var,1,19) eq '30/50 Premium White' then _30_50_Premium_White=substr(_var,20);
else if substr(_var,1,11) eq '30/50 (lbs)' then _30_50_Premium_White=substr(_var,13);
else if substr(_var,1,10) eq '30/50(lbs)' then _30_50_Premium_White=substr(_var,11);
*else if substr(_var,1,6) eq '30/50 ' then _30_50_Premium_White=substr(_var,7);
else if substr(_var,1,13) eq '30/50 Premium' then _30_50_Premium_White=substr(_var,14);
else if substr(_var,1,10) eq '30/50 Sand' then _30_50_Premium_White=substr(_var,11);
else if substr(_var,1,12) eq '30/50  White' then _30_50_Premium_White=substr(_var,13);
else if substr(_var,1,11) eq '30/50 White' then _30_50_Premium_White=substr(_var,12);
else if substr(_var,1,11) eq '30/50 white' then _30_50_Premium_White=substr(_var,12);

else if substr(_var,1,10) eq '30/50  CRC' then _30_50_CRC=substr(_var,11);
else if substr(_var,1,16) eq '30/50  CRC Resin' then _30_50_CRC=substr(_var,17);
else if substr(_var,1,9) eq '30/50 CRC' then _30_50_CRC=substr(_var,10,18);

else if substr(_var,1,5) eq '30/70' then _30_70_White=substr(_var,6);
else if substr(_var,1,11) eq '30/70 White' then _30_70_White=substr(_var,12);

else if substr(_var,1,5) eq '30/35 CRC CRC Resin' then _30_35_CRC_CRC_Resin=substr(_var,6);

else if substr(_var,1,17) eq '20/40 White (lbs)' then _20_40_Premium_White=substr(_var,19);                  
else if substr(_var,1,12) eq '20/40  White' then _20_40_Premium_White=substr(_var,13);
else if substr(_var,1,19) eq '20/40 Premium White' then _20_40_Premium_White=substr(_var,20);
else if substr(_var,1,11) eq '20/40 White' then _20_40_Premium_White=substr(_var,12);
else if substr(_var,1,11) eq '20/40 white' then _20_40_Premium_White=substr(_var,12);
else if substr(_var,1,19) eq '20/40 Premium White' then _20_40_Premium_White=substr(_var,20);
else if substr(_var,1,11) eq '20/40(lbs)' then _20_40_Premium_White=substr(_var,12);
else if substr(_var,1,7) eq '20/40  ' then _20_40_Premium_White=substr(_var,8); 
else if substr(_var,1,10) eq '20/40 Sand' then _20_40_Premium_White=substr(_var,12); 

else if substr(_var,1,10) eq '20/40  CRC' then _20_40_CRC=substr(_var,11);
else if substr(_var,1,15) eq '20/40 CRC (lbs)' then _20_40_CRC=substr(_var,17);
else if substr(_var,1,16) eq '20/40  CRC Resin' then _20_40_CRC=substr(_var,17);
else if substr(_var,1,15) eq '20/40 CRC Resin' then _20_40_CRC=substr(_var,16);
else if substr(_var,1,9) eq '20/40 SLC' then _20_40_CRC=substr(_var,10);

else if substr(_var,1,3) eq 'CRC' then _20_40_CRC=substr(_var,5);


else if substr(_var,1,8) eq '100 Mesh' then _100_MESH=substr(_var,9);
else if substr(_var,1,8) eq '100 mesh' then _100_MESH=substr(_var,9);



else if substr(_var,1,30) eq 'Sandwedge ABC Plus Mass Coated' then Sandwedge_ABC_Plus_Mass_Coated=substr(_var,31);
else if substr(_var,1,29) eq 'Sandwedge ABC Max Mass Coated' then Sandwedge_ABC_Max_Mass_Coated=substr(_var,30);

else if substr(_var,1,20) eq 'Proppant Mass Pumped' then Proppant_Mass_Pumped=substr(_var,21);
else if substr(_var,1,20) eq 'Total Proppant (lbs)' then Proppant_Mass_Pumped=substr(_var,22);
else if substr(_var,1,19) eq 'Total Proppant(lbs)' then Proppant_Mass_Pumped=substr(_var,20);
else if substr(_var,1,10) eq 'Total Prop' then Proppant_Mass_Pumped=substr(_var,12);

else if substr(_var,1,21) eq 'Proppant Mass Counted' then Proppant_Mass_Counted=substr(_var,22);


else if substr(_var,1,18) eq 'Breakdown Pressure' then Breakdown_Pressure=substr(_var,19);
else if substr(_var,1,14) eq 'Breakdown(psi)' then Breakdown_Pressure=substr(_var,15);

else if substr(_var,1,16) eq 'Rate @ Breakdown' then Rate_At_Breakdown=substr(_var,17);
else if substr(_var,1,12) eq 'Ball on Seat' then Ball_on_Seat=substr(_var,13);

else if substr(_var,1,18) eq 'Acid Pressure Drop' then Acid_Pressure_Drop=substr(_var,19);

else if substr(_var,1,15) eq 'STP Before Acid' then STP_Before_Acid=substr(_var,16);

else if substr(_var,1,14) eq 'STP After Acid' then STP_After_Acid=substr(_var,15);

else if substr(_var,1,10) eq 'Final ISIP' then ISIP=substr(_var,11);
else if substr(_var,1,11) eq 'Finial ISIP' then ISIP=substr(_var,13);
else if substr(_var,1,10) eq 'ISIP (psi)' then ISIP=substr(_var,12);
else if substr(_var,1,5) eq 'ISIP:' then ISIP=substr(_var,6);
else if substr(_var,1,4) eq 'ISIP' then ISIP=substr(_var,5);



else if substr(_var,1,8) eq 'Final FG' then Final_FG=substr(_var,9);
else if substr(_var,1,13) eq 'Frac Gradient' then Final_FG=substr(_var,15);
else if substr(_var,1,9) eq 'Frac Grad' then Final_FG=substr(_var,10);

else if substr(_var,1,14) eq '5 Min. Shut-In' then _5_Min_Shut_In=substr(_var,15);
else if substr(_var,1,13) eq '5 min shut in' then _5_Min_Shut_In=substr(_var,15);
else if substr(_var,1,16) eq '5 Minute Shut-In' then _5_Min_Shut_In=substr(_var,17);
else if substr(_var,1,16) eq '5 Minute Shut In' then _5_Min_Shut_In=substr(_var,17);
else if substr(_var,1,6) eq '5 Min.' then _5_Min_Shut_In=substr(_var,8);
else if substr(_var,1,5) eq '5 min' then _5_Min_Shut_In=substr(_var,7);
else if substr(_var,1,13) eq '5 Min Shut-In' then _5_Min_Shut_In=substr(_var,14);


else if substr(_var,1,10) eq 'Pad Volume' then Pad_Volume=substr(_var,11);
else if substr(_var,1,10) eq 'Pad (bbls)' then Pad_Volume=substr(_var,12);
else if substr(_var,1,9) eq 'Pad(bbls)' then Pad_Volume=substr(_var,11);
else if substr(_var,1,4) eq 'Pad:' then Pad_Volume=substr(_var,5);
else if substr(_var,1,3) eq 'PAD' then Pad_Volume=substr(_var,4);

else if substr(_var,1,13) eq 'XL Pad Volume' then XL_Pad_Volume=substr(_var,14);

else if substr(_var,1,20) eq 'Clean Volume for PLF' then Clean_Volume_for_PLF=substr(_var,21);
else if substr(_var,1,18) eq 'Total Clean Volume' then Clean_Volume_for_PLF=substr(_var,20);
else if substr(_var,1,35) eq 'Total Clean Fluid to Recover (bbls)' then Clean_Volume_for_PLF=substr(_var,37);
else if substr(_var,1,22) eq 'Total Clean to Recover' then Clean_Volume_for_PLF=substr(_var,24);
else if substr(_var,1,11) eq 'Clean Fluid' then Clean_Volume_for_PLF=substr(_var,12);


else if substr(_var,1,15) eq 'Proppant Pumped' then Proppant_Pumped=substr(_var,16);
else if substr(_var,1,17) eq 'Proppant % Pumped' then Proppant_Pumped=substr(_var,18);
else if substr(_var,1,15) eq 'Proppant Placed' then Proppant_Pumped=substr(_var,16);
else if substr(_var,1,15) eq 'based on design' then Proppant_Pumped=substr(_var,17);
else if substr(_var,1,6) eq '% sand' then Proppant_Pumped=substr(_var,7);


_i+1;
end;
run;

 

I then use a second program to remove alpha characters and combine and convert into numeric

 

 

 

 

data work.StimInt_Comments;
set work.comments;
Wellhead_Pressure = input (compress(Initial_Wellhead_Pressure, ',psi(-) '), 4.);
drop Initial_Wellhead_Pressure;
rename Wellhead_Pressure=Initial_Wellhead_Pressure;

 Max_TreatingPressure= input (compress(Max_Treating_Pressure, ',psi(-) '), 4.);
drop Max_Treating_Pressure;
rename Max_TreatingPressure=Max_Treating_Pressure;

Avg_TreatingPressure= input (compress(Avg_Treating_Pressure, ',psi(-) '), 4.);
drop Avg_Treating_Pressure;
rename Avg_TreatingPressure=Avg_Treating_Pressure;

If length(compress(Max_Slurry_Rate,', bpm psi hp ()+- ')) le 4 then
Max_SlurryRate_bpm= input (compress(Max_Slurry_Rate,', bpm psi hp ()+- '), 2.1); else
Max_SlurryRate_bpm= input (compress(Max_Slurry_Rate,', bpm psi hp ()+- '), 4.1);
drop Max_Slurry_Rate;
*Max_SlurryRate_bpm_length=length (compress(Max_Slurry_Rate,', bpm psi hp ()+- '));


If length(compress(Avg_Slurry_Rate,', bpm psi hp ()+- ')) le 3 then
Avg_SlurryRate_bpm= input (compress(Avg_Slurry_Rate,', bpm psi hp ()+- '), 2.);else
Avg_SlurryRate_bpm= input (compress(Avg_Slurry_Rate,', bpm psi hp ()+- '), 4.);
*drop Avg_Slurry_Rate;
Avg_Slurry_Rate_length=length (compress(Avg_Slurry_Rate,', bpm psi hp ()+- '));

AvgHHP= input(compress(Avg_HHP,',+-', 'a'), 8.);
drop Avg_HHP;
rename AvgHHP=Avg_HHP;

Max_Proppant_Conc_lb_gal= input(compress(Max_Proppant_Conc,'/,+-', 'a'), 10.2);
drop Max_Proppant_Conc;

If length(_30_50_Premium_White) le 8 then
_30_50_PremiumWhite_lbs= input(compress(Left(_30_50_Premium_White),',+-', 'a'), 4.); else
_30_50_PremiumWhite_lbs= input(compress(Left(_30_50_Premium_White),',+-', 'a'), 6.);
drop _30_50_Premium_White;

_20_40_Premium_White_lb= input(compress(Left(_20_40_Premium_White),',+-', 'a'), 6.);
drop _20_40_Premium_White;
*rename _20_40_PremiumWhite=_20_40_Premium_White;

Breakdown_Pressure_psi= input(compress(Left(Breakdown_Pressure),',+-', 'a'), 5.);
drop Breakdown_Pressure;

If length(compress(Left(ISIP),',+-', 'a')) le 5 then
ISIP_psi= input(compress(Left(ISIP),',+-', 'a'), 4.);else
ISIP_psi= input(compress(Left(ISIP),',+-', 'a'), 5.);
*drop ISIP;
ISIP_psi_length=length(compress(Left(ISIP),',+-', 'a'));

If length(compress(Left(_5_Min_Shut_In),',+- ', 'a')) le 5 then
_5_Min_Shut_In_psi= input(compress(Left(_5_Min_Shut_In),',+- ', 'a'), 4.);else
_5_Min_Shut_In_psi= input(compress(Left(_5_Min_Shut_In),',+- ', 'a'), 5.);
*drop _5_Min_Shut_In;
_5_Min_Shut_In_length=length(compress(Left(_5_Min_Shut_In),',+- ', 'a'));

Rate_At_Breakdown_bpm= input(compress(Left(Rate_At_Breakdown),',+-', 'a'), 4.);
drop Rate_At_Breakdown;

STP_Before_Acid_psi= input(compress(Left(STP_Before_Acid),',+-', 'a'), 4.);
drop STP_Before_Acid;

STP_Before_Acid_psi= input(compress(Left(STP_Before_Acid),',+-', 'a'), 4.);
drop STP_Before_Acid;

STP_After_Acid_psi= input(compress(Left(STP_After_Acid),',+-', 'a'), 4.);
drop STP_After_Acid;

Acid_Pressure_Drop_psi= input(compress(Left(Acid_Pressure_Drop),',+-', 'a'), 4.);
drop Acid_Pressure_Drop;

If length(compress(Proppant_Mass_Pumped,', bpm psi hp ()+- ')) le 7 then
Proppant_Mass_Pumped_lb= input(compress(Left(Proppant_Mass_Pumped),',+-', 'a'), 6.);else
Proppant_Mass_Pumped_lb= input(compress(Left(Proppant_Mass_Pumped),',+-', 'a'), 7.);
*drop Proppant_Mass_Pumped;
Proppant_Mass_Pumped_lb_length=length (compress(Proppant_Mass_Pumped,', bpm psi hp ()+- '));

Proppant_Mass_Counted_lb= input(compress(Left(Proppant_Mass_Counted),',+-', 'a'), 7.);
drop Proppant_Mass_Counted;

IF index(Pad_Volume, 'gal') ne 0 then
Pad_Volume_Bbls = input(compress(Left(substr(Pad_Volume, index(Pad_Volume, 'gal'))),',+-', 'a'), 5.); else
If length(compress(Left(Pad_Volume),',+- ', 'a'))le 4 then
Pad_Volume_Bbls= input(compress(Left(Pad_Volume),',+- ', 'a'), 3.);else
Pad_Volume_Bbls= input(compress(Left(Pad_Volume),',+-', 'a'), 4.);
drop Pad_Volume;


IF index(XL_Pad_Volume, 'gal') ne 0 then
XL_Pad_Volume_Bbls = input(compress(Left(substr(XL_Pad_Volume, index(XL_Pad_Volume, 'gal'))),',+-', 'a'), 5.); else
If length(compress(Left(Pad_Volume),',+- ', 'a'))le 4 then
XL_Pad_Volume_Bbls= input(compress(Left(XL_Pad_Volume),',+- ', 'a'), 3.);else
XL_Pad_Volume_Bbls= input(compress(Left(XL_Pad_Volume),',+-', 'a'), 4.);
drop XL_Pad_Volume;
XL_Pad_Volume_Bbls_length = length(compress(Left(XL_Pad_Volume),',+- ', 'a'));


IF index(Clean_Volume_for_PLF, 'gal') ne 0 then
Clean_Volume_for_PLF_Bbls = input(compress(Left(substr(Clean_Volume_for_PLF, index(Clean_Volume_for_PLF, 'gal'))),',+-', 'a'), 5.); else
Clean_Volume_for_PLF_Bbls= input(compress(Left(Clean_Volume_for_PLF),',+-', 'a'), 4.);
*drop Clean_Volume_for_PLF;
Clean_Volume_for_PLF_length = length(compress(Left(Clean_Volume_for_PLF),',+- ', 'a'));


IF index(Acid_15_Perc, 'gal') ne 0 then
Acid_15_Perc_Bbl = input(compress(Left(substr(Acid_15_Perc, index(Acid_15_Perc, 'gal'))),',+-', 'a'), 3.); else
if length (compress(Left(Acid_15_Perc),',+- ', 'a')) le 3 then
Acid_15_Perc_Bbl= input(compress(Left(Acid_15_Perc),',+-', 'a'), 2.); else
Acid_15_Perc_Bbl= input(compress(Left(Acid_15_Perc),',+-', 'a'), 3.);
drop Acid_15_Perc;
*Acid_15_Perc_length = length(compress(Left(Acid_15_Perc),',+- ', 'a'));


IF index(Slurry_Volume, 'gal') ne 0 then
Slurry_Volume_Bbl = input(compress(Left(substr(Slurry_Volume, index(Slurry_Volume, 'gal'))),',+-', 'a'), 6.); else
if length (compress(Left(Slurry_Volume),',+- ', 'a')) le 5 then
Slurry_Volume_Bbl = input(compress(Left(Slurry_Volume),',+- ', 'a'), 4.);else
Slurry_Volume_Bbl = input(compress(Left(Slurry_Volume),',+-', 'a'), 6.);
*drop Slurry_Volume;
Slurry_Volume_length = length(compress(Left(Slurry_Volume),',+- ', 'a'));


IF index(FR_66_Water, 'gal') ne 0 then
FR_66_Water_Bbls = input(compress(Left(substr(FR_66_Water, index(FR_66_Water, 'gal'))),',+-', 'a'), 4.);else
FR_66_Water_Bbls = input(compress(Left(FR_66_Water),',+-', 'a'), 4.);
drop FR_66_Water;

IF index(_25_Pound_Hybor_G, 'gal') ne 0 then
_25_Pound_Hybor_G_Bbls = input(compress(Left(substr(_25_Pound_Hybor_G, index(_25_Pound_Hybor_G, 'gal'))),',+-', 'a'), 5.);else
_25_Pound_Hybor_G_Bbls = input(compress(Left(_25_Pound_Hybor_G),',+-', 'a'), 5.);
drop _25_Pound_Hybor_G;

IF index(_25_Pound_Waterfrac_19_cP, 'gal') ne 0 then
_25_Pound_Waterfrac_19_cP_Bbls = input(compress(Left(substr(_25_Pound_Waterfrac_19_cP, index(_25_Pound_Waterfrac_19_cP, 'gal'))),',+-', 'a'), 5.);else
_25_Pound_Waterfrac_19_cP_Bbls = input(compress(Left(_25_Pound_Waterfrac_19_cP),',+-', 'a'), 5.);
drop _25_Pound_Waterfrac_19_cP;

IF index(_20_Pound_Waterfrac_G_14_cP, 'gal') ne 0 then
_20_Pound_Waterfrac_G_14_cP_Bbls = input(compress(Left(substr(_20_Pound_Waterfrac_G_14_cP, index(_20_Pound_Waterfrac_G_14_cP, 'gal'))),',+-', 'a'), 5.);
else
_20_Pound_Waterfrac_G_14_cP_Bbls = input(compress(Left(_20_Pound_Waterfrac_G_14_cP),',+-', 'a'), 5.);
drop _20_Pound_Waterfrac_G_14_cP;

if length (compress(Left(Final_FG),',+- ', 'a')) le 9 then
Final_FG_psi_ft= input(compress(Left(Final_FG),',+- ', 'a'), 4.3);else
Final_FG_psi_ft= input(compress(Left(Final_FG),',+- ', 'a'), 5.3);
drop Final_FG;
Final_FG_length=length (compress(Final_FG,', bpm psi hp ()+- '));


BH_Max_Proppant_Conc_lb_gal= input(compress(Left(BH_Max_Proppant_Conc),',+- ', 'a'), 4.3);
drop BH_Max_Proppant_Conc;

Ball_on_Seat_Bbls= input(compress(Left(Ball_on_Seat),',+- ', 'a'), 3.);
drop Ball_on_Seat;

IF index(_30_50_CRC, 'E') ne 0 then
_30_50_CRC_lbs = input(compress(Left(substr(_30_50_CRC,11)),',+-', 'a'), 5.); 
else
_30_50_CRC_lbs= input(compress(Left(substr(_30_50_CRC,1,8)),',+-', 'a'), 8.);
drop _30_50_CRC;

_20_40_CRC_lbs= input(compress(Left(_20_40_CRC),',+- ', 'a'), 5.);
drop _20_40_CRC;



run;



 

I have attached the data set that i was working with 

PROC Star
Posts: 7,417

Re: Splitting a string from unstructured data into columns

Now I remember when you initially posted your question AND that you did include an example dataset that included two sets of records.

 

The two sets are different! API_IPO_UWI='4201334737' has items and values separated by tab characters for all records except for iternval_num=5. On the other hand, API_IPO_UWI='4225533069' DOESN'T have the items and values separated by tab characters.

 

The following would only work for the records that have the items and values separated by tab characters and deletes all of the other records:

 

proc import datafile="c:\art\work_testdata.xlsx" out=have replace dbms=excel;
  getnames=yes;
  mixed=yes;
  sheet='TESTDATA';
  usedate=yes;
  scantime=yes;
run;

 

data need;
  set have;
  length var $80;
  if API_IPO_UWI='4201334737' and interval_num ne 5;
  _i=1;
  do while (scan(comments,_i,'0A'x) ne '');
    var=scan(comments,_i,'0A'x);
    item=scan(var,1,'09'x);
    value=scan(var,2,'09'x);
    output;
    _i+1;
  end;
run;

 

proc transpose data=need out=want;
  by interval_num;
  var value;
  id item;
run;

 

Even with the above, there are a number of differences between the first record and the other records in the set.

 

Art, CEO, AnalystFinder.com

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 21 replies
  • 649 views
  • 2 likes
  • 2 in conversation