Hello
I would appreciate any advice and help on how to tackle text processing task and would like to find out the following:
1. Is there a way in SAS to scan a file with non standard extension (file contains data that is made to be processed by an application) and import only portion that contains standard - human readable alpha numeric characters and drop all encoded symbols that should be used/decoded by the application.
2. is there a way to scan a file for a specific identifier "HEADER" and grab n charterers after it and put in a separate row keep everything in one column. --this would be the best.
As of now I have used the following:
1. Imported file with "all" extensions and used suggested columns by SAS (215 columns, 65370 rows; the largest lengh 32767) and delimiter used as coma. Columns have different length and data is getting lost--have tried different delimiters, and number of columns.
2. cleaned all from gibberish symbols and left only variables that contain "HEADER"
data data_an; set data_in; array chars {*} _character_; do _n_ = 1 to dim(chars); chars{_n_} = compbl(prxchange("s/[^0-9 ABCDEFGHIJKLMNOPQRSTUVWXYZ-]//",-1,chars{_n_})); if index(chars{_n_},"HEADER")=0 then chars{_n_}=""; end; run;
3. Next dropped all columns that do not have any record; Code found on this site--Thank you; The output table has single value with "HEADER" in each row;however, columns are random for example output matrix:
1 0 0 0
0 0 1 0
1 0 0 0
0 1 0 0
1- contains "HEADER" matrix dimensions 65k rows and 21 columns
ods select none; ods output nlevels=temp; proc freq data=data_an nlevels; tables _all_; run; ods select all; proc sql noprint; select tablevar into : drop separated by ' ' from temp where NNonMissLevels=0; quit; data data_an1; set data_an(drop=&drop); run;
4. Create a column vector and delete all empty. F: --columns with text, n row number.
proc transpose data=data_an1 out=data_an2; by n; var F:; run;
data text; set data_an2; if COL1='' then delete; run;
5. Extract portion after the "HEADER" and check if variable contains more than one instance of header--may need to create a loop to capture situation when more than "HEADER" instance in the string. Is there a better way?
data text_out(drop=COL1 _NAME_ n); set text;
text=substr(COL1,index(COL1, 'HEADER'),60);
*extract the first ID and Date that are separated by space; ID=scan(text, 2, ' '); DATE=scan(text, 3, ' ');
*check if there is another index presented;
index1=index(COL1, 'HEADER'); text1=index(substr(COL1,index1+8),'HEADER'); run;
The code is very slow single run takes about 20 min; I am looking to process 1000-3000 files.
Thank you
... View more