val_nikolajevs
Obsidian | Level 7

Hello

I would appreciate any advice and help on how to tackle a text processing task, and would like to find out the following:

1. Is there a way in SAS to scan a file with a non-standard extension (the file contains data meant to be processed by an application) and import only the portion that contains standard, human-readable alphanumeric characters, dropping all the encoded symbols that are meant to be decoded by the application?

2. Is there a way to scan a file for a specific identifier, "HEADER", grab n characters after it, and put them in a separate row, keeping everything in one column? This would be the best option.

 

As of now I have used the following:

1. Imported the file with "all" extensions and used the columns suggested by SAS (215 columns, 65370 rows; the largest length 32767), with a comma as the delimiter. The columns have different lengths and data is getting lost; I have tried different delimiters and numbers of columns.

2. Cleaned out all the gibberish symbols and kept only values that contain "HEADER":

data data_an;
  set data_in;
  array chars {*} _character_;
  do _n_ = 1 to dim(chars);
    chars{_n_} = compbl(prxchange("s/[^0-9 ABCDEFGHIJKLMNOPQRSTUVWXYZ-]//", -1, chars{_n_}));
    if index(chars{_n_}, "HEADER") = 0 then chars{_n_} = "";
  end;
run;

3. Next, dropped all columns that do not have any records (code found on this site; thank you). The output table has a single value containing "HEADER" in each row; however, the columns are random. For example, the output looks like the matrix below (1 = contains "HEADER"; the full matrix is about 65k rows by 21 columns):

1 0 0 0
0 0 1 0
1 0 0 0
0 1 0 0

 

 

ods select none;
ods output nlevels=temp;
proc freq data=data_an nlevels;
  tables _all_;
run;
ods select all;

proc sql noprint;
  select tablevar into :drop separated by ' '
  from temp
  where NNonMissLevels=0;
quit;

data data_an1;
  set data_an(drop=&drop);
run;

 

4. Created a column vector and deleted all empty values (F: is the list of columns with text; n is the row number):


proc transpose data=data_an1 out=data_an2;
  by n;
  var F:;
run;

 

data text;
  set data_an2;
  if COL1='' then delete;
run;

5. Extracted the portion after "HEADER" and checked whether the variable contains more than one instance of it. I may need to create a loop to capture the situation where there is more than one "HEADER" instance in the string (see the sketch after the code below). Is there a better way?

 

data text_out(drop=COL1 _NAME_ n);
  set text;

  text=substr(COL1, index(COL1, 'HEADER'), 60);

  *extract the first ID and Date, which are separated by a space;
  ID=scan(text, 2, ' ');
  DATE=scan(text, 3, ' ');

  *check whether another HEADER instance is present;
  index1=index(COL1, 'HEADER');
  text1=index(substr(COL1, index1+8), 'HEADER');
run;
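A minimal sketch of one way to capture every HEADER occurrence in COL1 rather than only the first (the dataset and variable names follow the step above; the 60-character window and the $32 lengths are assumptions, not part of the original post):

data text_out_all(keep=ID DATE);
  set text;
  length text $60 ID DATE $32;
  pos = find(COL1, 'HEADER');             /* position of the first HEADER        */
  do while (pos > 0);
    text = substr(COL1, pos);             /* truncated to 60 chars by the LENGTH */
    ID   = scan(text, 2, ' ');            /* first token after HEADER            */
    DATE = scan(text, 3, ' ');            /* second token after HEADER           */
    output;                               /* one row per HEADER occurrence       */
    pos = find(COL1, 'HEADER', pos + 6);  /* resume the search past this HEADER  */
  end;
run;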

 

The code is very slow: a single run takes about 20 minutes, and I am looking to process 1000-3000 files.

 

Thank you

10 REPLIES
Tom
Super User

You can read any file with SAS. 

Do you know what the file structure actually is? If not, the first step is to LOOK at the file.

For example this code will display the first 1,000 bytes of a file.

data _null_;
  infile "myfile.abc" lrecl=100 recfm=f obs=10;
  input;
  list;
run;

If you want to search for specific text as a trigger, such as the string 'HEADER' you mentioned, the INPUT statement can handle that. Say you want the first 20 bytes that appear after every occurrence of HEADER in the file.


data header ;
  infile 'myfile.abc' recfm=n ;
  input @ 'HEADER' string $char20. ;
run;

And once you figure out how to read one file of this type, there are two ways to read multiple files. One is to make a list of the files and run the same steps over and over, generating a separate dataset from each file. The other is to read ALL of the files at once into a single dataset, as in the sketch below.
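For example, a minimal sketch of the read-everything-at-once approach; the folder path, the extension, and the simple line-by-line read are placeholders for whatever per-file logic ends up working:

data all_files;
  length fname source $260 line $200;
  infile "C:\myfolder\*.hhf" filename=fname truncover;  /* wildcard: every matching file */
  input line $char200.;
  source = fname;    /* FILENAME= variable is not written out, so copy it to keep the source file */
run;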

val_nikolajevs
Obsidian | Level 7

Thank you.

Running the first step gave the following output, from which I only need "HEADER 000091503847 080822085601":

 

 

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1 HEADER 000091503847 080822085601 23742.22
2 02 14809.67 03 8932.552 04
3 A3 MVR FCS 20HEADTB000091503847 010322293000
4 0000000000 3F0D4F 080820220855210808202208545703702020304031320220
5 200110720210200

6 CHAR HEADTL000091503847 ....?...-.PSEM.. ..............
ZONE 2222222222224444543333333333332222222200003000205544002222222222222222222222222222222200000000000000
NUMR 00000000000085144C000091503847000000008080F780D0035D000000000000000000000000000000000000000000000000

7 CHAR ..................................................................
ZONE 0000000000000000000000000000000000000000000000000000000000000000002222222222222222222222222222222222
NUMR 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
2 The SAS System 09:23 Saturday, August 13, 2022

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0

8 CHAR RGDATA1.202208080856A3 HDRT
ZONE 2222222222222222222222222222222222222222222222222222222222222222222254445431333333333333432222224455
NUMR 0000000000000000000000000000000000000000000000000000000000000000000027414112202208080856130000008424

9 CHAR ABÿÿ&....&.......001000004002EE A3R A3 030414002STDTAB..O....O.........šEE ............ÿ.ðß.
ZONE 44FF2000020000000333333333333442243522222432223333333335545440040000400000001094422001100000000FAFD0
NUMR 12FF60100600000000010000040025500132000001300003041400234441200F0100F00000002AA5500203810DD35D6FD0F3

10 CHAR ?üðÁ.ÿÿ.>ﯠ.…ÿÿ.0.ÿ÷ø_.þÿ..Ûà¨à..4h`€.üó.$¥. ...g..‚õàSTDTAB.. .... .......EE A3R ....
ZONE 3FFC1FF03EAA08FF038FFF51FF11DEAE0036680FF02A0A0816008FE554544002000020000000442243522222000022222222
NUMR FC01FFF3EFF015FFF0FF78F0EFA6B080834800AC304500119700250344412100010000000000550013200000304200000000

 

Running the second portion gives me an error:

ERROR: The '@"STRING"' INPUT/PUT statement option is inconsistent with binary mode I/O. The execution of the DATA STEP is being terminated.

How do I fix that? I tried converting the HEADER search to binary, but no luck; SAS hangs up and loses the connection.

 

And you are correct, using import does not produce good results here. I have tried UTF-16, which seems to look cleaner and loads much faster, but I am not sure how to get the data back to a normal format.

 

filename test "\ID.HHF" encoding='utf-16';
proc import datafile=test out=work.dataset dbms=dlm replace;
  delimiter='090a'x;
  getnames=no;
  datarow=1;
run;

Thank you

 

Tom
Super User

Pasting the log notes into the forum as if they were paragraphs of text has destroyed the layout. Make sure to use the Insert Code or Insert SAS Code button to get a pop-up for pasting/editing text and programs.

Here is my best reconstruction of what you posted. (although the spacing in the first 500 bytes is probably way off)

RULE:   ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
 1      HEADER 000091503847 080822085601 23742.22
 2      02 14809.67 03 8932.552 04
 3      A3 MVR FCS 20HEADTB000091503847 010322293000
 4      0000000000 3F0D4F 080820220855210808202208545703702020304031320220
 5      200110720210200
 6 CHAR             HEADTL000091503847        ....?...-.PSEM..                                ..............
   ZONE 2222222222224444543333333333332222222200003000205544002222222222222222222222222222222200000000000000
   NUMR 00000000000085144C000091503847000000008080F780D0035D000000000000000000000000000000000000000000000000
 7 CHAR ..................................................................
   ZONE 0000000000000000000000000000000000000000000000000000000000000000002222222222222222222222222222222222
   NUMR 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
 8 CHAR                                                                     RGDATA1.202208080856A3      HDRT
   ZONE 2222222222222222222222222222222222222222222222222222222222222222222254445431333333333333432222224455
   NUMR 0000000000000000000000000000000000000000000000000000000000000000000027414112202208080856130000008424
 9 CHAR ABÿÿ&....&.......001000004002EE  A3R     A3   030414002STDTAB..O....O.........šEE  ............ÿ.ðß.
   ZONE 44FF2000020000000333333333333442243522222432223333333335545440040000400000001094422001100000000FAFD0
   NUMR 12FF60100600000000010000040025500132000001300003041400234441200F0100F00000002AA5500203810DD35D6FD0F3
10 CHAR ?üðÁ.ÿÿ.>ï¯ .…ÿÿ.0.ÿ÷ø_.þÿ..Ûà¨à..4h`€.üó.$¥. ...g..‚õàSTDTAB.. .... .......EE  A3R     ....
   ZONE 3FFC1FF03EAA08FF038FFF51FF11DEAE0036680FF02A0A0816008FE554544002000020000000442243522222000022222222
   NUMR FC01FFF3EFF015FFF0FF78F0EFA6B080834800AC304500119700250344412100010000000000550013200000304200000000

If the file always just has HEADER as the first 6 bytes and you just want to read the next 20 bytes then what you asked is trivial.

data want;
  infile 'myfile.abc' recfm=n ;
  input header $6.  otherstuff $char20. ;
  output;
  stop;
run;

To see the actual pattern try running the LIST statement on a few of the files and see if you see the pattern to where things are placed in the file.

 

val_nikolajevs
Obsidian | Level 7

😞 That is the problem: it is not in the first position, it is random. I have found a workaround:

 

data header;
  infile "\ID.HHF" recfm=n obs=1000;

  input string : $char32767.;
  text=compbl(prxchange("s/[^0-9 ABCDEFGHIJKLMNOPQRSTUVWXYZ-]//",-1,string));
  if text='HEADER' then do;
    count=1;
    output;
  end;
  else if count<3 then do;
    output;
    count=count+1;
  end;
  else delete;
  retain count;
run;

 

It puts everything into a single column vector (this was possible because there are spaces separating HEADER and the values, and INFILE without a delimiter treats each space as a delimiter), removes all the weird characters, finds HEADER, and grabs the next two rows, which is exactly what I need. Now it is just a matter of transposing back to a normal row/column format.
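A minimal sketch of that final reshaping (the dataset and variable names are taken from the workaround step above; this is an illustration, not the code actually used): restart a position counter at every HEADER token and pick up the next two tokens.

data wide(keep=ID DATE);
  set header;                          /* stacked output from the workaround step */
  retain ID DATE pos;
  length ID DATE $32;
  if text = 'HEADER' then pos = 0;     /* a new record starts here                */
  else pos = pos + 1;                  /* 1 = ID token, 2 = DATE token            */
  if pos = 1 then ID = text;
  else if pos = 2 then do;
    DATE = text;
    output;                            /* one row per HEADER group                */
  end;
run;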

THANK YOU for the infile/input direction!

Tom
Super User

I doubt the HEADER text is in a random location.  You just haven't figured out the pattern.

If the file starts with variable-length information, there are two main ways that information is stored.

One is to have some type of termination character, normally binary zero, '00'x in SAS syntax.

filename example temp;
data _null_;
  file example ;
  string='XXXXXXXXXXXX';
  do i=1 to 8 ;
    put string $varying10. i '00'x ' HEADER 9 10 20' ;
  end;
run;

data _null_;
 infile example;
 input;
 list;
run;

The other is to first specify the length.  So normally that would be stored as a binary integer value.  Could be 1, 2 or 4 bytes long perhaps.

filename example temp;
data _null_;
  file example ;
  string='XXXXXXXXXXXX';
  do i=1 to 8 ;
    put i pib2. string $varying10. i ' HEADER 9 10 20' ;
  end;
run;

data _null_;
 infile example;
 input;
 list;
run;
val_nikolajevs
Obsidian | Level 7

Tom --Thank you for the reply.

Disclosure: I do not have experience with processing files as binary.

I could not figure out the structure and make it work as you suggested. However, with some modifications of the prxchange code, I was able to extract the data I was looking for.

I can see the start/stop for each ID: when I view the file in Notepad, it ends/starts with "....ÿþÿþÿþÿþÿþÿþÿþÿþÿþÿþÿþÿþÿþÿþÿþÿHEADER...". However, this sequence is not located at a fixed column; it is random for each ID, as far as I can tell.

I will mark this as closed. It takes about 30-50 seconds to process a file, which is acceptable for this one-time task.

Once again thank you.

Val

Tom
Super User

Glad you got it working.

 

The LIST statement is your friend. It lets you see the file contents and exactly what the bytes contain. Wherever there are characters that are not normal printable characters, the LIST statement will display the whole line with the two-digit hex code for each character underneath it.

Tom
Super User

Sorry. You can only use the @ 'string' feature of INPUT when reading from files with records.

Either text that is broken into lines by end of line characters, the default. 

Or by reading fixed length records using LRECL= and RECFM=F.

Your example file does not appear to have end-of-line markers (at least not in the first 1000 bytes).

It is also not clear whether there is a fixed record length.
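For illustration, a minimal sketch of the RECFM=F route; the file name and the LRECL value here are guesses and would need to match the actual record size:

data header_fixed;
  infile 'myfile.abc' recfm=f lrecl=200;   /* fixed-length records: LRECL is a guess      */
  input @ 'HEADER' string $char20.;        /* @ 'string' is allowed once there are records */
run;

One caveat: with an arbitrary LRECL, a HEADER string that happens to straddle a record boundary would be missed.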

 

What you posted did not look like UTF-16.  That would have extra null bytes before every actual character.  I suspect it is just a binary format file where some of the fields have text and some of the fields are actually binary values.  Perhaps binary integer values, but perhaps other binary values.

 

Tom
Super User

Note that your first step

 

As of now I have used the following:

1. Imported the file with "all" extensions and used the columns suggested by SAS (215 columns, 65370 rows; the largest length 32767), with a comma as the delimiter. The columns have different lengths and data is getting lost; I have tried different delimiters and numbers of columns.

 

only makes sense if you expected the files to be delimited text files. 

But it sounds like you expect the files to be binary files with some areas of plain text.  So trying to use any "import" procedure is not going to have any value at all.

 

val_nikolajevs
Obsidian | Level 7

Really stuck on this one:

ERROR: The '@"STRING"' INPUT/PUT statement option is inconsistent with binary mode I/O. The execution of the DATA STEP is being
terminated.

It is not finding 'HEADER' and just puts everything into a single column with the specified $char20. length.

I have read the following documentation on INPUT, and it seems this expression should work, but it does not:

input @'HEADER' string  $char20.;

 

https://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000144370.htm#a00...

What could it be?

Thank you.

