Solved: Reading raw data files when data is not in fixed columns

LAtwood · Posted 08-25-2011 10:02 AM

I have just recently found an error in the following attached code reading in a raw data file. We originally though the raw data file was in fixed columns, but just found out the data for afrpf and lbpf can sometime have a negative in front of the number and does not have a placeholder for the negative. In other words if afrpf contains a negative it then moves all data after it one space to the right which causes the code to pick up the wrong information. If both afrpf and lbpf contain negative all other information after is moved two spaces to the right.

I am only aware how to read in data that is in fixed columns so any help with this would be appreciated.

Thanks

Lori

DLing · Posted 08-30-2011 10:28 AM

As the other posters have noted, this is a really bad data structure to work with, and is causing anyone that needs to manipulate data enormous headache. Fixing the issue properly at source is the ideal solution, otherwise it will continue to be a maintenance nightmare. Whoever created the output should be made to read it back in in order to appreciate the consequences of poor design. Even if the data file can be changed to pure fixed position file, or a separator delimited file, there is still all of the existing instrumentation logfiles to deal with, so the problem doesn't go away entirely until the old files can be done away with.

The core issue, if I understand your situation correctly, is that when fields go negative, instead of prepending a minus sign and dropping the right most digit, thus keeping the same field width (probably didn't want to lose the field precision), the program writes a minus sign followed by the same number of columns, thus widening the field by 1. And this has happened at a few places in the file so far. This is really nasty since each time it happens, it will shove all the following field by 1, thus the minus sign won't necessarily all occur at the same position except the very first field. Approaches that scan for minus at pre-determined positions won't work well without enumerating all possible combinations.

SAS is extremely flexible in reading data files, so there's always a way , though it becomes increasing complex and fragile, and require more skills to maintain and diagnose problems. Here's a short program that illustrates an approach that should be helpful.

This first data step uses fixed column input. The first 2 records conform to the structure, and is read in OK. All following records cause problems.

data out;

input f1 4. f2 4. f3 4. f4 4. f5 4. f6 4. ;

cards;

0.1 0.2 0.3 0.4 0.5 0.6 OK

0.110.220.330.440.550.66 OK

0.11-0.220.330.440.550.66 f2 shifts by 1

0.11-0.220.33-0.440.550.66 f2, f4 each adds 1

0.11-0.22-0.33-0.44-0.55-0.66 f2, f3, f4, f5, f6 all adds 1

-0.11-0.22-0.33-0.44-0.55-0.66 everything is wider by 1

-0.11-0.220.33-0.440.55-0.66

;

NOTE: Invalid data for f4 in line 267 13-16.

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0

267 0.11-0.220.33-0.440.550.66 f2, f4 each adds 1

f1=0.11 f2=-0.2 f3=20.3 f4=. f5=440 f6=550 _ERROR_=1 _N_=4

NOTE: Invalid data for f3 in line 268 9-12.

NOTE: Invalid data for f4 in line 268 13-16.

NOTE: Invalid data for f5 in line 268 17-20.

268 0.11-0.22-0.33-0.44-0.55-0.66 f2, f3, f4, f5, f6 all adds 1

f1=0.11 f2=-0.2 f3=. f4=. f5=. f6=0.55 _ERROR_=1 _N_=5

NOTE: Invalid data for f2 in line 269 5-8.

NOTE: Invalid data for f3 in line 269 9-12.

NOTE: Invalid data for f4 in line 269 13-16.

269 -0.11-0.22-0.33-0.44-0.55-0.66 everything is wider by 1

f1=-0.1 f2=. f3=. f4=. f5=0.44 f6=-0.5 _ERROR_=1 _N_=6

NOTE: Invalid data for f2 in line 270 5-8.

NOTE: Invalid data for f4 in line 270 13-16.

NOTE: Invalid data for f6 in line 270 21-24.

270 -0.11-0.220.33-0.440.55-0.66

f1=-0.1 f2=. f3=220 f4=. f5=0.44 f6=. _ERROR_=1 _N_=7

NOTE: The data set WORK.OUT has 7 observations and 6 variables.

Inspection of "out" shows the 3rd record is mis-interpreted as well even though there was no error messages. Thus a mechanism to deal with these "occasionally 1 wider field" is needed. Here's an approach, with the macro saving a tonne of repetitive typing.

The macro, at the current column pointer, will attempt to read a variable "name" according to "fmt". It checks the first position to see if it is a minus sign, if so, it will know it's -'ve, and input the following positions as "fmt". If not, it will back up 1 position and read with "fmt".

%macro read( name, fmt );

input _byte $char1. @ ;

if _byte ^= '-' then input +(-1) &name &fmt @ ;

else do;

input &name &fmt @ ;

&name = - &name;

end;

drop _byte;

%mend;

data out1;

%read( f1, 4. )

%read( f2, 4. )

%read( f3, 4. )

%read( f4, 4. )

%read( f5, 4. )

%read( f6, 4. )

cards;

0.1 0.2 0.3 0.4 0.5 0.6 OK

0.110.220.330.440.550.66 OK

0.11-0.220.330.440.550.66 f2 shifts by 1

0.11-0.220.33-0.440.550.66 f2, f4 each adds 1

0.11-0.22-0.33-0.44-0.55-0.66 f2, f3, f4, f5, f6 all adds 1

-0.11-0.22-0.33-0.44-0.55-0.66 everything is wider by 1

-0.11-0.220.33-0.440.55-0.66

;

This code does read the input stream properly. You can intersperse within your program like:

input <known fields where this won't happen for sure> @ ; *--- must hold line ;

@read( afrpf, 4. )

@read( lbpf, 4. )

... etc etc

Or you might read all fields this way if they all have this possibility. Hopefully this suits your needs and at least get you further along in dealing with this awkward data structure. As other posters have said though, it really is best to fix the issue properly, although in reality, it will probably cost $$$ to re-write the instrumentation logging program and if you can deal with the data, then it won't get fixed. Us poor buggers at the end of the chain has to suffer all of the poor design decisions made prior and clean up their messes.

Message was edited by: Daymond Ling (typo fixes)

View solution in original post

art297 · Posted 08-25-2011 10:10 AM

I think you will likely get a best answer if you also provide a small sample dataset that shows how the data actually appear.

LAtwood · Posted 08-25-2011 10:14 AM

I have attached the data also. Thanks

art297 · Posted 08-25-2011 10:35 AM

You can definitely input virtually any kind of data with SAS, but your file looks like it is quite formatted. Can you provide one more sample, namely for the second record in your attached file, what you are expecting the resulting file to look like.

LAtwood · Posted 08-25-2011 10:40 AM

The file 616344 includes a negative for the afrpf value and the file 547299 is what we expected the data files to look like (no negatives).

art297 · Posted 08-25-2011 11:00 AM

I still don't understand. The 3rd line of file 616344 appears as follows:

Wed 08-24-2011 14:34:40101001-50 -20 0 0 4 0 -90 -60 240 24301830302 300 3084 2996 -108 3217 0 3217 6 6 0 0 66 66 0 0 0 0 0 41.916.00.00 50 -50 -0.00.042 0.010.02491011000.00 0.0 0.0100

Which field(s) in the above record is not being read correctly, what SAS file variable(s) should the value be assigned to, and what value(s) should it have?

LAtwood · Posted 08-25-2011 11:32 AM

For the 616344 record this is what is happening when the code is ran:

day= Wed

sdate= 08-24-2011

stime= 14:34:40

pwnd1= 1

pwnd2= 0

punwnd1= 1

punwnd2= 0

unchp200= 0

rechp500= 1

w1zpf1= -5

w1zps1= 0

w1zpf2= -2

w1zps2= 0

w1zpf3= 0

w1zps3= 0

w1zpf4= 4

w1zps4= 0

w1zpf5= -9

w1zps5= 0

w1zpf6= -6

w1zps6= 0

w1zpf7= 24

w1zps7= 0

w2pf= 24

w2ps= 30

w3pf= 18

w3ps= 30

unctft1= 302

unctst1= 300

metft2= 3084

metst2= 2996

metft3= -108

metst3= 3217

rectft4= 0

rectst4= 3217

pwndf= 6

pwnds= 6

punwndf= 0

punwnds= 0

mlspdf= 66

mlspds= 66

entspf= 0

extspf= 0

entqf= 0

extqf= 0

footcntf= 0

unodf= 41.9

reodf= 16.0

sefv= 0.00

tmspdf= 50

bmspdf= -50

afrpf= -0. (should be 0.0) Root of the problem starts here, should not have a negative.

lbpf= 00. (should be 0.0)

trffv= 042 (should be 42)

scpfd= 0.0 (should be 0.01)

scpfo= 10.0 (should be 0.02)

srpv= 24 (should be 49)

entbridl= 9 (should be 1)

pass= 1 (should be 0)

extbridl=0 (should be 1)

reelor= 1 (should be 1, ends up being correct)

drywettm= 1 (should be 0)

stripoil= 0 (should be 0, ends up being correct)

elongset= 00.00 (should be 0.00)

oilrate= 0.0 (should be 0.0, ends up being correct)

thicknes= 0.010 (should be 0.0100)

DLing · Posted 08-25-2011 11:35 AM

With both data files, data look consistent up to column 159, with difference occuring at column 160.

If column 160 contains a '-' then that field is 1 char wider, and everything gets pushed right.

SAS is extremely flexible in handling inputs, so if I understand your problem and data correctly, then something like this should work:

input .... <up to column 159> @160 byte $char1. @; /* read up to 159, read single character @ 160, @ holds the line */

if byte = '-' then input @160 (afrpf or lbpf) 4. <followed by list of variables> ..... ;

else input @160 (afrpf or lbpf) 3. <followed by list of variables> ..... ;

(I'm not sure what field is at column 160, put the appropriate variable name there)

This method can be repeated as often as necessary to keep on shifting right as necessary.

Hope this helps. Good Luck.

SAS can read, with probability 1, anything! http://www2.sas.com/proceedings/sugi29/253-29.pdf

art297 · Posted 08-25-2011 12:12 PM

I'm not at all sure that I understand what the problem is, but it appears that if col 160 has a minus sign in it, that you simply want to skip that column .. otherwise you don't want to skip it.

If that is the case, then one way of fixing the problem would simply be to break the input into 3 parts rather than one. I modified your code as follows and obtained the data that you indicated was desired:

bmspdf 5. /* Bottom Mill Motor Speed Fdbk */

@;

input @160 testchar $char1. @;

if testchar ne "-" then do;

input @159 testchar $char1. @;

end;

input

afrpf 3. /* AntiFlutter Roll Posn Fdbk */

LAtwood · Posted 08-29-2011 04:12 PM

DLing,

I'm having issues repeating this method with another column. I can do it correctly using it on the afrpf value @160. I will attach the code and call it NEW. However, the variable right after afrpf which is lbpf sometimes has a negative also and I need to shift values over again. Any help would be appreciated.

Thank

art297 · Posted 08-29-2011 04:19 PM

Did you try the approach suggested by Howles? It might be exactly what you are looking for.

Howles · Posted 08-26-2011 09:01 AM

I might approach this by inspecting the _INFILE_ buffer at each location which might hold a minus sign, then inserting a blank as a placeholder if there is no minus sign. For example:

data want ;

input @ ;

do minuscol = 4 , 11 ;

if substr( _infile_ , minuscol , 1 ) NE '-'

then substr( _infile_ , minuscol ) =

cat( ' ' , substr( _infile_ , minuscol ) ) ;

end ;

input a 3. b 4. c 3. d 4. ;

cards ;

1.11.21.31.4

2.1-2.22.32.4

3.13.23.3-3.4

4.1-4.24.3-4.4

;

Comments

1. This problem as originally posted put a lot of work on the reader: unzipping files, examining long records, and sifting through a lot of code which did not address the issue. A simple, compact example focusing on the problem at hand will get a lot more attention.

2. If possible, educate the data provider. Files structured like this are substandard.

DLing · Posted 08-30-2011 10:28 AM

As the other posters have noted, this is a really bad data structure to work with, and is causing anyone that needs to manipulate data enormous headache. Fixing the issue properly at source is the ideal solution, otherwise it will continue to be a maintenance nightmare. Whoever created the output should be made to read it back in in order to appreciate the consequences of poor design. Even if the data file can be changed to pure fixed position file, or a separator delimited file, there is still all of the existing instrumentation logfiles to deal with, so the problem doesn't go away entirely until the old files can be done away with.

The core issue, if I understand your situation correctly, is that when fields go negative, instead of prepending a minus sign and dropping the right most digit, thus keeping the same field width (probably didn't want to lose the field precision), the program writes a minus sign followed by the same number of columns, thus widening the field by 1. And this has happened at a few places in the file so far. This is really nasty since each time it happens, it will shove all the following field by 1, thus the minus sign won't necessarily all occur at the same position except the very first field. Approaches that scan for minus at pre-determined positions won't work well without enumerating all possible combinations.

SAS is extremely flexible in reading data files, so there's always a way , though it becomes increasing complex and fragile, and require more skills to maintain and diagnose problems. Here's a short program that illustrates an approach that should be helpful.

This first data step uses fixed column input. The first 2 records conform to the structure, and is read in OK. All following records cause problems.

data out;

input f1 4. f2 4. f3 4. f4 4. f5 4. f6 4. ;

cards;

0.1 0.2 0.3 0.4 0.5 0.6 OK

0.110.220.330.440.550.66 OK

0.11-0.220.330.440.550.66 f2 shifts by 1

0.11-0.220.33-0.440.550.66 f2, f4 each adds 1

0.11-0.22-0.33-0.44-0.55-0.66 f2, f3, f4, f5, f6 all adds 1

-0.11-0.22-0.33-0.44-0.55-0.66 everything is wider by 1

-0.11-0.220.33-0.440.55-0.66

;

NOTE: Invalid data for f4 in line 267 13-16.

RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0

267 0.11-0.220.33-0.440.550.66 f2, f4 each adds 1

f1=0.11 f2=-0.2 f3=20.3 f4=. f5=440 f6=550 _ERROR_=1 _N_=4

NOTE: Invalid data for f3 in line 268 9-12.

NOTE: Invalid data for f4 in line 268 13-16.

NOTE: Invalid data for f5 in line 268 17-20.

268 0.11-0.22-0.33-0.44-0.55-0.66 f2, f3, f4, f5, f6 all adds 1

f1=0.11 f2=-0.2 f3=. f4=. f5=. f6=0.55 _ERROR_=1 _N_=5

NOTE: Invalid data for f2 in line 269 5-8.

NOTE: Invalid data for f3 in line 269 9-12.

NOTE: Invalid data for f4 in line 269 13-16.

269 -0.11-0.22-0.33-0.44-0.55-0.66 everything is wider by 1

f1=-0.1 f2=. f3=. f4=. f5=0.44 f6=-0.5 _ERROR_=1 _N_=6

NOTE: Invalid data for f2 in line 270 5-8.

NOTE: Invalid data for f4 in line 270 13-16.

NOTE: Invalid data for f6 in line 270 21-24.

270 -0.11-0.220.33-0.440.55-0.66

f1=-0.1 f2=. f3=220 f4=. f5=0.44 f6=. _ERROR_=1 _N_=7

NOTE: The data set WORK.OUT has 7 observations and 6 variables.

Inspection of "out" shows the 3rd record is mis-interpreted as well even though there was no error messages. Thus a mechanism to deal with these "occasionally 1 wider field" is needed. Here's an approach, with the macro saving a tonne of repetitive typing.

The macro, at the current column pointer, will attempt to read a variable "name" according to "fmt". It checks the first position to see if it is a minus sign, if so, it will know it's -'ve, and input the following positions as "fmt". If not, it will back up 1 position and read with "fmt".

%macro read( name, fmt );

input _byte $char1. @ ;

if _byte ^= '-' then input +(-1) &name &fmt @ ;

else do;

input &name &fmt @ ;

&name = - &name;

end;

drop _byte;

%mend;

data out1;

%read( f1, 4. )

%read( f2, 4. )

%read( f3, 4. )

%read( f4, 4. )

%read( f5, 4. )

%read( f6, 4. )

cards;

0.1 0.2 0.3 0.4 0.5 0.6 OK

0.110.220.330.440.550.66 OK

0.11-0.220.330.440.550.66 f2 shifts by 1

0.11-0.220.33-0.440.550.66 f2, f4 each adds 1

0.11-0.22-0.33-0.44-0.55-0.66 f2, f3, f4, f5, f6 all adds 1

-0.11-0.22-0.33-0.44-0.55-0.66 everything is wider by 1

-0.11-0.220.33-0.440.55-0.66

;

This code does read the input stream properly. You can intersperse within your program like:

input <known fields where this won't happen for sure> @ ; *--- must hold line ;

@read( afrpf, 4. )

@read( lbpf, 4. )

... etc etc

Or you might read all fields this way if they all have this possibility. Hopefully this suits your needs and at least get you further along in dealing with this awkward data structure. As other posters have said though, it really is best to fix the issue properly, although in reality, it will probably cost $$$ to re-write the instrumentation logging program and if you can deal with the data, then it won't get fixed. Us poor buggers at the end of the chain has to suffer all of the poor design decisions made prior and clean up their messes.

Message was edited by: Daymond Ling (typo fixes)

LAtwood · Posted 08-30-2011 11:30 AM

You are a lifesaver! I wish there was a possibility to fix the root of the problem by fixing the data, but its just not a possibility right now. Thank you so much for helping me out on this.

Tom · Posted 08-30-2011 02:12 PM

If your input fields are all numbers then you can use data step logic to do it without macro code.

Create an array of the target variables and another with the expected length of each variable.

Then loop over the arrays keeping your own pointer into the input buffer. You can adjust the length depending on whether the value starts with a minus sign.

data out;

array var f1-f6 ;

array len (6) _temporary_ (6*4) ;

input @;

col=1;

do _n_ = 1 to dim(var);

neg=substr(_infile_,col,1)='-';

var(_n_)=input(substr(_infile_,col,len(_n_)+neg),best.);

col+len(_n_)+neg;

end;

drop col neg ;

put (var(*)) (best6.);

cards;

0.1 0.2 0.3 0.4 0.5 0.6 OK

0.110.220.330.440.550.66 OK

0.11-0.220.330.440.550.66 f2 shifts by 1

0.11-0.220.33-0.440.550.66 f2, f4 each adds 1

0.11-0.22-0.33-0.44-0.55-0.66 f2, f3, f4, f5, f6 all adds 1

-0.11-0.22-0.33-0.44-0.55-0.66 everything is wider by 1

-0.11-0.220.33-0.440.55-0.66

;

0.1 0.2 0.3 0.4 0.5 0.6

0.11 0.22 0.33 0.44 0.55 0.66

0.11 -0.22 0.33 0.44 0.55 0.66

0.11 -0.22 0.33 -0.44 0.55 0.66

0.11 -0.22 -0.33 -0.44 -0.55 -0.66

-0.11 -0.22 -0.33 -0.44 -0.55 -0.66

-0.11 -0.22 0.33 -0.44 0.55 -0.66

Reading raw data files when data is not in fixed columns

Re: Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Re: Reading raw data files when data is not in fixed columns

Re: Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Re: Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Re: Reading raw data files when data is not in fixed columns

Reading raw data files when data is not in fixed columns

Re: Reading raw data files when data is not in fixed columns

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!