Help - Infile Multiple variable Length Fields

Caz · Posted 12-06-2011 04:15 AM

I have received a very bizarre text file that I need to read into SAS and have become stuck. Here is a flavour of what it is like

{1:ABCD}{2:10234}{4:40.45}

{1:TRHG}{2:38976}{3:{H132}}{4:567.43}

{1:LKMN}{2:23109}{4:10432.67}

{1:HGKO}{2:19876}{3:{J987}{F102}}{4:4320.12}

{1:DUHN}{2:23098}{3:{G123}{H987}{J234}}{4:345.23}

I have started by using code such as

DATA pays;

INFILE DD TRUNCOVER SCANOVER;

INPUT @'{1:' Field1 £4.

@'{2:' Field2 £5.;

RUN;

This read the first 2 columns of data fine but I am now struggling to figure out how to do the rest and wonder if anyone out there can assist?

The results that I am seeking

Field1 Field2 Field3 Field4

ABCD 10234 40.45

TRHG 38976 {H132} 567.43

LKMN 23109 10432.67

HGKO 19876 {J987}{F102} 4320.12

DUHN 23098 {G123}{H987}{J234} 345.23

Peter_C · Posted 12-06-2011 05:33 AM

*

most people know of the INFILE option DLM= and use it with a constant like ',' '|' and '09'x.

What fewer appreciate is that this option can define a variable name.

This capablility allows you to change the delimiter each time you INPUT

;

data test( keep= field1-field4);

infile cards dlm=dlm truncover column=col;

call missing( col, len ) ;

dlm='{:}' ;

do while( col LE len ) ;

input var_ind @ ;

if not len then len = length(_infile_) ;

select (var_ind) ;

when (1) do ;

dlm= '}{' ;

input field1 :$10. @ ;

end;

when (2) do ;

dlm= '}{' ;

input field2 :32. @ ;

end;

when (3) do ;

dlm= ':' ;

input field3 :$100. @ ;

if col < len then input @+(-2) @ ;

*repositioning to the field number ;

substr( field3, length( field3) -2 ) = ' ' ;

* removing the trailing }{#: ;

end;

when (4) do ;

dlm= '}{:' ;

input field4 :32. @ ;

end;

otherwise put '???' _infile_ ;

end ;

dlm='{:}' ;

end ;

output ;

format field1 $6. field2 field4 best8. field3 $22. ;

cards ;

{1:ABCD}{2:10234}{4:40.45}

{1:TRHG}{2:38976}{3:{H132}}{4:567.43}

{1:LKMN}{2:23109}{4:10432.67}

{1:HGKO}{2:19876}{3:{J987}{F102}}{4:4320.12}

{1:DUHN}{2:23098}{3:{G123}{H987}{J234}}{4:345.23}

;

Caz · Posted 12-06-2011 05:44 AM

Thank you very much Peter. It certainly works.

I did not realise it would appear so complex, now to get my head around it.......:smileyconfused:

Peter_C · Posted 12-06-2011 12:27 PM

Caz

the program is about as complicated as your data - the SELECT/WHEN group provide organisation for parsing and inputting the differing field types. ( cannot use { and} to delimit FIELD3 so I used the : after the next field indicator - might be interesting when 3 is last on a row 🙂

Perhaps some perl regex wizard might have some way of parsing the content into strings in a single statement, but I fail to see how that would provide the control to specify the informats for each field. You will still need use the field indicator in the data to pick out the apropriate informat.

peter

ps

any regex wizard want to get in touch, I'd be interested in a related issue

FriedEgg · Posted 12-06-2011 08:28 PM

That was a very fun problem. Very difficult for me to come up with the right regular expression.

data test;

length f $32 field1-field4 $30;

if _n_=1 then prx=prxparse('/(\{\d:.+?(?=(\{\d\:)|(\Z)))/');

retain prx;

input@;

start=1;

stop=length(_infile_);

call prxnext(prx,start,stop,_infile_,p,l);

do while(p>0);

f=substr(_infile_,p+1,l-2);

select(scan(f,1,':'));

when(1)field1=scan(f,2,':');

when(2)field2=scan(f,2,':');

when(3)field3=scan(f,2,':');

when(4)field4=scan(f,2,':');

otherwise;

end;

call prxnext(prx,start,stop,_infile_,p,l);

end;

output;

keep field1-field4;

cards;

{1:ABCD}{2:10234}{4:40.45}

{1:TRHG}{2:38976}{3:{H132}}{4:567.43}

{1:LKMN}{2:23109}{4:10432.67}

{1:HGKO}{2:19876}{3:{J987}{F102}}{4:4320.12}

{1:DUHN}{2:23098}{3:{G123}{H987}{J234}}{4:345.23}

;

run;

Tom · Posted 12-06-2011 10:16 PM

Not that hard if you use truncover and read initially as strings and then add a little logic to remove the trailing characters and convert to numeric if wanted.

data x ;
  infile cards truncover ;
  input @1 @'1:' field1 $4.
        @1 @'2:' field2c $12.
        @1 @'3:' field3 $200.
        @1 @'4:' field4c $12.
  ;
  field2 = input(scan(field2c,1,'}'),12.);
  field4 = input(scan(field4c,1,'}'),12.);
  if index(field3,':') then field3=substr(field3,1,index(field3,':')-4);
  put (field1-field4) (=/);
cards;
{1:ABCD}{2:10234}{4:40.45}
{1:TRHG}{2:38976}{3:{H132}}{4:567.43}
{1:LKMN}{2:23109}{4:10432.67}
{1:HGKO}{2:19876}{3:{J987}{F102}}{4:4320.12}
{1:DUHN}{2:23098}{3:{G123}{H987}{J234}}{4:345.23}
run;

field1=ABCD
field2=10234
field3=
field4=40.45
field1=TRHG
field2=38976
field3={H132}
field4=567.43
field1=LKMN
field2=23109
field3=
field4=10432.67
field1=HGKO
field2=19876
field3={J987}{F102}
field4=4320.12
field1=DUHN
field2=23098
field3={G123}{H987}{J234}
field4=345.23

You might need more logic if you have records with field 3 without a trailing field 4 to eliminate the extra close curly bracket.

You also might want to split the field 3 values into multiple records.

data x ;
  infile cards truncover ;
  input @1 @'1:' field1 $4.
        @1 @'2:' field2c $12.
        @1 @'3:' field3x $200.
        @1 @'4:' field4c $12.
  ;
  field2 = input(scan(field2c,1,'}'),12.);
  field4 = input(scan(field4c,1,'}'),12.);
  if index(field3x,':') then field3x=substr(field3x,1,index(field3x,':')-4);
  length field3 $4 ;
  put (field1 field2 field4) (=/);
  do i=1 by 1 until (field3=' ');
    field3=scan(field3x,i,'{}');
    if i=1 or field3 ne ' ' then do;
      output;
      put field3= ;
    end;
  end;
cards;
{1:ABCD}{2:10234}{4:40.45}
{1:TRHG}{2:38976}{3:{H132}}{4:567.43}
{1:LKMN}{2:23109}{4:10432.67}
{1:HGKO}{2:19876}{3:{J987}{F102}}{4:4320.12}
{1:DUHN}{2:23098}{3:{G123}{H987}{J234}}{4:345.23}
run;

field1=ABCD
field2=10234
field4=40.45
field3=
field1=TRHG
field2=38976
field4=567.43
field3=H132
field1=LKMN
field2=23109
field4=10432.67
field3=
field1=HGKO
field2=19876
field4=4320.12
field3=J987
field3=F102
field1=DUHN
field2=23098
field4=345.23
field3=G123
field3=H987
field3=J234

Caz · Posted 12-07-2011 10:17 AM

Thank you very much Tom.

This is the easiest option for me to follow and naturally follows what I was exploring.

I have today taken this idea forward to the much bigger, nastier file that I have and it is working reasonably well.

What is the significance of the @1 on each of the input rows? Is that forcing it to stay on the one input row at a time?

Tom · Posted 12-07-2011 10:34 AM

The @1 will move the pointer back to column 1. This is to prevent if from missing fields because the pointer has moved past them. For example in the line without any {3:....} field the pointer would have moved to the end of the line so the {4:...} field would not be found.

If your file has mulitple copies of the same field (for example {1:XXXX}{1:YYYY} then you would need more complicated logic to read that.

Ksharp · Posted 12-07-2011 01:23 AM

Then The fourth input method(named input) would be helpful.

data _null_ ;
file x ;
retain max .;
input;
*get magic number;
max=max(max,input(scan(_infile_,-3,':{}','M'),best8.));
call symputx('max',max);
_infile_=prxchange('s/\}$//o',1,strip(_infile_));
_infile_=prxchange('s/(\{|\}\{)(?=\d+:)/ field/o',-1,_infile_);
_infile_=prxchange('s/:/=/o',-1,_infile_); 
put _infile_;
datalines;
{1:ABCD}{2:10234}{4:40.45}
{1:TRHG}{2:38976}{3:{H132}}{4:567.43}
{1:LKMN}{2:23109}{4:10432.67}
{1:HGKO}{2:19876}{3:{J987} {F102}}{4:4320.12}
{1:DUHN}{2:23098}{3:{G123} {H987}{J234}}{4:345.23}
run;
%put &max;

data x;
 infile x ;
 input (field1 - field&max.) (= $100.);
run;

Ksharp

Message was edited by: xia keshan

MikeZdeb · Posted 12-07-2011 12:00 PM

hi ... similar to Tom's idea ...

data x (drop=temp:);

infile datalines truncover;

input

@1 @'{1:' field1 $4.

@1 @'{2:' field2 $5.

@1 @'{3:' temp3 $100.

@1 @'{4:' temp4 $10.

;

if ^missing(temp3) then field3 = substr(temp3,1,find(temp3,'}}'));

field4 = input(compress(temp4,'.','kd'),10.);

datalines;

{1:ABCD}{2:10234}{4:40.45}

{1:TRHG}{2:38976}{3:{H132}}{4:567.43}

{1:LKMN}{2:23109}{4:10432.67}

{1:HGKO}{2:19876}{3:{J987}{F102}}{4:4320.12}

{1:DUHN}{2:23098}{3:{G123}{H987}{J234}}{4:345.23}

;

run;