DATA Step, Macro, Functions and more

How to read in a text data file with special structure

Reply
Occasional Contributor
Posts: 14

How to read in a text data file with special structure

Hey, all

I am trying to read in some data as attached. The data structure is using “1:” or “2:”, etc. to index columns. I used PROC IMPORT and the code I used is as below. However, the code cannot put the missings into the right columns when reading it. For example, on the 3rd row, ‘5:0.223514’ should be read into the 6th column instead of the 5th. Could you tell me if there is a way for me to handle such case?

Thanks.

proc import datafile="\\C:\data.txt"

      missover dbms=dlm out=work.a replace;

      getnames=no;

      delimiter=' ';

      guessingrows=1000;

run;

Attachment
PROC Star
Posts: 7,467

How to read in a text data file with special structure

Posted in reply to flyingsohigh

Your proc import contains a missover statement/option which isn't an acceptable option that I'm familiar with.

However, more importantly, there is nothing in your data file that indicates missing values.  There is simply a single space between all values.  If there were periods with one leading space before after the period to indicate that a value was missing, proc import works quite nicely.

Valued Guide
Posts: 2,177

How to read in a text data file with special structure

Posted in reply to flyingsohigh

import is not effective for this kind of structure

A data step  like

data special( compress=yes);

  infile  "C:\temp\data.txt"  lrecl=10000 dlm=' :' truncover column=c length=l ;

  input record_type @ ;

  array col(100) ;    * allows up to 100 items in the data ;

  do until( c >= l ) ; * when reached the end-of-line ;

     input     pointer @ ; 

     if not    pointer then leave ; *just in case ;

     input col(pointer) @ ;

  end ;

run ;

Super User
Super User
Posts: 7,039

How to read in a text data file with special structure

Posted in reply to flyingsohigh

That is not that hard to read using a data step.  Read it into variable/value pairs. Here is a program that can read your data. 

Using delimiter of both a space and a colon will make it easy to read the variable number and its value without having to worry about any string manipulation.

Not sure what that first value represents, so I just named it variable zero.  I added ROW variable uniquely identify the rows.

Then you can use PROC TRANSPOSE to convert it back to a rectagular format.

The last step will both order the variables in the data vector and also creates all of the variables not actually named. 

The example data seems to mainly be variable 1 to 11 but there is also others such as 26 and 43.

data vert ;

retain maxvar 0;

infile cards truncover dlm=' :';

length var value 8 ;

input value @ ;

row +1 ;

var=0; output;

do until (var= .);

   input var value @;

   if var ne . then output;

   if var > maxvar then do ;

      maxvar=var;

      call symputx('MAXVAR',maxvar);

   end;

end;

cards;

2 1:0.392696 2:0.347222 3:0.30303 4:0.0479599 5:0.255814 6:0.241534 7:0.980315 8:0.870079 9:0.354331 10:0.72048 11:1 43:1

2 1:0.377189 2:0.102778 3:0.0757576 4:0.154617 5:0.267442 6:0.213433 7:0.866142 8:0.901575 9:0.574803 10:0.671685 11:1 26:1

2 1:0.535268 2:0.988889 3:0.181818 5:0.223514 6:0.808908 7:0.791339 8:0.862205 9:0.610236 10:0.560017 11:1 38:1

2 1:0.454227 2:0.852778 3:0.181818 4:0.0214746 5:0.23385 6:0.392862 7:0.73622 8:0.917323 9:0.732283 10:0.863237 11:1 43:1

1 1:0.566283 2:0.136111 3:0.19697 4:0.0608447 5:0.223514 6:0.871013 7:0.885827 8:0.830709 9:0.46063 10:0.503694 11:1 38:1

2 1:0.54027 2:0.155556 3:0.272727 4:0.0680029 5:0.249354 6:0.78165 7:0.901575 8:0.787402 9:0.385827 10:0.588875 11:1 38:1

2 1:0.33917 2:0.0194444 3:0.181818 5:0.223514 6:0.222425 7:0.807087 8:0.850394 9:0.582677 10:0.628607 11:1 30:1

2 1:0.630315 2:0.330556 3:0.106061 4:0.115963 5:0.236434 6:0.935788 7:0.917323 8:0.92126 9:0.523622 10:0.356336 11:1 43:1

run;

proc transpose data=vert out=horiz prefix=var;

  by row;

  id var ;

  var value ;

run;

data horiz;

  array var var0 - var&maxvar ;

  set horiz ;

  drop _name_;

run;

Super User
Posts: 10,020

Re: How to read in a text data file with special structure

Posted in reply to flyingsohigh

Or you can use the fourth input method - named input.

data _null_;
 infile 'c:\data.txt' length=len lrecl=2000 end=last;
 file 'c:\data_new.txt';
 retain max_num 0;
 input row $varying2000. len;
 max_num=max(max_num,input(scan(row,-2,' :'),best8.) );
 row='var0='||strip(tranwrd(strip(translate(row,'=',':')) ,' ',' var'));
 put row;
 if last then call symputx('max_num',max_num);
run;
%put &max_num;
data want;
 infile 'c:\data_new.txt' lrecl=2000;
 input (var0 - var&max_num) (=);
run;

Ksharp

Message was edited by: xia keshan

Respected Advisor
Posts: 3,799

Re: How to read in a text data file with special structure

Ksharp your idea to use named input is interesting but you made it so complicated. We both end up with lots of irrelevant variables :smileyshocked:

filename FT15F001 temp lrecl=1024;

options validvarname=any;

data test;

   infile FT15F001;

   input id @;

   _infile_ = translate(_infile_,'=',':');

   input ('1'n-'43'n)(=);

   rename '1'n-'43'n = F1-F43;

   ParmCards;

2 1:0.392696 2:0.347222 3:0.30303 4:0.0479599 5:0.255814 6:0.241534 7:0.980315 8:0.870079 9:0.354331 10:0.72048 11:1 43:1

2 1:0.377189 2:0.102778 3:0.0757576 4:0.154617 5:0.267442 6:0.213433 7:0.866142 8:0.901575 9:0.574803 10:0.671685 11:1 26:1

2 1:0.535268 2:0.988889 3:0.181818 5:0.223514 6:0.808908 7:0.791339 8:0.862205 9:0.610236 10:0.560017 11:1 38:1

2 1:0.454227 2:0.852778 3:0.181818 4:0.0214746 5:0.23385 6:0.392862 7:0.73622 8:0.917323 9:0.732283 10:0.863237 11:1 43:1

1 1:0.566283 2:0.136111 3:0.19697 4:0.0608447 5:0.223514 6:0.871013 7:0.885827 8:0.830709 9:0.46063 10:0.503694 11:1 38:1

2 1:0.54027 2:0.155556 3:0.272727 4:0.0680029 5:0.249354 6:0.78165 7:0.901575 8:0.787402 9:0.385827 10:0.588875 11:1 38:1

2 1:0.33917 2:0.0194444 3:0.181818 5:0.223514 6:0.222425 7:0.807087 8:0.850394 9:0.582677 10:0.628607 11:1 30:1

2 1:0.630315 2:0.330556 3:0.106061 4:0.115963 5:0.236434 6:0.935788 7:0.917323 8:0.92126 9:0.523622 10:0.356336 11:1 43:1

   run;

options validvarname=v7;

proc print;

   run;

Valued Guide
Posts: 2,177

Re: How to read in a text data file with special structure

Posted in reply to data_null__

data_null_;

for that kind of reason, as my data step prepares 100 of these variables, I suggested using the compress=yes option. With just 8 rows, there was no gain from compression, but if I repeat the data 100-fold the compression saves 27% (a lot less than I had hoped for)

Alternatively, it might be optimal to normalize to just the 4 columns  input_row, record_type (1/2), pointer( 1 ... 43), value,

data normalized ;

   infile "C:\temp\data.txt" lrecl=10000 dlm=' :' truncover column=c length=l ;

  input_row +1 ;

   input record_type @ ;

   do until( c >= l ) ; * when reached the end-of-line ;

      input pointer @ ;

      if not pointer then leave ; *just in case ;

      input value @ ;

      output ;

   end ;

run ;

with the sample data provided, the code generates just 94 obs with these 4 columns

peterC

Super User
Posts: 10,020

Re: How to read in a text data file with special structure

Posted in reply to data_null__

Null.

Yes. Your code is better and simpler. But You missed the most important thing, it is the number of variables inputed. In your code, you need to pre-define the number of variables , but how can you know the number of variables we should input  without browsing the whole file?.

So it is necessary to scan the whole file firstly to decide the number of  variables, just as my and Tom's code.

Tom's code is good ,but his is not suited for large file for the sake of proc transpose, think about it , data.txt has one million obs and each obs has 50 variables, then the dataset after Tom's code processing will has fifty million obs ,then use proc transpose. That will be a nightmare.

Regards.

Ksharp

Smiley Happy

Respected Advisor
Posts: 3,799

Re: How to read in a text data file with special structure

You miss the point of my program.  I was not addressing the number of variables read just the issue of how to properly use NAMED input. You don't need to rewrite the entire file and make it bigger in the process and you have done, that will be a nightmare.  To address the variable name issue we just need to read the file and create an array.  Consider this program.

filename FT15F001 temp lrecl=1024;

data info(index=(num/unique));

   do num=1 to 11;

      output;

      end;

   run;

data info;

   if 0 then modify info;

   infile FT15F001;

   input;

   num = input(scan(_infile_,-2,' :'),F8.);

   output;

   _error_ = 0;

   ParmCards;

2 1:0.392696 2:0.347222 3:0.30303 4:0.0479599 5:0.255814 6:0.241534 7:0.980315 8:0.870079 9:0.354331 10:0.72048 11:1 43:1

2 1:0.377189 2:0.102778 3:0.0757576 4:0.154617 5:0.267442 6:0.213433 7:0.866142 8:0.901575 9:0.574803 10:0.671685 11:1 26:1

2 1:0.535268 2:0.988889 3:0.181818 5:0.223514 6:0.808908 7:0.791339 8:0.862205 9:0.610236 10:0.560017 11:1 38:1

2 1:0.454227 2:0.852778 3:0.181818 4:0.0214746 5:0.23385 6:0.392862 7:0.73622 8:0.917323 9:0.732283 10:0.863237 11:1 43:1

1 1:0.566283 2:0.136111 3:0.19697 4:0.0608447 5:0.223514 6:0.871013 7:0.885827 8:0.830709 9:0.46063 10:0.503694 11:1 38:1

2 1:0.54027 2:0.155556 3:0.272727 4:0.0680029 5:0.249354 6:0.78165 7:0.901575 8:0.787402 9:0.385827 10:0.588875 11:1 38:1

2 1:0.33917 2:0.0194444 3:0.181818 5:0.223514 6:0.222425 7:0.807087 8:0.850394 9:0.582677 10:0.628607 11:1 30:1

2 1:0.630315 2:0.330556 3:0.106061 4:0.115963 5:0.236434 6:0.935788 7:0.917323 8:0.92126 9:0.523622 10:0.356336 11:1 43:1

   run;

options validvarname=any;

proc transpose data=info out=array(drop=_name_);

   var num;

   id num;

   run;

data test;

   if 0 then set array;

   array _v

  • _numeric_;

       infile FT15F001;

       call missing(of _all_);

       input id @;

       _infile_ = translate(_infile_,'=',':');

       input (_v

  • )(=);
  •    run;

    proc print;

       run;

    ** Rename;

    filename FT16F001 temp;

    data _null_;

       file FT16F001;

       set info;

       length namelit $35 newname $32;

       namelit = nliteral(strip(input(num,16.)));

       newname = cats('F',num);

       put +9 namelit '='  newname;

       run;

    proc datasets;

       modify test;

       rename %inc FT16F001/source2; ;

       run;

       quit;

    options validvarname=v7;

    proc print data=test;

       run;

    Ask a Question
    Discussion stats
    • 8 replies
    • 251 views
    • 1 like
    • 6 in conversation