Re: How to read in a text data file with special structure

flyingsohigh · Posted 08-31-2011 04:06 PM

Hey, all

I am trying to read in some data as attached. The data structure is using “1:” or “2:”, etc. to index columns. I used PROC IMPORT and the code I used is as below. However, the code cannot put the missings into the right columns when reading it. For example, on the 3^rd row, ‘5:0.223514’ should be read into the 6^th column instead of the 5^th. Could you tell me if there is a way for me to handle such case?

Thanks.

proc import datafile="\\C:\data.txt"

missover dbms=dlm out=work.a replace;

getnames=no;

delimiter=' ';

guessingrows=1000;

run;

art297 · Posted 08-31-2011 04:35 PM

Your proc import contains a missover statement/option which isn't an acceptable option that I'm familiar with.

However, more importantly, there is nothing in your data file that indicates missing values. There is simply a single space between all values. If there were periods with one leading space before after the period to indicate that a value was missing, proc import works quite nicely.

Peter_C · Posted 08-31-2011 04:36 PM

import is not effective for this kind of structure

A data step like

data special( compress=yes);

infile "C:\temp\data.txt" lrecl=10000 dlm=' :' truncover column=c length=l ;

input record_type @ ;

array col(100) ; * allows up to 100 items in the data ;

do until( c >= l ) ; * when reached the end-of-line ;

input pointer @ ;

if not pointer then leave ; *just in case ;

input col(pointer) @ ;

end ;

run ;

Tom · Posted 08-31-2011 06:34 PM

That is not that hard to read using a data step. Read it into variable/value pairs. Here is a program that can read your data.

Using delimiter of both a space and a colon will make it easy to read the variable number and its value without having to worry about any string manipulation.

Not sure what that first value represents, so I just named it variable zero. I added ROW variable uniquely identify the rows.

Then you can use PROC TRANSPOSE to convert it back to a rectagular format.

The last step will both order the variables in the data vector and also creates all of the variables not actually named.

The example data seems to mainly be variable 1 to 11 but there is also others such as 26 and 43.

data vert ;

retain maxvar 0;

infile cards truncover dlm=' :';

length var value 8 ;

input value @ ;

row +1 ;

var=0; output;

do until (var= .);

input var value @;

if var ne . then output;

if var > maxvar then do ;

maxvar=var;

call symputx('MAXVAR',maxvar);

end;

cards;

2 1:0.392696 2:0.347222 3:0.30303 4:0.0479599 5:0.255814 6:0.241534 7:0.980315 8:0.870079 9:0.354331 10:0.72048 11:1 43:1

2 1:0.377189 2:0.102778 3:0.0757576 4:0.154617 5:0.267442 6:0.213433 7:0.866142 8:0.901575 9:0.574803 10:0.671685 11:1 26:1

2 1:0.535268 2:0.988889 3:0.181818 5:0.223514 6:0.808908 7:0.791339 8:0.862205 9:0.610236 10:0.560017 11:1 38:1

2 1:0.454227 2:0.852778 3:0.181818 4:0.0214746 5:0.23385 6:0.392862 7:0.73622 8:0.917323 9:0.732283 10:0.863237 11:1 43:1

1 1:0.566283 2:0.136111 3:0.19697 4:0.0608447 5:0.223514 6:0.871013 7:0.885827 8:0.830709 9:0.46063 10:0.503694 11:1 38:1

2 1:0.54027 2:0.155556 3:0.272727 4:0.0680029 5:0.249354 6:0.78165 7:0.901575 8:0.787402 9:0.385827 10:0.588875 11:1 38:1

2 1:0.33917 2:0.0194444 3:0.181818 5:0.223514 6:0.222425 7:0.807087 8:0.850394 9:0.582677 10:0.628607 11:1 30:1

2 1:0.630315 2:0.330556 3:0.106061 4:0.115963 5:0.236434 6:0.935788 7:0.917323 8:0.92126 9:0.523622 10:0.356336 11:1 43:1

run;

proc transpose data=vert out=horiz prefix=var;

by row;

id var ;

var value ;

run;

data horiz;

array var var0 - var&maxvar ;

set horiz ;

drop _name_;

run;

Ksharp · Posted 09-01-2011 04:21 AM

Or you can use the fourth input method - named input.

data _null_;
 infile 'c:\data.txt' length=len lrecl=2000 end=last;
 file 'c:\data_new.txt';
 retain max_num 0;
 input row $varying2000. len;
 max_num=max(max_num,input(scan(row,-2,' :'),best8.) );
 row='var0='||strip(tranwrd(strip(translate(row,'=',':')) ,' ',' var'));
 put row;
 if last then call symputx('max_num',max_num);
run;
%put &max_num;
data want;
 infile 'c:\data_new.txt' lrecl=2000;
 input (var0 - var&max_num) (=);
run;

Ksharp

Message was edited by: xia keshan

data_null__ · Posted 09-01-2011 09:24 AM

Ksharp your idea to use named input is interesting but you made it so complicated. We both end up with lots of irrelevant variables :smileyshocked:

filename FT15F001 temp lrecl=1024;

options validvarname=any;

data test;

infile FT15F001;

input id @;

_infile_ = translate(_infile_,'=',':');

input ('1'n-'43'n)(=);

rename '1'n-'43'n = F1-F43;

ParmCards;