BookmarkSubscribeRSS Feed
flyingsohigh
Calcite | Level 5

Hey, all

I am trying to read in some data as attached. The data structure is using “1:” or “2:”, etc. to index columns. I used PROC IMPORT and the code I used is as below. However, the code cannot put the missings into the right columns when reading it. For example, on the 3rd row, ‘5:0.223514’ should be read into the 6th column instead of the 5th. Could you tell me if there is a way for me to handle such case?

Thanks.

proc import datafile="\\C:\data.txt"

      missover dbms=dlm out=work.a replace;

      getnames=no;

      delimiter=' ';

      guessingrows=1000;

run;

8 REPLIES 8
art297
Opal | Level 21

Your proc import contains a missover statement/option which isn't an acceptable option that I'm familiar with.

However, more importantly, there is nothing in your data file that indicates missing values.  There is simply a single space between all values.  If there were periods with one leading space before after the period to indicate that a value was missing, proc import works quite nicely.

Peter_C
Rhodochrosite | Level 12

import is not effective for this kind of structure

A data step  like

data special( compress=yes);

  infile  "C:\temp\data.txt"  lrecl=10000 dlm=' :' truncover column=c length=l ;

  input record_type @ ;

  array col(100) ;    * allows up to 100 items in the data ;

  do until( c >= l ) ; * when reached the end-of-line ;

     input     pointer @ ; 

     if not    pointer then leave ; *just in case ;

     input col(pointer) @ ;

  end ;

run ;

Tom
Super User Tom
Super User

That is not that hard to read using a data step.  Read it into variable/value pairs. Here is a program that can read your data. 

Using delimiter of both a space and a colon will make it easy to read the variable number and its value without having to worry about any string manipulation.

Not sure what that first value represents, so I just named it variable zero.  I added ROW variable uniquely identify the rows.

Then you can use PROC TRANSPOSE to convert it back to a rectagular format.

The last step will both order the variables in the data vector and also creates all of the variables not actually named. 

The example data seems to mainly be variable 1 to 11 but there is also others such as 26 and 43.

data vert ;

retain maxvar 0;

infile cards truncover dlm=' :';

length var value 8 ;

input value @ ;

row +1 ;

var=0; output;

do until (var= .);

   input var value @;

   if var ne . then output;

   if var > maxvar then do ;

      maxvar=var;

      call symputx('MAXVAR',maxvar);

   end;

end;

cards;

2 1:0.392696 2:0.347222 3:0.30303 4:0.0479599 5:0.255814 6:0.241534 7:0.980315 8:0.870079 9:0.354331 10:0.72048 11:1 43:1

2 1:0.377189 2:0.102778 3:0.0757576 4:0.154617 5:0.267442 6:0.213433 7:0.866142 8:0.901575 9:0.574803 10:0.671685 11:1 26:1

2 1:0.535268 2:0.988889 3:0.181818 5:0.223514 6:0.808908 7:0.791339 8:0.862205 9:0.610236 10:0.560017 11:1 38:1

2 1:0.454227 2:0.852778 3:0.181818 4:0.0214746 5:0.23385 6:0.392862 7:0.73622 8:0.917323 9:0.732283 10:0.863237 11:1 43:1

1 1:0.566283 2:0.136111 3:0.19697 4:0.0608447 5:0.223514 6:0.871013 7:0.885827 8:0.830709 9:0.46063 10:0.503694 11:1 38:1

2 1:0.54027 2:0.155556 3:0.272727 4:0.0680029 5:0.249354 6:0.78165 7:0.901575 8:0.787402 9:0.385827 10:0.588875 11:1 38:1

2 1:0.33917 2:0.0194444 3:0.181818 5:0.223514 6:0.222425 7:0.807087 8:0.850394 9:0.582677 10:0.628607 11:1 30:1

2 1:0.630315 2:0.330556 3:0.106061 4:0.115963 5:0.236434 6:0.935788 7:0.917323 8:0.92126 9:0.523622 10:0.356336 11:1 43:1

run;

proc transpose data=vert out=horiz prefix=var;

  by row;

  id var ;

  var value ;

run;

data horiz;

  array var var0 - var&maxvar ;

  set horiz ;

  drop _name_;

run;

Ksharp
Super User

Or you can use the fourth input method - named input.

data _null_;
 infile 'c:\data.txt' length=len lrecl=2000 end=last;
 file 'c:\data_new.txt';
 retain max_num 0;
 input row $varying2000. len;
 max_num=max(max_num,input(scan(row,-2,' :'),best8.) );
 row='var0='||strip(tranwrd(strip(translate(row,'=',':')) ,' ',' var'));
 put row;
 if last then call symputx('max_num',max_num);
run;
%put &max_num;
data want;
 infile 'c:\data_new.txt' lrecl=2000;
 input (var0 - var&max_num) (=);
run;

Ksharp

Message was edited by: xia keshan

data_null__
Jade | Level 19

Ksharp your idea to use named input is interesting but you made it so complicated. We both end up with lots of irrelevant variables :smileyshocked:

filename FT15F001 temp lrecl=1024;

options validvarname=any;

data test;

   infile FT15F001;

   input id @;

   _infile_ = translate(_infile_,'=',':');

   input ('1'n-'43'n)(=);

   rename '1'n-'43'n = F1-F43;

   ParmCards;

2 1:0.392696 2:0.347222 3:0.30303 4:0.0479599 5:0.255814 6:0.241534 7:0.980315 8:0.870079 9:0.354331 10:0.72048 11:1 43:1

2 1:0.377189 2:0.102778 3:0.0757576 4:0.154617 5:0.267442 6:0.213433 7:0.866142 8:0.901575 9:0.574803 10:0.671685 11:1 26:1

2 1:0.535268 2:0.988889 3:0.181818 5:0.223514 6:0.808908 7:0.791339 8:0.862205 9:0.610236 10:0.560017 11:1 38:1

2 1:0.454227 2:0.852778 3:0.181818 4:0.0214746 5:0.23385 6:0.392862 7:0.73622 8:0.917323 9:0.732283 10:0.863237 11:1 43:1

1 1:0.566283 2:0.136111 3:0.19697 4:0.0608447 5:0.223514 6:0.871013 7:0.885827 8:0.830709 9:0.46063 10:0.503694 11:1 38:1

2 1:0.54027 2:0.155556 3:0.272727 4:0.0680029 5:0.249354 6:0.78165 7:0.901575 8:0.787402 9:0.385827 10:0.588875 11:1 38:1

2 1:0.33917 2:0.0194444 3:0.181818 5:0.223514 6:0.222425 7:0.807087 8:0.850394 9:0.582677 10:0.628607 11:1 30:1

2 1:0.630315 2:0.330556 3:0.106061 4:0.115963 5:0.236434 6:0.935788 7:0.917323 8:0.92126 9:0.523622 10:0.356336 11:1 43:1

   run;

options validvarname=v7;

proc print;

   run;

Peter_C
Rhodochrosite | Level 12

data_null_;

for that kind of reason, as my data step prepares 100 of these variables, I suggested using the compress=yes option. With just 8 rows, there was no gain from compression, but if I repeat the data 100-fold the compression saves 27% (a lot less than I had hoped for)

Alternatively, it might be optimal to normalize to just the 4 columns  input_row, record_type (1/2), pointer( 1 ... 43), value,

data normalized ;

   infile "C:\temp\data.txt" lrecl=10000 dlm=' :' truncover column=c length=l ;

  input_row +1 ;

   input record_type @ ;

   do until( c >= l ) ; * when reached the end-of-line ;

      input pointer @ ;

      if not pointer then leave ; *just in case ;

      input value @ ;

      output ;

   end ;

run ;

with the sample data provided, the code generates just 94 obs with these 4 columns

peterC

Ksharp
Super User

Null.

Yes. Your code is better and simpler. But You missed the most important thing, it is the number of variables inputed. In your code, you need to pre-define the number of variables , but how can you know the number of variables we should input  without browsing the whole file?.

So it is necessary to scan the whole file firstly to decide the number of  variables, just as my and Tom's code.

Tom's code is good ,but his is not suited for large file for the sake of proc transpose, think about it , data.txt has one million obs and each obs has 50 variables, then the dataset after Tom's code processing will has fifty million obs ,then use proc transpose. That will be a nightmare.

Regards.

Ksharp

Smiley Happy

data_null__
Jade | Level 19

You miss the point of my program.  I was not addressing the number of variables read just the issue of how to properly use NAMED input. You don't need to rewrite the entire file and make it bigger in the process and you have done, that will be a nightmare.  To address the variable name issue we just need to read the file and create an array.  Consider this program.

filename FT15F001 temp lrecl=1024;

data info(index=(num/unique));

   do num=1 to 11;

      output;

      end;

   run;

data info;

   if 0 then modify info;

   infile FT15F001;

   input;

   num = input(scan(_infile_,-2,' :'),F8.);

   output;

   _error_ = 0;

   ParmCards;

2 1:0.392696 2:0.347222 3:0.30303 4:0.0479599 5:0.255814 6:0.241534 7:0.980315 8:0.870079 9:0.354331 10:0.72048 11:1 43:1

2 1:0.377189 2:0.102778 3:0.0757576 4:0.154617 5:0.267442 6:0.213433 7:0.866142 8:0.901575 9:0.574803 10:0.671685 11:1 26:1

2 1:0.535268 2:0.988889 3:0.181818 5:0.223514 6:0.808908 7:0.791339 8:0.862205 9:0.610236 10:0.560017 11:1 38:1

2 1:0.454227 2:0.852778 3:0.181818 4:0.0214746 5:0.23385 6:0.392862 7:0.73622 8:0.917323 9:0.732283 10:0.863237 11:1 43:1

1 1:0.566283 2:0.136111 3:0.19697 4:0.0608447 5:0.223514 6:0.871013 7:0.885827 8:0.830709 9:0.46063 10:0.503694 11:1 38:1

2 1:0.54027 2:0.155556 3:0.272727 4:0.0680029 5:0.249354 6:0.78165 7:0.901575 8:0.787402 9:0.385827 10:0.588875 11:1 38:1

2 1:0.33917 2:0.0194444 3:0.181818 5:0.223514 6:0.222425 7:0.807087 8:0.850394 9:0.582677 10:0.628607 11:1 30:1

2 1:0.630315 2:0.330556 3:0.106061 4:0.115963 5:0.236434 6:0.935788 7:0.917323 8:0.92126 9:0.523622 10:0.356336 11:1 43:1

   run;

options validvarname=any;

proc transpose data=info out=array(drop=_name_);

   var num;

   id num;

   run;

data test;

   if 0 then set array;

   array _v

  • _numeric_;

       infile FT15F001;

       call missing(of _all_);

       input id @;

       _infile_ = translate(_infile_,'=',':');

       input (_v

  • )(=);
  •    run;

    proc print;

       run;

    ** Rename;

    filename FT16F001 temp;

    data _null_;

       file FT16F001;

       set info;

       length namelit $35 newname $32;

       namelit = nliteral(strip(input(num,16.)));

       newname = cats('F',num);

       put +9 namelit '='  newname;

       run;

    proc datasets;

       modify test;

       rename %inc FT16F001/source2; ;

       run;

       quit;

    options validvarname=v7;

    proc print data=test;

       run;

    sas-innovate-2024.png

    Today is the last day to save with the early bird rate! Register today for just $695 - $100 off the standard rate.

     

    Plus, pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

     

    Register now!

    How to Concatenate Values

    Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 8 replies
    • 1616 views
    • 1 like
    • 6 in conversation