Yura2301
Quartz | Level 8

Hi,

I need to read data from a text file that can be fairly large (for example 200 KB), where the delimiter is a multi-character string (for example '<test>').

So if the file contains the text "111<test>22222 3333<test>444<test>", the result should be a one-column table with the data:

111

22222 3333

444

I use SAS 9.1.3, and the DLMSTR= option isn't available in this version. Is there an efficient way to read such files and create a one-column table?

Thanks!

1 ACCEPTED SOLUTION
Ksharp
Super User

Here is a way.

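/* Step 1: read the file as a byte stream, one character per observation. */
data x;
  infile 'c:\x.txt' recfm=n;
  input x $char1. @@;
run;

/* Step 2: start a new group whenever the previous six characters spell the delimiter. */
data temp;
  set x;
  if cat(lag6(x),lag5(x),lag4(x),lag3(x),lag2(x),lag1(x))='<test>' then group+1;
run;

/* Step 3: turn each group's characters into one row of COL1, COL2, ... */
proc transpose data=temp out=want(keep=col:);
  by group;
  var x;
run;

/* Step 4: glue the characters back together and blank out the trailing delimiter. */
data want(keep=want);
  set want;
  length want $32767;  /* without this, WANT defaults to $200 and longer fields are truncated */
  want=tranwrd(cat(of col:),'<test>',' ');
run;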

Ksharp

Message edited by: xia keshan

8 REPLIES
mkeintz
PROC Star

Yura:

If you were on a UNIX system, I would declare a FILENAME statement with the PIPE device type that reads this data in through awk, sed, or similar to change every "<test>" to, say, "!" (or any other character not in the data). Then you could use dlm='!' on an INFILE statement.
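A minimal sketch of that pipe approach, using the thread's '<test>' delimiter. It assumes a UNIX host where SAS is allowed to run shell commands (XCMD enabled) and sed is installed. The fileref src, the path /path/to/data.txt, the $200 field length, and the LRECL value are placeholders rather than anything from the original post:

```sas
/* Sketch only: the fileref, path, field length and LRECL are assumptions. */
/* sed rewrites every '<test>' to '!' before SAS sees the data.            */
filename src pipe "sed 's/<test>/!/g' /path/to/data.txt";

data want;
  /* sed still emits the file as one long line, so LRECL is set high here */
  /* on the assumption that the host accepts a value this large.          */
  infile src dlm='!' lrecl=1000000;
  length field $200;
  input field @@;   /* one field per observation; blanks are not delimiters */
run;
```

Because '!' is the only delimiter, a value such as "22222 3333" stays in one field. Choose a replacement character that cannot occur in the data.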

Absent that, try:

data want (keep=field);

   infile 'c:\x.txt' lrecl=32767;                 ** Assumed file name, the original post had no INFILE statement **;
   input ;                                        ** Fill the _INFILE_ automatic var **;
   length text $32767  field $40;
   text=tranwrd(_infile_,"<test>",'!');           ** Make a single-character delimiter **;

   do w=1 by 1 while (scan(text,w,'!') ^= ' ');
      field=scan(text,w,'!');                     ** Use the delimiter with a SCAN function **;
      output;
   end;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
FriedEgg
SAS Employee

data test;
  length x $ 10;
  infile cards dlm='2c'x;
  input @;
  _infile_=prxchange('s/\<test\>/,/',-1,_infile_); *alter the input buffer to change dlmstr to dlm;
  input x @@;
  if ^missing(x) then output;
cards;
111<test>22222 3333<test>444<test>
;
run;

111

22222 3333

444

Yura2301
Quartz | Level 8

Hi Fried,

Thanks for your answer, but it looks like your code also works correctly only if a line of the file is shorter than 32767 characters. I tried it and it works fine on small files, but my file is longer than 32767 characters and already contains '2c'x (comma) characters, so I used another delimiter that doesn't occur in the file. Even so, it doesn't seem to work on files larger than 32767 characters.

Thanks!

Yura2301
Quartz | Level 8

filename _infile_ "&Path\data.txt";

data readFromFile;

      infile _infile_ lrecl=32767;

      input;

      length text $32767 field $32767;

      text=tranwrd(_infile_,'<test>','~');

      do w=1 by 1 while(scan(text,w,'~')^='');

            field=scan(text,w,'~');

            lenf=length(field);

            output;

      end;

run;

      data test2;

            infile _infile_ dsd lrecl=1000000 pad;

            input txt1 : $32767. @@;

            row=_n_;

      run;

The resulting table "test2" will have many rows, depending on the file size, special characters in the data, and so on,

and then I can work with this "test2" table (scan, substr, merging strings, etc.) to achieve the needed result, but I'm not sure this is the optimal solution in my case.

Is there perhaps some option that allows SAS string functions to work with strings longer than 32767 characters?


Thanks!


Yura2301
Quartz | Level 8

Hi Ksharp,

I got the idea. I didn't use all of your code, just the part up to the transpose, plus some simple character concatenation, and in the end I achieved what I needed.

So thanks!

mkeintz
PROC Star

Here's a technique (untested) that might simplify the programming. It's meant to work as long as none of your fields contains a '<' character. The trick here is using the "@ 'est>'" pointer control in the INPUT statement.


I've modified this note to account for the fact that the first field in each line is not preceded by '<test>'.

data ;
  infile ..... dlm='<'  lrecl=1000000  length=len column=col;

  /* COL above is the column pointer after the most recent INPUT statement */

  length field $200;

  input field @;

  do while (col<len); 

    output;

    input @ 'est>' field @;

  end;

  output;

run;

If the infile is a single long line, then you can simplify to

data ;
  infile ..... dlm='<'  lrecl=1000000  ;

  length field $200;

  if _n_=1 then input field @@;

  else input @ 'est>' field @@;

run;

The first example uses a single trailing "@", which tells SAS to hold the current input line only until the end of the current DATA step iteration and then release it (thereby removing the "lost card" message of an earlier version that used a double "@@"). The second example uses a double trailing "@@", which tells SAS NOT to release the input line at the end of the iteration.
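To see the two line-hold specifiers in isolation, here is a small self-contained sketch that uses in-stream data rather than the thread's file. The data set and variable names are made up for illustration:

```sas
/* Double trailing @@: the line is held across DATA step iterations, */
/* so every value on a line becomes its own observation.             */
data double_hold;
  input a @@;
  datalines;
1 2 3
4 5
;
run;

/* Single trailing @: the line is held only within the current       */
/* iteration, so a later INPUT in the same iteration continues on it */
/* and the line is released when the iteration ends.                 */
data single_hold;
  input a @;
  input b;
  datalines;
1 2
3 4
;
run;
```

With this data, double_hold should end up with five observations (1 through 5) and single_hold with two (a=1, b=2 and a=3, b=4).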

FriedEgg
SAS Employee

Gave this a bit more thought. I'm not incredibly pleased with the following, but it appears to get the job done. I tested with a file of several MB of data, all on a single line.

Process flow:

1) Read in a binary stream from the file 'in', 256 bytes at a time.

2) Search in a loop for delimited strings and substring them out until reaching the end of the stream.

3) Concatenate the remainder from the previous pass that did not end in a dlmstr, and repeat.

data test;

length infile buffer $ 512;

if _n_=1 then do;

  dlmstr='<test>';

  _prx=prxparse( '/(' || dlmstr || ')|(.)/' );

  retain _prx dlmstr;

  call missing(buffer);

end;

else if n>0 then buffer=substr(infile,length(infile)+1-n);

infile in recfm=n lrecl=256;

input infile $256.;

infile=strip(buffer) || infile;

start=1;

stop=length(infile);

n=0;

retain n infile;

call prxnext(_prx,start,stop,infile,pos,len);

do while(pos > 0);

  if len=length(dlmstr) then do;

   x=substr(infile,pos-n,n);

   n=0;

   output;

  end;

  else n++1;

  call prxnext(_prx,start,stop,infile,pos,len);

end;

keep x;

run;
