Solved: Re: Proc Transpose Non-Systematic Data

JAR · Posted 05-28-2022 10:51 AM

Hi,
I have a subtitle file, which I want to convert into data:

As a few entries have multiple lines, I would appreciate some help.
Thanks in advance,

Jijil Ramakrishnan

ballardw · Posted 05-28-2022 12:08 PM

First, a comment about your Want data set: you really do want to repeat the Index, and likely the start and end time values, on each observation after reading the data. Otherwise any SORT is going to leave the text "orphaned" from its index value for any of the multi-line entries.

In the example below, the commas in the time values have been changed to periods, as I am not going to mess with my system settings to deal with foreign-language conventions. The code should work with your data if you point an INFILE statement at your text file. Alternatively, copy some of your example text and replace the datalines I used for testing.

data example;
  informat index 8. starttime endtime time15. text $100.;
  format starttime endtime time12.3;
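  retain index starttime endtime;     /* carry these onto every text line of a group */
  input @;                            /* load a line and hold it for inspection */
  if input(_infile_,?? 8.) then do;   /* a valid number: treat the line as the index */
      index =input(_infile_,8.);
      input;                          /* release the held index line */
      input starttime text endtime;   /* TEXT temporarily absorbs the --> */
      input ;                         /* load the first text line */
      text=_infile_;
      output;
  end;
  else if anyalpha(_infile_) then do; /* a further text line of the same group */
      text=_infile_;
      output;
      input;
  end;
  else input;                         /* blank separator line: release it */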

datalines;
1
00:00:49.260 --> 00:00:50.327
Gerald Tate's here.

2
00:00:50.395 --> 00:00:51.729
He wants to know         
what's happening to his deal.

3
00:00:51.797 --> 00:00:53.264
Go get Harvey.

4
00:00:54.793 --> 00:00:58.793
== sync, corrected by <font color="#00ff00">elderman</font> ==
;

If you haven't used the trailing @ on INPUT before: it holds the input buffer so the line can be examined in the SAS automatic variable _infile_. The code checks whether a line is a valid number and, if so, assumes it is the index value and then reads the time values. The TEXT variable is reused to read the --> rather than dealing with fancier parsing of that line. That branch also assumes one line of text follows the times; any further text lines are picked up by the ANYALPHA branch on a later pass.

The check for the number value using the input function includes the ?? to suppress invalid data messages that would occur with the second text line in those sets such as on index=2.
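A minimal sketch of the two ideas above, the trailing @ and the ?? modifier, taken in isolation. The data set name PEEK and the variables LINE and LOOKS_NUMERIC are made up for illustration and are not part of the code above.

```sas
/* Sketch only: PEEK, LINE and LOOKS_NUMERIC are illustrative names. */
data peek;
  length line $80;
  input @;                    /* load the next line and hold it */
  /* ?? suppresses the invalid-data note and keeps _ERROR_ at 0 */
  looks_numeric = not missing(input(_infile_, ?? 8.));
  line = _infile_;            /* copy of the held line */
  input;                      /* release the held line */
datalines;
2
He wants to know
what's happening to his deal.
;
run;

proc print data=peek;
run;
```

The step writes one observation per line, with LOOKS_NUMERIC flagging whether the first eight columns read as a number. Without the ??, each text line would add an invalid-data note to the log.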

JAR · Posted 05-28-2022 12:26 PM

@ballardw
You are right, the Index must increment for every line of text.
How can I change the "2" into "3"?

Please advise (or correct the code).
Thank you!
Jijil

ballardw · Posted 05-28-2022 09:18 PM

I would say that you do NOT want to change that "Index". Why? The second and any subsequent lines of text belong in the same "group".

This might be one of the times where combining different records so all that text is a single value makes sense.

JAR · Posted 06-02-2022 01:25 AM

True!

Tom · Posted 05-28-2022 12:44 PM

Seems pretty simple as the structure looks very regular (systematic actually).

Just keep reading comment lines until you hit an empty line.

data want;
  infile have truncover end=eof;
  length id 8 start end $12 nrows 8 comment $300 ;
  input id / (start 2*end) (:) / ;
  do nrows=1 by 1 until(_infile_=' ' or eof);
    comment=catx('|',comment,_infile_);
    input;
  end;
run;

Result:

options nocenter ls=132 ps=60;

proc report data=want split='|' headline;
  column id start end nrows comment;
  define comment / width=80 flow;
run;

JAR · Posted 06-02-2022 01:24 AM

Thank you! This is so concise and efficient!

Ksharp · Posted 05-29-2022 06:09 AM

data have;
infile "c:\temp\subtitle.txt" encoding='utf-8' termstr=crlf length=len;
input temp $varying200. len;
if missing(temp) then do;group+1;delete;end;
run;

data have2;
 set have;
 by group;
 if first.group then n=0;
 n+1;
 if n>2 then n=3;
run;

data have3;
do until(last.n);
 set have2;
 by group n;
 length want $ 200;
 want=catx(' ',want,temp);
end;
drop temp;
run;

proc transpose data=have3 out=have4 prefix=var;
by group;
id n;
var want;
run;

data want;
 set have4(rename=(var1=index var3=text));
 start_time=scan(var2,1,'-> ');
 end_time=scan(var2,-1,'-> ');
 drop group _name_ var2;
run;

JAR · Posted 06-02-2022 01:24 AM

Thank you, Ksharp!

Ksharp · Posted 06-02-2022 03:50 AM

Sorry, Tom,

I have to say your code loses the last record.

Tom · Posted 06-02-2022 10:26 AM

That just needs a trivial change to avoid reading past the end of the file.

data want;
  infile have truncover end=eof;
  length id 8 start end $12 nrows 8 comment $300 ;
  input id / (start 2*end) (:) ;
  if eof then _infile_=' ';
  else input ;
  do nrows=1 by 1 while(_infile_ ne ' ');
    comment=catx('|',comment,_infile_);
    if eof then _infile_=' ';
    else input;
  end;
run;
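To reproduce the last-record behaviour, a small test file can be written to a temporary fileref. This is only a sketch: the fileref name HAVE matches the INFILE statement above, the two entries are sample text with comma-style time values, and the file deliberately ends on a text line with no trailing blank line.

```sas
/* Sketch only: build a two-entry subtitle file on a temporary fileref. */
filename have temp;

data _null_;
  file have;
  put '1';
  put '00:00:49,260 --> 00:00:50,327';
  put "Gerald Tate's here.";
  put;                        /* blank separator line */
  put '2';
  put '00:00:50,395 --> 00:00:51,729';
  put 'He wants to know';
  put "what's happening to his deal.";  /* last line, no blank after it */
run;

/* Echo the file to the log to confirm what was written. */
data _null_;
  infile have;
  input;
  put _n_= _infile_;
run;
```

Running the DATA step against this file and checking whether WANT contains an observation for id 2 shows whether the final entry survives.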
