This topic is solved.
JAR
Obsidian | Level 7

Hi,
I have a subtitle file that I want to convert into a SAS data set:

[Screenshots: JAR_1-1653749336257.png, JAR_2-1653749382904.png]

Because a few entries span multiple lines, I need some help.
Thanks in advance,

Jijil Ramakrishnan

 

ACCEPTED SOLUTION

Tom
Super User

This seems pretty simple, as the structure looks very regular (systematic, actually).

Just keep reading comment lines until you hit an empty line.

data want;
  infile have truncover end=eof;
  length id 8 start end $12 nrows 8 comment $300 ;
  input id / (start 2*end) (:) / ;
  do nrows=1 by 1 until(_infile_=' ' or eof);
    comment=catx('|',comment,_infile_);
    input;
  end;
run;

Result:

options nocenter ls=132 ps=60;

proc report data=want split='|' headline;
  column id start end nrows comment;
  define comment / width=80 flow;
run;

[Screenshot of the PROC REPORT output: Tom_0-1653756272160.png]
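
To try the DATA step above without the original subtitle file, a small test file can be written first. This is only a minimal sketch: the temporary fileref `have` and the sample lines (adapted from the test data posted elsewhere in this thread, with the comma restored before the milliseconds) are assumptions, not the poster's actual file. Run it before the `data want` step.

```sas
/* Assumption: a temporary fileref named HAVE stands in for the real subtitle file. */
filename have temp;

data _null_;
  file have;
  put '1';
  put '00:00:49,260 --> 00:00:50,327';
  put "Gerald Tate's here.";
  put;                                  /* a null PUT writes an empty separator line */
  put '2';
  put '00:00:50,395 --> 00:00:51,729';
  put 'He wants to know';
  put "what's happening to his deal.";  /* second text line of the same entry */
run;
```

Whether the last entry survives depends on how the file ends, a point raised later in this thread.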


10 REPLIES
ballardw
Super User

First, a comment about your WANT data set: you really do want to repeat the index, and likely the start and end time values, on every observation after reading the data. Otherwise any sort will leave the text of the multi-line entries "orphaned" from the index it belongs to.

In the example below I changed the "," in the time values to "." because I am not going to change my system settings to deal with another locale's conventions. The code should work with your data once you point an INFILE statement at your text file. Alternatively, copy some of your example text in place of the datalines I used for testing.

data example;
  informat index 8. starttime endtime time15. text $100.;
  format starttime endtime time12.3;
  retain index starttime endtime;
  input @;
  if input(_infile_,?? 8.) then do;     
      index =input(_infile_,8.);
      input;
      input starttime text endtime;
      input ;
      text=_infile_;
      output;
  end;
  else if anyalpha(_infile_) then do;
      text=_infile_;
      output;
      input;
  end;
  else input;

datalines;
1
00:00:49.260 --> 00:00:50.327
Gerald Tate's here.

2
00:00:50.395 --> 00:00:51.729
He wants to know         
what's happening to his deal.

3
00:00:51.797 --> 00:00:53.264
Go get Harvey.

4
00:00:54.793 --> 00:00:58.793
== sync, corrected by <font color="#00ff00">elderman</font> ==
;

If you haven't used the trailing @ on INPUT before: it holds the input buffer so the line can be examined through the SAS automatic variable _infile_. The code checks whether a line is a valid number and, if so, assumes it is the index value and then reads the time values. The TEXT variable is reused to read the --> so that no fancier parsing of that line is needed. The code also assumes that the line right after the time values is a line of text.

The check for a numeric value with the INPUT function includes ?? to suppress the invalid-data messages that would otherwise occur for the second text line in entries such as index=2.
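
The trailing @ and the ?? modifier described above can be tried in isolation. The following is a minimal sketch, not part of the posted solution; the data set name `peek` and the variables `kind` and `line` are made up for illustration.

```sas
data peek;
  length kind $8 line $80;
  input @;                 /* load the next record and hold it in the buffer */
  line = _infile_;         /* the held record is visible in _INFILE_ */
  if missing(line) then kind = 'blank';
  /* ?? suppresses the invalid-data note when the line is not a number */
  else if not missing(input(line, ?? 8.)) then kind = 'number';
  else kind = 'text';
  input;                   /* release the held record */
datalines;
1
00:00:49.260 --> 00:00:50.327
Gerald Tate's here.
;
run;
```

Printing `peek` should show one observation per line of input, classified as a number or as text, with no invalid-data notes in the log.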

JAR
Obsidian | Level 7

@ballardw
You are right, the index must increment for every line of text.
How can I change the "2" into "3"?

[Screenshot: JAR_0-1653755178231.png]

Please advise (or correct the code).
Thank you!
Jijil

ballardw
Super User

I would say that you do NOT want to change that "Index". Why? The second and any subsequent lines of text belong to the same "group".

This might be one of the times where combining the different records, so that all of the text is a single value, makes sense.

JAR
Obsidian | Level 7

True!

Tom
Super User

(This reply is the accepted solution shown in full above.)

JAR
Obsidian | Level 7

Thank you! This is so concise and efficient!

Ksharp
Super User

data have;
infile "c:\temp\subtitle.txt" encoding='utf-8' termstr=crlf length=len;
input temp $varying200. len;
if missing(temp) then do;group+1;delete;end;
run;

data have2;
 set have;
 by group;
 if first.group then n=0;
 n+1;
 if n>2 then n=3;
run;

data have3;
do until(last.n);
 set have2;
 by group n;
 length want $ 200;
 want=catx(' ',want,temp);
end;
drop temp;
run;

proc transpose data=have3 out=have4 prefix=var;
by group;
id n;
var want;
run;

data want;
 set have4(rename=(var1=index var3=text));
 start_time=scan(var2,1,'-> ');
 end_time=scan(var2,-1,'-> ');
 drop group _name_ var2;
run;

JAR
Obsidian | Level 7

Thank you, Ksharp!

Ksharp
Super User

Sorry, Tom,

I have to say that your code loses the last record.

[Screenshot: Ksharp_0-1654156203938.png]


Tom
Super User

That just needs a trivial change to avoid reading past the end of the file.

data want;
  infile have truncover end=eof;
  length id 8 start end $12 nrows 8 comment $300 ;
  input id / (start 2*end) (:) ;
  if eof then _infile_=' ';
  else input ;
  do nrows=1 by 1 while(_infile_ ne ' ');
    comment=catx('|',comment,_infile_);
    if eof then _infile_=' ';
    else input;
  end;
run;
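
The step above keeps START and END as character values. If numeric SAS time values are needed, one possible follow-up is sketched below. It assumes the file uses a comma before the milliseconds, as mentioned earlier in the thread; the data set name `want_times` and the variables `start_time` and `end_time` are made up for illustration.

```sas
data want_times;
  set want;
  /* TRANSLATE(source, to, from): swap the comma for a period so TIME12. can read it */
  start_time = input(translate(start, '.', ','), time12.);
  end_time   = input(translate(end,   '.', ','), time12.);
  format start_time end_time time12.3;   /* display with milliseconds */
run;
```

Values that already use a period are left unchanged by TRANSLATE, so the same step should also work on data edited that way.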
