A question about combining lines in a text file.
The file contents are shown below (movie subtitles). I want to combine the English words and sentences in each subtitle block into one line, instead of having them spread over 1, 2, or 3 separate lines.
Could you please advise which method is feasible in Python? Many thanks.
00:00:23,343 --> 00:00:25,678
Been a while since I was up here
in front of you.
00:00:25,762 --> 00:00:28,847
Maybe I'll do us all a favour
and just stick to the cards.
00:00:31,935 --> 00:00:34,603
There's been speculation that I was
involved in the events that occurred
on the freeway and the rooftop...
00:00:36,189 --> 00:00:39,233
Sorry, Mr Stark, do you
honestly expect us to believe that
00:00:39,317 --> 00:00:42,903
that was a bodyguard
in a suit that conveniently appeared,
00:00:42,987 --> 00:00:45,698
despite the fact
that you sorely despise bodyguards?
00:00:45,782 --> 00:00:46,907
00:00:46,991 --> 00:00:51,662
And this mysterious bodyguard
was somehow equipped
A simple solution based on the 4 types of lines you can have: empty, numeric index, timeline, and text.
You can loop over the lines, classify each one, and act accordingly.
In fact, the action for both kinds of non-text, non-empty lines (timeline and numeric) is the same. Thus:
    import re

    with open('yourfile.txt') as f:
        exampleText = f.read()

    new = ''
    for line in exampleText.split('\n'):
        if line == '':
            new += '\n\n'
        elif re.search('[a-zA-Z]', line):  # check if there is text
            new += line + ' '
        else:
            new += line + '\n'
    >>> print(new)
    1
    00:00:23,343 --> 00:00:25,678
    Been a while since I was up here in front of you.

    2
    00:00:25,762 --> 00:00:28,847
    Maybe I'll do us all a favour and just stick to the cards.

    ...
In the pattern [a-zA-Z]:
[ ] indicates any one of the characters inside the brackets
a-z indicates the range of characters a-z
A-Z indicates the range of characters A-Z
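To see how that character class separates the line types, here is a minimal sketch. The helper name `classify` and the label strings are made up for illustration and are not part of the answer above.

```python
import re

def classify(line):
    """Label one subtitle line as 'empty', 'text', or 'other' (index or timeline)."""
    if line == '':
        return 'empty'
    if re.search('[a-zA-Z]', line):  # at least one ASCII letter anywhere in the line
        return 'text'
    return 'other'

for sample in ['1', '00:00:23,343 --> 00:00:25,678',
               'Been a while since I was up here', '']:
    print(repr(sample), '->', classify(sample))
```

Note that a subtitle consisting only of digits or punctuation (for example a line reading "1984") contains no letter, so this test would treat it like an index line.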
The pattern seems to be:
I would write a loop that reads lines 1) and 2), and then a nested loop that reads lines 3) until it finds a blank line. This nested loop could join those lines into a single line.
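A rough sketch of that nested-loop idea, assuming lines 1) and 2) are the numeric index and the timeline and lines 3) are the text lines of a block. The function name `join_blocks` is hypothetical, and the input is taken as a list of lines rather than read from a file.

```python
def join_blocks(lines):
    """Join the text lines of each subtitle block into a single line.

    Assumes each block is: index line, timeline line, zero or more text
    lines, then a blank line (the last block may lack the blank line).
    """
    it = iter(lines)
    out = []
    for index_line in it:
        if index_line.strip() == '':
            continue  # skip stray blank lines between blocks
        timeline = next(it, '')
        text_parts = []
        # Inner loop: consume text lines until the blank separator
        for line in it:
            if line.strip() == '':
                break
            text_parts.append(line.strip())
        out.extend([index_line.strip(), timeline.strip(),
                    ' '.join(text_parts), ''])
    return out

sample = [
    '1',
    '00:00:23,343 --> 00:00:25,678',
    'Been a while since I was up here',
    'in front of you.',
    '',
    '2',
    '00:00:25,762 --> 00:00:28,847',
    "Maybe I'll do us all a favour",
    'and just stick to the cards.',
]
print('\n'.join(join_blocks(sample)))
```

Both loops share one iterator, so the inner loop picks up exactly where the outer loop left off and no line is read twice.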
The first block needs special handling because no blank line precedes it; the rest works as you expected.

    with open('/home/cam/Documents/1.txt') as f, open('mytxt', 'w') as f_out:
        # Strip every line so the blank separator lines become ''
        lines = [line.strip() for line in f]
        # Positions of the blank separators; -1 and len(lines) act as
        # virtual separators before the first block and after the last
        breaks = [-1] + [i for i, x in enumerate(lines) if x == ''] + [len(lines)]
        for start, end in zip(breaks, breaks[1:]):
            block = lines[start + 1:end]
            if not block:
                continue
            # block[0] is the index, block[1] the timeline, the rest is text
            block[2:] = [' '.join(block[2:])]
            for item in block:
                f_out.write(item + '\n')
    import re

    with open('yourfile.txt') as f:
        exampleText = f.read()

    # Raw strings keep \g<1> from being read as an invalid string escape
    result = re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n',
                    re.sub(r'([^0-9])\n', r'\g<1> ', exampleText))
The first replacement turns the newline after every line that does not end in a digit (that is, after each text line) into a space:

    tmp = re.sub(r'([^0-9])\n', r'\g<1> ', exampleText)
That replacement also consumes the newline after the last text line of each block, so the blank line that separated the blocks is lost. The second replacement restores it by adding a newline in front of each numeric line:

    re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', tmp)
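For a quick check of the two-step substitution, here is a small self-contained run on two blocks taken from the question. The variable names `sample` and `result` are only illustrative, and raw strings are used for both patterns and replacements.

```python
import re

sample = (
    "1\n"
    "00:00:23,343 --> 00:00:25,678\n"
    "Been a while since I was up here\n"
    "in front of you.\n"
    "\n"
    "2\n"
    "00:00:25,762 --> 00:00:28,847\n"
    "Maybe I'll do us all a favour\n"
    "and just stick to the cards.\n"
)

# Step 1: newline after a non-digit character becomes a space
tmp = re.sub(r'([^0-9])\n', r'\g<1> ', sample)
# Step 2: put a blank line back in front of each numeric index line
result = re.sub(r'\n([0-9]+)\n', r'\n\n\g<1>\n', tmp)
print(result)
```

One caveat with this approach: a text line that happens to end in a digit is not matched by the first pattern, so it would not be joined with the line after it.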