I'm developing a regex that fetches the text from a subtitle file. The file may be in any language and sometimes contains Unicode characters.
String str = "1\n"
    + "00:00:25,690 --> 00:00:44,410\n"
    + "As you can see he is no longer 1 year old, he is 12 years old now.\n"
    + "2\n"
    + "00:00:44,410 --> 00:00:58,120\n"
    + "He helps with the baby girl\n";
I fetch each slot using this regex:
((^1\n|(\\n\\d+\n))(\\d{2}:\\d{2}:\\d{2},\\d{3}.*\\d{2}:\\d{2}:\\d{2},\\d{3}))[\\p{P}\\p{L}\\p{P}*-,;'\"\\s]+
But I recently found that subtitle text slots can contain numbers. How can I cover all possibilities: any character, any language, any Unicode character, and any number in between?
I tried adding \p{N}, but that fails: the match now includes the timing and the subtitle index as well, sometimes producing something like: blah blah blah.400:00:44,410
Is there a way to update the regex so that it matches numbers found in the text slot, but not the numbers that belong to the subtitle index or timing?
The .srt specification is so simple that you shouldn't write a big, possibly broken regex to parse it.
As of Java 8, you can use \R to match any line break.
So split your .srt file with "\\R\\R" to get the subtitle blocks.
For each subtitle block, split around "\\R" with a maximum of 3 elements.
You get a String[] with the subtitle index, the timing line, and the subtitle text (which may span several lines).
Done!
=> [["1", "00:00:23,480 --> 00:00:27,920", "AM RANDE DER NACHT"],
["2", "00:02:22,570 --> 00:02:24,060", "- Salü.\r\n- Monsieur."],
["3", "00:02:25,300 --> 00:02:26,890", "- Panne?\r\n- Hm."],
["4", "00:02:29,840 --> 00:02:31,830", "Und wieviel brauchst du?"],
["5", "00:02:32,340 --> 00:02:34,000", "Von was, Monsieur?"],
["6", "00:02:34,120 --> 00:02:35,140", "Na ja, Sprit."],
["7", "00:02:36,210 --> 00:02:38,230", "Es äh... es liegt nicht am Sprit."],
["8", "00:02:38,490 --> 00:02:40,710", "Es ist, glaub ich, die Kerze."],
["9", "00:02:42,220 --> 00:02:43,980", "Was für 'ne Kerze brauchst du?"],
["10", "00:02:45,390 --> 00:02:47,800", "Äh, 'ne Kerze eben. Für 'n Moped."]]
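The two split steps described above can be sketched in Java roughly as follows. This is only a minimal sketch: the class name SrtSplitter and the method parse are made up for illustration, and the input is assumed to be already loaded into a String. It wraps \R in an atomic group and allows more than two breaks, which is a defensive variation on the plain "\\R\\R" so that a "\r\n" pair is always consumed as a single line break.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
class SrtSplitter {

    // Splits raw .srt content into String[] entries of up to three parts:
    // [0] index, [1] timing line, [2] text (which may span several lines).
    static List<String[]> parse(String srt) {
        List<String[]> blocks = new ArrayList<>();
        // Two or more consecutive line breaks separate subtitle blocks.
        // The atomic group (?>...) keeps "\r\n" together as one line break.
        for (String block : srt.trim().split("(?>\\R){2,}")) {
            if (block.isEmpty()) {
                continue; // nothing to parse (e.g. empty input)
            }
            // Limit 3: any further line breaks stay inside the text element.
            blocks.add(block.split("\\R", 3));
        }
        return blocks;
    }

    public static void main(String[] args) {
        String srt = "1\n00:00:25,690 --> 00:00:44,410\n"
                + "As you can see he is no longer 1 year old, he is 12 years old now.\n"
                + "\n"
                + "2\n00:00:44,410 --> 00:00:58,120\n"
                + "He helps with the baby girl\n";
        for (String[] parts : parse(srt)) {
            System.out.println(String.join(" | ", parts));
        }
    }
}
```

Because the text is never run through a character class, digits inside a subtitle line need no special handling here.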
You have one mistake in your character class: the - between * and , means a range and not the char -. You can escape it or put it at the beginning/end of the character class.
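To see the range effect in isolation, here is a small Java sketch. The class name CharClassRange and the helper matches are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class CharClassRange {

    // True if the whole input matches the given character class exactly once.
    static boolean matches(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        // "*-," is a range from '*' (U+002A) to ',' (U+002C): it covers '*', '+' and ','.
        System.out.println(matches("[*-,]", "+"));   // true
        // The hyphen itself (U+002D) lies outside that range.
        System.out.println(matches("[*-,]", "-"));   // false
        // A hyphen at the end of the class is a literal hyphen.
        System.out.println(matches("[*,-]", "-"));   // true
        // An escaped hyphen is a literal hyphen as well.
        System.out.println(matches("[*\\-,]", "-")); // true
    }
}
```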
Fixing this and adding \p{N} gives us [\p{P}\p{L}\p{P}*,;'"\s\p{N}-]+ which is almost perfect but fails because it doesn't include >.
[\p{P}\p{L}\p{P}*,;'"\s\p{N}>-]+ will be perfect, see demo
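As a quick check of the character class on its own (not of the full regex from the question), it can be tried in Java like this. The class name SubtitleTextClass and the method isAllowed are hypothetical names used only for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class SubtitleTextClass {

    // The final character class from above, written as a Java string literal.
    static final Pattern TEXT =
            Pattern.compile("[\\p{P}\\p{L}\\p{P}*,;'\"\\s\\p{N}>-]+");

    // True if every character of s belongs to the class.
    static boolean isAllowed(String s) {
        return TEXT.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Text containing digits is accepted.
        System.out.println(isAllowed(
                "As you can see he is no longer 1 year old, he is 12 years old now.")); // true
        // Non-ASCII letters are covered by \p{L}.
        System.out.println(isAllowed("Äh, 'ne Kerze eben. Für 'n Moped.")); // true
        // A timing line is accepted too.
        System.out.println(isAllowed("00:00:44,410 --> 00:00:58,120")); // true
    }
}
```

Note the last case: since the class also accepts a timing line, it does not by itself stop at the boundary of the next subtitle block.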