
java - Regex that accepts any combination of letters in any language, symbols, or numbers

Question:

I'm developing a regex that fetches the text from a subtitle file. The text may be in any language and sometimes contains Unicode characters.

String str =
    "1\n"
  + "00:00:25,690 --> 00:00:44,410\n"
  + "As you can see he is no longer 1 year old, he is 12 years old now.\n"
  + "\n"
  + "2\n"
  + "00:00:44,410 --> 00:00:58,120\n"
  + "He helps with the baby girl\n";

Fetching each slot using a regex:

((^1\n|(\\n\\d+\n))(\\d{2}:\\d{2}:\\d{2},\\d{3}.*\\d{2}:\\d{2}:\\d{2},\\d{3}))[\\p{P}\\p{L}\\p{P}*-,;'\"\\s]+

But I recently found that subtitle text slots can contain numbers. How can I cover every possibility: any character in any language, any Unicode character, and any number in between?

I tried adding \p{N}, but that fails: the match now includes the timing and the subtitle index as well, for example:

blah blah blah.400:00:44,410

Is there a way to update the regex so that it matches numbers inside the text slot, but not the numbers that belong to the subtitle index or timing?

Answer:

The .srt format is simple enough that you shouldn't need a big, possibly broken regex to parse it.

As of Java 8, you can use \R to match any newline.

So split your .srt file with "\\R\\R" to get subtitle blocks.

For each subtitle block, split on "\\R" with a limit of 3 elements. You get a String[] containing:

  • id
  • t1 --> t2
  • text in any language, possibly with newlines and numbers inside.

Done!

=> [["1", "00:00:23,480 --> 00:00:27,920", "AM RANDE DER NACHT"],
 ["2", "00:02:22,570 --> 00:02:24,060", "- Salü.\r\n- Monsieur."],
 ["3", "00:02:25,300 --> 00:02:26,890", "- Panne?\r\n- Hm."],
 ["4", "00:02:29,840 --> 00:02:31,830", "Und wieviel brauchst du?"],
 ["5", "00:02:32,340 --> 00:02:34,000", "Von was, Monsieur?"],
 ["6", "00:02:34,120 --> 00:02:35,140", "Na ja, Sprit."],
 ["7", "00:02:36,210 --> 00:02:38,230", "Es äh... es liegt nicht am Sprit."],
 ["8", "00:02:38,490 --> 00:02:40,710", "Es ist, glaub ich, die Kerze."],
 ["9", "00:02:42,220 --> 00:02:43,980", "Was für 'ne Kerze brauchst du?"],
 ["10", "00:02:45,390 --> 00:02:47,800", "Äh, 'ne Kerze eben. Für 'n Moped."]]
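
The two-step split described above can be sketched in Java roughly as follows. This is a minimal illustration, not a full .srt parser: the class name `SrtSplitDemo` and the method `parse` are hypothetical, and the sketch assumes the file uses LF or CRLF line endings. It uses `\r?\n` rather than `\R` so that a `\r\n` pair is always consumed as a single line break; that substitution is a choice made here, not part of the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SrtSplitDemo {
    // Hypothetical helper: splits SRT content into [id, timing, text] triples.
    // Assumes LF or CRLF line endings.
    public static List<String[]> parse(String srt) {
        List<String[]> cues = new ArrayList<>();
        // Two or more consecutive line breaks separate subtitle blocks.
        for (String block : srt.trim().split("(?:\\r?\\n){2,}")) {
            // Limit 3: id, timing line, then everything else as the text,
            // including any embedded newlines and digits.
            String[] parts = block.split("\\r?\\n", 3);
            if (parts.length == 3) {
                cues.add(parts); // blocks with fewer than 3 lines are skipped
            }
        }
        return cues;
    }

    public static void main(String[] args) {
        String srt = "1\n"
                   + "00:00:25,690 --> 00:00:44,410\n"
                   + "As you can see he is no longer 1 year old, he is 12 years old now.\n"
                   + "\n"
                   + "2\n"
                   + "00:00:44,410 --> 00:00:58,120\n"
                   + "He helps with the baby girl\n";
        for (String[] cue : parse(srt)) {
            System.out.println(Arrays.toString(cue));
        }
    }
}
```

Because the text is simply "whatever is left after the first two lines", digits and punctuation in the subtitle never need to be distinguished from the index or the timing.
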

Answer:

You have one mistake in your character class: the - between * and , defines a range (from * to ,), not a literal -. You can escape it or put it at the beginning or end of the character class.
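
To make the range issue concrete, here is a small check of how the hyphen behaves in each position. The class name `HyphenRangeDemo` and the helper `inClass` are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

public class HyphenRangeDemo {
    // Hypothetical helper: true if the whole string matches the character class.
    public static boolean inClass(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        String range = "[*-,]";   // '-' between '*' and ',' forms the range * + ,
        String literal = "[*,-]"; // '-' at the end is a literal hyphen

        System.out.println(inClass(range, "+"));   // true: '+' lies between '*' and ','
        System.out.println(inClass(range, "-"));   // false: '-' is not in that range
        System.out.println(inClass(literal, "-")); // true
        System.out.println(inClass(literal, "+")); // false
    }
}
```
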

Fixing this and adding \p{N} gives us [\p{P}\p{L}\p{P}*,;'"\s\p{N}-]+ which is almost perfect but fails because it doesn't include >.

[\p{P}\p{L}\p{P}*,;'"\s\p{N}>-]+ will be perfect, see demo
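
As a quick sanity check, the two character classes above can be compared in Java. The class name `TextClassCheck` and the helper `fullyMatches` are hypothetical; the patterns are copied from the answer with Java string escaping applied. The sketch tests one subtitle line containing digits and one timing line, which shows that `>` is not covered by `\p{P}`. It also shows that the timing line consists only of characters in the second class, which is worth keeping in mind when deciding where the match should stop.

```java
import java.util.regex.Pattern;

public class TextClassCheck {
    // Class from the answer, without '>'.
    static final Pattern WITHOUT_GT =
        Pattern.compile("[\\p{P}\\p{L}\\p{P}*,;'\"\\s\\p{N}-]+");
    // Class from the answer, with '>' added.
    static final Pattern WITH_GT =
        Pattern.compile("[\\p{P}\\p{L}\\p{P}*,;'\"\\s\\p{N}>-]+");

    // Hypothetical helper: true if the entire string matches the pattern.
    public static boolean fullyMatches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String text = "he is 12 years old now.";
        String timing = "00:00:25,690 --> 00:00:44,410";

        System.out.println(fullyMatches(WITHOUT_GT, text));   // true: digits covered by \p{N}
        System.out.println(fullyMatches(WITH_GT, text));      // true
        System.out.println(fullyMatches(WITHOUT_GT, timing)); // false: '>' is not in \p{P}
        System.out.println(fullyMatches(WITH_GT, timing));    // true
    }
}
```
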
