string - Java library that finds sentence boundaries


Does anyone know of a Java library that finds sentence boundaries? I'm thinking it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

Here is my experience with BreakIterator.

Using the example, I have the following Japanese text:

  今日はパソコンを買った。高性能のマックは早い！とても快適です。

In ASCII, it looks like this:

  \ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here is the part of the sample that I changed:

  static void sentenceExamples() {
      Locale currentLocale = new Locale("ja", "JP");
      BreakIterator sentenceIterator = BreakIterator.getSentenceInstance(currentLocale);
      String someText = "今日はパソコンを買った。高性能のマックは早い！とても快適です。";
      // ... (rest of the sample unchanged)
  }

When I look at the boundary indices, I see this:

  0|13|24|32

But those indices don't correspond to any sentence terminator.
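For reference, a self-contained version of this experiment might look like the sketch below. It is only an illustration under a few assumptions: the class and method names (`SentenceBoundaryDemo`, `boundaries`) are made up, the text is written with Unicode escapes to avoid source-encoding problems, and the leading BOM (U+FEFF) from the escaped form is kept in the string. It uses only the standard `java.text.BreakIterator` API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceBoundaryDemo {

    // Collects every boundary offset reported by the sentence iterator.
    static List<Integer> boundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    public static void main(String[] args) {
        // Same text as above, including the leading BOM (U+FEFF).
        String someText =
            "\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002"
            + "\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01"
            + "\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002";

        // Locale.JAPAN is the ja_JP locale.
        List<Integer> bounds = boundaries(someText, Locale.JAPAN);
        System.out.println(bounds);

        // For each boundary after the first, show the character just before it.
        for (int i = 1; i < bounds.size(); i++) {
            int end = bounds.get(i);
            System.out.printf("boundary %d follows U+%04X%n",
                              end, (int) someText.charAt(end - 1));
        }
    }
}
```

One thing worth checking with this: in the string above, which includes the BOM, the terminators 。, ！ and 。 sit at offsets 12, 23 and 31. BreakIterator reports positions between characters rather than the index of a character, so the indices 13, 24 and 32 are each one past a terminator.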

You wrote:

I'm thinking it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

A basic problem here is that sentence terminators depend on the context. Consider:

How did Dr. Jones calculate 5! without recursion?

This should be recognized as a single sentence, but if you simply split on every possible sentence terminator, you will get three sentences.
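To make the point concrete, here is a minimal sketch of the naive approach. It assumes a form of the example sentence that contains the abbreviation "Dr." and treats '.', '!' and '?' as unconditional terminators. The class and method names (`NaiveSplitter`, `naiveSplit`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class NaiveSplitter {

    // Cuts after every '.', '!' or '?' without looking at the context.
    static List<String> naiveSplit(String text) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                String part = text.substring(start, i + 1).trim();
                if (!part.isEmpty()) {
                    parts.add(part);
                }
                start = i + 1;
            }
        }
        // Keep any trailing text that has no terminator.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            parts.add(rest);
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "How did Dr. Jones calculate 5! without recursion?";
        for (String part : naiveSplit(text)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

Run on that sentence, this yields three fragments ("How did Dr.", "Jones calculate 5!" and "without recursion?"), even though a human reader sees a single sentence.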

So this is a more complex problem than one might think at first. It can be approached using machine learning techniques. You could, for example, look at a project that provides a class specifically for this.

