
ambiguity - JavaCC Ambiguities: How do I tell the parser to choose a certain match from the list of "longer matches"?

Question:

For some input, the parser presents a "Possible kinds of longer matches : { <EXPRESSION>, <TEXT> }", but for some odd reason it chooses the wrong one.

This is the source:


SKIP :
{
  " "
| "\r"
| "\t"
| "\n"
}

TOKEN :
{
  < DOT : "." >
| < LBRACE : "{" >
| < RBRACE : "}" >
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < #LETTER : [ "a"-"z" ] >
| < #DIGIT : [ "0"-"9" ] >
| < #IDENTIFIER: < LETTER > (< LETTER >)* >
| < EXPRESSION : (< IDENTIFIER> < DOT > < IDENTIFIER> < DOT > < IDENTIFIER> ((< DOT > < IDENTIFIER> )* | < LBRACKET > (< DIGIT>)* < RBRACKET >)*)*>
| < TEXT : (( < DOT >)* ( < LETTER > )+ (< DOT >)*)* >
}

void q0() :
{ Token token = null; }
{
  (
    < LBRACE > expression() < RBRACE >
  | ( token = < TEXT >
      {
        getTextTokens().add( token.image );
      }
    )
  )* < EOF >
}

void expression() :
{ Token token = null; }
{
  < EXPRESSION >
}


If we try to parse "a.bc.d" with this grammar, it reports " FOUND A <EXPRESSION> MATCH (a.bc.d) ".

My question is why did it choose to parse the input as an <EXPRESSION> instead of <TEXT>?

Also, how can I force the parser to choose the right path? I have tried countless LOOKAHEAD scenarios with no success.

For instance, the right path is <TEXT> for the input "a.bc.d" and <EXPRESSION> for "{a.bc.d}".

Thanks in advance.

Answer:

From the JavaCC FAQ:

If more than one regular expression describes the longest possible prefix, then the regular expression that comes first in the .jj file is used.

So a preference can be established by ordering ambiguous definitions accordingly.
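
To see why the grammar in the question picks <EXPRESSION> for "a.bc.d", the rule can be simulated outside JavaCC. What follows is only a rough Java sketch under stated assumptions: the class LongestMatchDemo and its methods are hypothetical names, and the two patterns are hand translations of the question's token definitions into java.util.regex, not anything JavaCC generates. Both kinds match all six characters, so the definition order decides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JavaCC: mimics "longest match wins,
// and on a tie the kind defined first wins" using java.util.regex.
class LongestMatchDemo {

    // Hand translations of the question's EXPRESSION and TEXT tokens (assumption).
    static final Pattern EXPRESSION = Pattern.compile(
        "([a-z]+\\.[a-z]+\\.[a-z]+((\\.[a-z]+)*|\\[[0-9]*\\])*)*");
    static final Pattern TEXT = Pattern.compile("(\\.*[a-z]+\\.*)*");

    // Length of the longest non-empty prefix that the pattern matches in full, or 0.
    static int longestPrefix(Pattern p, String input) {
        for (int len = input.length(); len >= 1; len--) {
            if (p.matcher(input.substring(0, len)).matches()) {
                return len;
            }
        }
        return 0;
    }

    // Map iteration order stands in for the definition order in the .jj file.
    static Map<String, Pattern> ordered(boolean expressionFirst) {
        Map<String, Pattern> kinds = new LinkedHashMap<>();
        if (expressionFirst) {
            kinds.put("EXPRESSION", EXPRESSION);
            kinds.put("TEXT", TEXT);
        } else {
            kinds.put("TEXT", TEXT);
            kinds.put("EXPRESSION", EXPRESSION);
        }
        return kinds;
    }

    // Name of the winning kind, or null if no kind matches a non-empty prefix.
    static String pick(Map<String, Pattern> kinds, String input) {
        String best = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> e : kinds.entrySet()) {
            int len = longestPrefix(e.getValue(), input);
            if (len > bestLen) { // strict '>' keeps the earlier kind on a tie
                bestLen = len;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(pick(ordered(true), "a.bc.d"));  // prints EXPRESSION
        System.out.println(pick(ordered(false), "a.bc.d")); // prints TEXT
    }
}
```

Note that reordering changes the choice for every input, including text inside braces, because the token manager does not know what the parser expects next. That is likely why LOOKAHEAD, which only guides the parser, had no effect here.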

Answer:

If expressions only appear within { braces }, only expressions (and white space) appear in braces, and braces are only used to delimit expressions, then you can do something like the following. See question 3.11 in the FAQ if you are not familiar with lexical states.

// The following abbreviations hold in any state.
TOKEN : {
  < #LETTER : [ "a"-"z" ] >
| < #DIGIT : [ "0"-"9" ] >
| < #IDENTIFIER: < LETTER > (< LETTER >)* >
}

// Skip white space in either state
<DEFAULT,INBRACES> SKIP : { " "  | "\r" | "\t" | "\n" }

// The following are recognized in the default state.
// A left brace forces a switch to the INBRACES state.
<DEFAULT> TOKEN : {
  < DOT : "." >
| < LBRACE : "{" > : INBRACES
| < LBRACKET: "[" >
| < RBRACKET: "]" >
| < TEXT : (( < DOT >)* ( < LETTER > )+ (< DOT >)*)* >
}

// A right brace forces a switch to the DEFAULT state.
<DEFAULT, INBRACES> TOKEN : {
  < RBRACE : "}"  > : DEFAULT
}

// Expressions are only recognized in the INBRACES state.
<INBRACES> TOKEN : {
  < EXPRESSION : (< IDENTIFIER> < DOT > < IDENTIFIER> < DOT > < IDENTIFIER> ((< DOT > < IDENTIFIER> )* | < LBRACKET > (< DIGIT>)* < RBRACKET >)*)*>
}

It looks a bit dodgy that DOT is defined in one state and used in another. However, I think that it works fine.
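
To make the state switching concrete, here is a small hand-written tokenizer in Java that imitates the DEFAULT and INBRACES states of the grammar above. It is a sketch under assumptions, not what JavaCC generates: LexicalStateDemo, Rule and tokenize are hypothetical names, the patterns are hand translations that require a non-empty match, and greedy lookingAt() is assumed to coincide with the longest match for these particular patterns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a two-state lexer; names are not part of JavaCC.
class LexicalStateDemo {
    enum State { DEFAULT, INBRACES }

    // One rule: the state it is active in, its kind, its pattern, and the state to enter next.
    static final class Rule {
        final State in; final String kind; final Pattern pattern; final State next;
        Rule(State in, String kind, String regex, State next) {
            this.in = in; this.kind = kind; this.pattern = Pattern.compile(regex); this.next = next;
        }
    }

    static final String EXPR =
        "([a-z]+\\.[a-z]+\\.[a-z]+(\\.[a-z]+|\\[[0-9]*\\])*)+";

    // List order stands in for definition order in the .jj file.
    static final List<Rule> RULES = Arrays.asList(
        new Rule(State.DEFAULT,  "DOT",        "\\.",               State.DEFAULT),
        new Rule(State.DEFAULT,  "LBRACE",     "\\{",               State.INBRACES),
        new Rule(State.DEFAULT,  "TEXT",       "(\\.*[a-z]+\\.*)+", State.DEFAULT),
        new Rule(State.DEFAULT,  "RBRACE",     "\\}",               State.DEFAULT),
        new Rule(State.INBRACES, "RBRACE",     "\\}",               State.DEFAULT),
        new Rule(State.INBRACES, "EXPRESSION", EXPR,                State.INBRACES));

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        State state = State.DEFAULT;
        int pos = 0;
        while (pos < input.length()) {
            if (" \r\t\n".indexOf(input.charAt(pos)) >= 0) { pos++; continue; } // SKIP in both states
            Rule best = null;
            int bestEnd = pos;
            for (Rule r : RULES) {
                if (r.in != state) continue; // only rules of the current state compete
                Matcher m = r.pattern.matcher(input).region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) { // longer wins; earlier rule wins ties
                    best = r;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("Lexical error at " + pos + " in state " + state);
            }
            out.add(best.kind + "(" + input.substring(pos, bestEnd) + ")");
            state = best.next;
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a.bc.d"));   // prints [TEXT(a.bc.d)]
        System.out.println(tokenize("{a.bc.d}")); // prints [LBRACE({), EXPRESSION(a.bc.d), RBRACE(})]
    }
}
```

Because TEXT and EXPRESSION never compete in the same state, the tie from the original grammar cannot arise, and the parser productions q0() and expression() can stay as they are.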
