We have an application that uses a 9-node MarkLogic cluster: 3 E-nodes and 6 D-nodes. We upgraded to MarkLogic 7.0-5.1 a few months back. For the last few weeks, we have seen an issue that occurs during heavy load on the site. All of a sudden, the E-node threads get exhausted and the site goes down. MarkLogic sometimes recovers from this on its own, but there have been instances when one of the hosts needed to be restarted. If we search the logs around the time of an outage, a common pattern we see on the E-nodes is:
Notice: XXX-XDBC-9000: XDMP-NEWSTAMP: fn:doc($uri)[$uri]/doc:document -- Timestamp too new for forest YYY
It usually mentions just one forest (YYY in the example above; the next outage may mention a different one). In the log for the D-node hosting this forest, a few similar entries can be seen:
Error: LockTask::run: ZZZ Error: XDMP-NEWSTAMP: Timestamp too new for forest YYY (14470707006887600)
It looks like the problem happens when a lot of content is being loaded/updated in MarkLogic while a lot of read queries are hitting it at the same time.
Question: I know XDMP-NEWSTAMP indicates that transactions running at earlier timestamps did not commit and transactions running at later timestamps receive this error. I also know that the error is retryable. However, I am trying to find out whether it really suggests that simultaneous read and write queries are the root cause, or whether something else is going on.
Note that we have a Java app on top of the MarkLogic cluster using XCC, which is on the receiving end of this issue. We also have a merge blackout in place on weekdays to prevent merges beyond 10 GB.
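For context, since the error is retryable, the workaround we are considering on the XCC side is a bounded retry with exponential backoff around each request. Below is a minimal, self-contained sketch of that pattern; the RetryableRequestException class here is a local stand-in for XCC's com.marklogic.xcc.exceptions.RetryableXQueryException so the snippet compiles without a MarkLogic server or the XCC jar on the classpath, and the simulated request in main is a placeholder for a real session.submitRequest call:

```java
import java.util.concurrent.Callable;

// Stand-in for XCC's com.marklogic.xcc.exceptions.RetryableXQueryException,
// which XCC throws for retryable errors such as XDMP-NEWSTAMP.
class RetryableRequestException extends Exception {
    RetryableRequestException(String msg) { super(msg); }
}

public class NewstampRetry {

    /**
     * Runs the request, retrying with exponential backoff whenever a
     * retryable error is raised, up to maxAttempts attempts in total.
     */
    public static <T> T withRetry(Callable<T> request, int maxAttempts) throws Exception {
        long delayMs = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                return request.call();
            } catch (RetryableRequestException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last attempt
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // back off: 100ms, 200ms, 400ms, ...
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulate a request that fails twice with XDMP-NEWSTAMP, then succeeds.
        final int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new RetryableRequestException("XDMP-NEWSTAMP: Timestamp too new");
            }
            return "ok";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the real app the Callable would wrap the XCC submitRequest call and catch RetryableXQueryException instead; this at least keeps individual queries alive through a transient NEWSTAMP window, though it obviously does not address the underlying thread exhaustion.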
Any help is really appreciated!!