public class ElasticsearchCommitter extends AbstractMappedCommitter
Commits documents to Elasticsearch. Since version 4.0.0, this committer relies on the Elasticsearch REST API. If you wish to use the Elasticsearch Transport Client, use an older version (the Transport Client will eventually be deprecated by Elastic).
Despite being a subclass of AbstractMappedCommitter, setting an idTargetField is not supported (it is always "id").
Depending on your Elasticsearch version, having dots in field names can have different consequences. Some versions will not accept them and generate errors, while Elasticsearch 5 and up support them but treat them as objects, which may not always be what you want.
If your version of Elasticsearch does not handle dots the way you expect, make sure you do not submit fields with dots. A good strategy is to convert dots to another character (like underscore). This can be accomplished by setting a dotReplacement.
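To illustrate the effect of a dot replacement on field names, here is a minimal standalone sketch. It is not the committer's actual implementation: the class name `DotReplacementSketch` and the method `replaceDots` are hypothetical, and it assumes that a null or empty replacement leaves field names untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and this is not
// the committer's own code.
final class DotReplacementSketch {

    // Returns a copy of the fields with every "." in a field name
    // replaced. Assumption: a null or empty replacement means
    // "do not replace dots".
    static Map<String, String> replaceDots(
            Map<String, String> fields, String replacement) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            String name = entry.getKey();
            if (replacement != null && !replacement.isEmpty()) {
                name = name.replace(".", replacement);
            }
            result.put(name, entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("dc.title", "Sample");
        fields.put("content", "Hello");
        // The "dc.title" field is renamed to "dc_title".
        System.out.println(replaceDots(fields, "_"));
    }
}
```

Note that two distinct source names (for example "a.b" and "a_b") would map to the same target name with an underscore replacement, so the replacement character is worth choosing with your field names in mind.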
In addition, if you are using a Norconex Collector with the Norconex Importer, you can rename the problematic fields with RenameTagger. You can also make sure only the fields you are interested in are making their way to Elasticsearch by using KeepOnlyTagger. If your dot represents a nested object, keep reading.
As of this writing, Elasticsearch 5 and higher have a 512-byte limit on the "_id" field. By default, trying to submit documents with an invalid ID results in an error from Elasticsearch. As of 4.1.0, you can get around this by setting setFixBadIds(boolean) to true. It will truncate references that are too long and append a hash code representing the truncated part. This approach is not 100% collision-free, but it should safely cover the vast majority of cases.
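The truncate-and-hash idea described above can be sketched as follows. This is only an approximation under stated assumptions, not the committer's real algorithm: the class `BadIdFixSketch`, the "!" separator, the use of `String.hashCode()`, and the 12 bytes reserved for the suffix are all choices made for this example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not the committer's actual algorithm.
final class BadIdFixSketch {

    static final int MAX_ID_BYTES = 512;
    // Assumption: "!" separator (1 byte) plus a decimal int hash code
    // (at most 11 characters, e.g. "-2147483648").
    static final int SUFFIX_RESERVE = 12;

    // Returns the ID unchanged if it fits in 512 UTF-8 bytes. Otherwise
    // keeps the longest prefix that fits the budget and appends a hash
    // code of the part that was cut off.
    static String fixId(String id) {
        if (id.getBytes(StandardCharsets.UTF_8).length <= MAX_ID_BYTES) {
            return id;
        }
        int budget = MAX_ID_BYTES - SUFFIX_RESERVE;
        int bytes = 0;
        int cut = 0;
        // Advance one code point at a time so that multi-byte
        // characters are never split.
        while (cut < id.length()) {
            int codePoint = id.codePointAt(cut);
            int size = utf8Size(codePoint);
            if (bytes + size > budget) {
                break;
            }
            bytes += size;
            cut += Character.charCount(codePoint);
        }
        String truncatedPart = id.substring(cut);
        return id.substring(0, cut) + "!" + truncatedPart.hashCode();
    }

    // Number of bytes needed to encode a code point in UTF-8.
    private static int utf8Size(int codePoint) {
        if (codePoint < 0x80) {
            return 1;
        }
        if (codePoint < 0x800) {
            return 2;
        }
        if (codePoint < 0x10000) {
            return 3;
        }
        return 4;
    }
}
```

Because a 32-bit hash stands in for an arbitrarily long tail, two different long references sharing the same prefix can still end up with the same ID, which is why the approach is not fully collision-free.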
Since 4.1.0, it is possible to provide a regular expression that will identify one or more fields containing a JSON object rather than a regular string (setJsonFieldsPattern(String)). For example, this is a useful way to store nested objects. While very flexible, it can be challenging to come up with the JSON structure. You may want to consider custom code to do so, or if you are using Norconex Importer, one approach could be to use the ScriptTagger. For this to work properly, make sure you define your Elasticsearch field mappings on your index/type beforehand.
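The sketch below shows the general idea of treating fields that match a pattern as raw JSON while quoting every other field as a string. It is a simplified illustration, not the committer's code: the class `JsonFieldsPatternSketch` is hypothetical, and it assumes the regular expression must match the whole field name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch only; not the committer's actual implementation.
final class JsonFieldsPatternSketch {

    // Builds a JSON document from the fields. Values of fields whose
    // name fully matches the pattern are written as-is (they are
    // expected to already be valid JSON); all other values are written
    // as escaped JSON strings.
    static String toJson(Map<String, String> fields, String jsonFieldsPattern) {
        Pattern pattern = jsonFieldsPattern == null
                ? null : Pattern.compile(jsonFieldsPattern);
        StringBuilder json = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> entry : fields.entrySet()) {
            if (!first) {
                json.append(',');
            }
            first = false;
            json.append(quote(entry.getKey())).append(':');
            if (pattern != null && pattern.matcher(entry.getKey()).matches()) {
                json.append(entry.getValue());
            } else {
                json.append(quote(entry.getValue()));
            }
        }
        return json.append('}').toString();
    }

    // Minimal JSON string escaping.
    static String quote(String text) {
        StringBuilder b = new StringBuilder("\"");
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '"':  b.append("\\\""); break;
                case '\\': b.append("\\\\"); break;
                case '\n': b.append("\\n"); break;
                case '\r': b.append("\\r"); break;
                case '\t': b.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        b.append(String.format("\\u%04x", (int) c));
                    } else {
                        b.append(c);
                    }
            }
        }
        return b.append('"').toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "Sample");
        fields.put("address", "{\"city\":\"Gatineau\"}");
        System.out.println(toJson(fields, "address"));
    }
}
```

With the pattern set to "address", the address value is embedded as a nested object. Without a pattern, the same value would be sent as a plain string containing escaped quotes.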
Basic authentication is supported for password-protected clusters. The password can optionally be encrypted using EncryptionUtil (or command-line "encrypt.bat" or "encrypt.sh"). In order for the password to be decrypted properly, you need to specify the encryption key used to encrypt it. The key can be stored in a few supported locations and a combination of passwordKey and passwordKeySource must be specified to properly locate the key. The supported sources are:
passwordKeySource | passwordKey |
---|---|
key | The actual encryption key. |
file | Path to a file containing the encryption key. |
environment | Name of an environment variable containing the key. |
property | Name of a JVM system property containing the key. |
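As a rough illustration of how the four sources in the table differ, the following standalone sketch resolves a key from a passwordKey value and a source. It does not use or reproduce the Norconex EncryptionKey or EncryptionUtil classes: the class `PasswordKeySketch`, its `Source` enum, and the trimming of the file content are assumptions made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch only; it does not use the Norconex encryption classes.
final class PasswordKeySketch {

    // Mirrors the four passwordKeySource values from the table.
    enum Source { KEY, FILE, ENVIRONMENT, PROPERTY }

    // Resolves the encryption key from the given value and source.
    // Returns null when an environment variable or system property
    // is not defined.
    static String resolveKey(String passwordKey, Source source) {
        switch (source) {
            case KEY:
                // The value is the key itself.
                return passwordKey;
            case FILE:
                // The value is a path to a file holding the key.
                // Assumption: surrounding whitespace is not part of the key.
                try {
                    byte[] content = Files.readAllBytes(Paths.get(passwordKey));
                    return new String(content, StandardCharsets.UTF_8).trim();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            case ENVIRONMENT:
                // The value is the name of an environment variable.
                return System.getenv(passwordKey);
            case PROPERTY:
                // The value is the name of a JVM system property.
                return System.getProperty(passwordKey);
            default:
                throw new IllegalArgumentException("Unsupported source: " + source);
        }
    }
}
```

Storing the key in a file, environment variable, or system property keeps it out of the configuration file itself, which is the main reason to prefer those sources over "key".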
Since 4.1.0, it is possible to specify timeout values (in milliseconds), applied when data is sent to Elasticsearch.
Since 4.1.0, the typeName configuration is optional. As of Elasticsearch 7.0, specifying types in bulk requests is deprecated. Simply omit it if you are using Elasticsearch 7 or higher.
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <nodes>
        (Comma-separated list of Elasticsearch node URLs.
        Defaults to http://localhost:9200)
    </nodes>
    <indexName>(Name of the index to use)</indexName>
    <typeName>(Name of the type to use. Deprecated since Elasticsearch v7)</typeName>
    <ignoreResponseErrors>[false|true]</ignoreResponseErrors>
    <discoverNodes>[false|true]</discoverNodes>
    <dotReplacement>
        (Optional value replacing dots in field names)
    </dotReplacement>
    <jsonFieldsPattern>
        (Optional regular expression to identify fields containing JSON
        objects instead of regular strings)
    </jsonFieldsPattern>
    <connectionTimeout>(milliseconds)</connectionTimeout>
    <socketTimeout>(milliseconds)</socketTimeout>
    <fixBadIds>
        [false|true](Forces references to fit into Elasticsearch _id field.)
    </fixBadIds>

    <!-- Use the following if authentication is required. -->
    <username>(Optional user name)</username>
    <password>(Optional user password)</password>
    <!-- Use the following if password is encrypted. -->
    <passwordKey>(the encryption key or a reference to it)</passwordKey>
    <passwordKeySource>[key|file|environment|property]</passwordKeySource>

    <sourceReferenceField keep="[false|true]">
        (Optional name of field that contains the document reference, when
        the default document reference is not used. The reference value
        will be mapped to the Elasticsearch ID field.
        Once re-mapped, this metadata source field is
        deleted, unless "keep" is set to true.)
    </sourceReferenceField>
    <sourceContentField keep="[false|true]">
        (If you wish to use a metadata field to act as the document
        "content", you can specify that field here. Default
        does not take a metadata field but rather the document content.
        Once re-mapped, the metadata source field is deleted,
        unless "keep" is set to true.)
    </sourceContentField>
    <targetContentField>
        (Target repository field name for a document content/body.
        Default is "content".)
    </targetContentField>
    <commitBatchSize>
        (max number of documents to send to Elasticsearch at once)
    </commitBatchSize>
    <queueDir>(optional path where to queue files)</queueDir>
    <queueSize>(max queue size before committing)</queueSize>
    <maxRetries>(max retries upon commit failures)</maxRetries>
    <maxRetryWait>(max delay in milliseconds between retries)</maxRetryWait>
</committer>
XML configuration entries expecting millisecond durations can be provided in human-readable format (English only), as per DurationParser (e.g., "5 minutes and 30 seconds" or "5m30s").
The following example uses the minimum required settings, on the local host.
<committer class="com.norconex.committer.elasticsearch.ElasticsearchCommitter">
    <indexName>some_index</indexName>
</committer>
Modifier and Type | Field and Description |
---|---|
static int | DEFAULT_CONNECTION_TIMEOUT |
static String | DEFAULT_ES_CONTENT_FIELD |
static int | DEFAULT_MAX_RETRY_TIMEOUT Deprecated since 4.1.1. |
static String | DEFAULT_NODE |
static int | DEFAULT_SOCKET_TIMEOUT |
DEFAULT_COMMIT_BATCH_SIZE
DEFAULT_QUEUE_DIR, filesCommitting
DEFAULT_QUEUE_SIZE, queueSize
Constructor and Description |
---|
ElasticsearchCommitter() Constructor. |
Modifier and Type | Method and Description |
---|---|
protected void | close() |
void | commit() |
protected void | commitBatch(List<ICommitOperation> batch) |
protected org.elasticsearch.client.RestClient | createRestClient() |
protected org.elasticsearch.client.sniff.Sniffer | createSniffer(org.elasticsearch.client.RestClient client) |
boolean | equals(Object obj) |
int | getConnectionTimeout() Gets Elasticsearch connection timeout. |
String | getDotReplacement() Gets the character used to replace dots in field names. |
String | getIndexName() Gets the index name. |
String | getJsonFieldsPattern() Gets the regular expression matching fields that contain a JSON object as their value (as opposed to a regular string). |
int | getMaxRetryTimeout() Deprecated since 4.1.1. |
String[] | getNodes() Gets Elasticsearch cluster node URLs. |
String | getPassword() Gets the password. |
EncryptionKey | getPasswordKey() Gets the password encryption key. |
int | getSocketTimeout() Gets Elasticsearch socket timeout. |
String | getTypeName() Gets the type name. |
String | getUsername() Gets the username. |
int | hashCode() |
boolean | isDiscoverNodes() Whether automatic discovery of Elasticsearch cluster nodes should be enabled. |
boolean | isFixBadIds() Gets whether to fix IDs that are too long for Elasticsearch ID limitation (512 bytes max). |
boolean | isIgnoreResponseErrors() Whether to ignore response errors. |
protected void | loadFromXml(XMLConfiguration xml) |
protected void | saveToXML(XMLStreamWriter writer) |
void | setConnectionTimeout(int connectionTimeout) Sets Elasticsearch connection timeout. |
void | setDiscoverNodes(boolean discoverNodes) Sets whether automatic discovery of Elasticsearch cluster nodes should be enabled. |
void | setDotReplacement(String dotReplacement) Sets the character used to replace dots in field names. |
void | setFixBadIds(boolean fixBadIds) Sets whether to fix IDs that are too long for Elasticsearch ID limitation (512 bytes max). |
void | setIgnoreResponseErrors(boolean ignoreResponseErrors) Sets whether to ignore response errors. |
void | setIndexName(String indexName) Sets the index name. |
void | setJsonFieldsPattern(String jsonFieldsPattern) Sets the regular expression matching fields that contain a JSON object as their value (as opposed to a regular string). |
void | setMaxRetryTimeout(int maxRetryTimeout) Deprecated since 4.1.1. |
void | setNodes(String... nodes) Sets cluster node URLs. |
void | setPassword(String password) Sets the password. |
void | setPasswordKey(EncryptionKey passwordKey) Sets the password encryption key. |
void | setSocketTimeout(int socketTimeout) Sets Elasticsearch socket timeout. |
void | setTypeName(String typeName) Sets the type name. |
void | setUsername(String username) Sets the username. |
String | toString() |
getSourceContentField, getSourceReferenceField, getTargetContentField, getTargetReferenceField, isKeepSourceContentField, isKeepSourceReferenceField, loadFromXML, prepareCommitAddition, saveToXML, setKeepSourceContentField, setKeepSourceReferenceField, setSourceContentField, setSourceReferenceField, setTargetContentField, setTargetReferenceField
commitAddition, commitComplete, commitDeletion, getCommitBatchSize, getMaxRetries, getMaxRetryWait, setCommitBatchSize, setMaxRetries, setMaxRetryWait
getInitialQueueDocCount, getQueueDir, prepareCommitDeletion, queueAddition, queueRemoval, setQueueDir
add, getQueueSize, remove, setQueueSize
public static final String DEFAULT_NODE
public static final String DEFAULT_ES_CONTENT_FIELD
public static final int DEFAULT_CONNECTION_TIMEOUT
public static final int DEFAULT_SOCKET_TIMEOUT
@Deprecated public static final int DEFAULT_MAX_RETRY_TIMEOUT
public String[] getNodes()
public void setNodes(String... nodes)
Parameters: nodes - Elasticsearch cluster nodes
public String getIndexName()
public void setIndexName(String indexName)
Parameters: indexName - the index name
public String getTypeName()
public void setTypeName(String typeName)
Parameters: typeName - type name
public String getJsonFieldsPattern()
Default is null.
public void setJsonFieldsPattern(String jsonFieldsPattern)
Parameters: jsonFieldsPattern - regular expression
public boolean isIgnoreResponseErrors()
When true, the errors are logged instead.
Returns: true when ignoring response errors
public void setIgnoreResponseErrors(boolean ignoreResponseErrors)
When false, an exception is thrown if the Elasticsearch response contains an error. When true, the errors are logged instead.
Parameters: ignoreResponseErrors - true when ignoring response errors
public boolean isDiscoverNodes()
Returns: true if enabled
public void setDiscoverNodes(boolean discoverNodes)
Parameters: discoverNodes - true if enabled
public String getUsername()
public void setUsername(String username)
Parameters: username - the username
public String getPassword()
public void setPassword(String password)
Parameters: password - the password
public EncryptionKey getPasswordKey()
Returns: the password key, or null if the password is not encrypted.
See Also: EncryptionUtil
public void setPasswordKey(EncryptionKey passwordKey)
Parameters: passwordKey - password key
See Also: EncryptionUtil
public String getDotReplacement()
Default is null (does not replace dots).
Returns: replacement character or null
public void setDotReplacement(String dotReplacement)
Parameters: dotReplacement - replacement character or null
public int getConnectionTimeout()
public void setConnectionTimeout(int connectionTimeout)
Parameters: connectionTimeout - milliseconds
public int getSocketTimeout()
public void setSocketTimeout(int socketTimeout)
Parameters: socketTimeout - milliseconds
@Deprecated public int getMaxRetryTimeout()
@Deprecated public void setMaxRetryTimeout(int maxRetryTimeout)
Parameters: maxRetryTimeout - milliseconds
public boolean isFixBadIds()
When true, long IDs will be truncated and a hash code representing the truncated part will be appended.
Returns: true to fix IDs that are too long
public void setFixBadIds(boolean fixBadIds)
When true, long IDs will be truncated and a hash code representing the truncated part will be appended.
Parameters: fixBadIds - true to fix IDs that are too long
public void commit()
commit in interface ICommitter
commit in class AbstractFileQueueCommitter
protected void close()
protected void commitBatch(List<ICommitOperation> batch)
commitBatch in class AbstractBatchCommitter
protected org.elasticsearch.client.RestClient createRestClient()
protected org.elasticsearch.client.sniff.Sniffer createSniffer(org.elasticsearch.client.RestClient client)
protected void saveToXML(XMLStreamWriter writer) throws XMLStreamException
saveToXML in class AbstractMappedCommitter
Throws: XMLStreamException
protected void loadFromXml(XMLConfiguration xml)
loadFromXml in class AbstractMappedCommitter
public int hashCode()
hashCode in class AbstractMappedCommitter
public boolean equals(Object obj)
equals in class AbstractMappedCommitter
public String toString()
toString in class AbstractMappedCommitter
Copyright © 2013–2021 Norconex Inc. All rights reserved.