public class ExternalTagger extends Object implements IDocumentTagger, IXMLConfigurable
Extracts metadata from a document using an external application.
This tagger relies heavily on the mechanics of ExternalTransformer, with a few differences:
- There is no ${OUTPUT} token (since taggers do not modify content).
- Sending the document content to the external application can be disabled with setInputDisabled(boolean).
Refer to the ExternalTransformer class for documentation.
To use an external application to change a file's content, consider using ExternalTransformer instead.
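The effect of setInputDisabled(boolean) can be pictured as a small decision rule. The sketch below is a hypothetical illustration, not the library's code: the InputMode enum and the resolve method are invented names, and the fallback to standard input when ${INPUT} is absent is an assumption based on how ExternalTransformer is described.

```java
// Illustrative sketch only; not the Norconex implementation.
class InputModeSketch {

    /** Hypothetical ways the document content could reach the application. */
    enum InputMode { NONE, FILE, STDIN }

    // Assumed literal form of the input token used in commands.
    static final String TOKEN_INPUT = "${INPUT}";

    /**
     * Decides how content is sent. Disabling input wins regardless of
     * whether the ${INPUT} token is part of the command.
     */
    static InputMode resolve(String command, boolean inputDisabled) {
        if (inputDisabled) {
            return InputMode.NONE;
        }
        // Assumption: without the token, content goes to standard input.
        return command.contains(TOKEN_INPUT) ? InputMode.FILE : InputMode.STDIN;
    }

    public static void main(String[] args) {
        System.out.println(resolve("/path/tag/app ${INPUT} ${OUTPUT_META}", false));
        System.out.println(resolve("/path/tag/app ${INPUT} ${OUTPUT_META}", true));
    }
}
```

Disabling input may be useful when the external application only needs the document reference or its metadata.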
XML configuration usage:
<tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
  <restrictTo caseSensitive="[false|true]"
      field="(name of header/metadata field name to match)">
    (regular expression of value to match)
  </restrictTo>
  <!-- multiple "restrictTo" tags allowed (only one needs to match) -->
  <command inputDisabled="[false|true]">
    c:\Apps\myapp.exe ${INPUT} ${INPUT_META} ${OUTPUT_META} ${REFERENCE}
  </command>
  <metadata
      inputFormat="[json|xml|properties]"
      outputFormat="[json|xml|properties]">
    <!-- pattern only used when no output format is specified -->
    <pattern field="(target field name)"
        fieldGroup="(field name match group index)"
        valueGroup="(field value match group index)"
        caseSensitive="[false|true]">
      (regular expression)
    </pattern>
    <!-- repeat pattern tag as needed -->
  </metadata>
  <environment>
    <variable name="(environment variable name)">
      (environment variable value)
    </variable>
    <!-- repeat variable tag as needed -->
  </environment>
  <tempDir>
    (Optional directory where to store temporary files used for transformation.)
  </tempDir>
</tagger>
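To make the fieldGroup and valueGroup attributes of the pattern tag more concrete, here is a minimal standalone sketch of group-based extraction using java.util.regex. It is not the RegexFieldExtractor implementation: the class and method names are hypothetical, and the use of MULTILINE matching is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the Norconex implementation.
class PatternExtractionSketch {

    /**
     * Scans the application output and, for every match, stores the text of
     * the value group under the field named by the field group.
     */
    static Map<String, List<String>> extract(
            String output, String regex, boolean caseSensitive,
            int fieldGroup, int valueGroup) {
        int flags = Pattern.MULTILINE; // assumption: match line by line
        if (!caseSensitive) {
            flags |= Pattern.CASE_INSENSITIVE;
        }
        Matcher m = Pattern.compile(regex, flags).matcher(output);
        Map<String, List<String>> fields = new LinkedHashMap<>();
        while (m.find()) {
            fields.computeIfAbsent(m.group(fieldGroup), k -> new ArrayList<>())
                  .add(m.group(valueGroup));
        }
        return fields;
    }

    public static void main(String[] args) {
        String output = "title: Hello\nauthor: Bob\nauthor: Ann\n";
        System.out.println(extract(output, "^(\\w+):\\s*(.*)$", false, 1, 2));
    }
}
```

In this sketch a field that matches several times accumulates several values; whether the tagger behaves the same way should be checked against the ExternalTransformer documentation.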
The following example invokes an external application that accepts a document as input and outputs a file containing the new metadata information.
<tagger class="com.norconex.importer.handler.tagger.impl.ExternalTagger">
  <command>/path/tag/app ${INPUT} ${OUTPUT_META}</command>
</tagger>
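The tokens in the command are placeholders that get replaced before the application is launched. The following sketch shows one plausible way such substitution could work; it is not taken from the library, the expand method is a hypothetical helper, and the temporary file paths are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not the Norconex implementation.
class CommandTokenSketch {

    /** Replaces each token literally with the value mapped to it. */
    static String expand(String command, Map<String, String> tokenValues) {
        String expanded = command;
        for (Map.Entry<String, String> e : tokenValues.entrySet()) {
            // String.replace performs literal (non-regex) replacement.
            expanded = expanded.replace(e.getKey(), e.getValue());
        }
        return expanded;
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("${INPUT}", "/tmp/doc-in.tmp");            // hypothetical path
        tokens.put("${OUTPUT_META}", "/tmp/doc-meta-out.tmp"); // hypothetical path
        System.out.println(
                expand("/path/tag/app ${INPUT} ${OUTPUT_META}", tokens));
    }
}
```

Because the replacement is literal and each token includes its closing brace, replacing ${INPUT} leaves ${INPUT_META} untouched.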
See also: ExternalTransformer
Modifier and Type | Field and Description |
---|---|
static String | META_FORMAT_JSON |
static String | META_FORMAT_PROPERTIES |
static String | META_FORMAT_XML |
static String | TOKEN_INPUT |
static String | TOKEN_INPUT_META |
static String | TOKEN_OUTPUT_META |
static String | TOKEN_REFERENCE |
Constructor and Description |
---|
ExternalTagger() |
Modifier and Type | Method and Description |
---|---|
void | addEnvironmentVariable(String name, String value): Adds an environment variable to the list of previously assigned variables (if any). |
void | addEnvironmentVariables(Map<String,String> environmentVariables): Adds the environment variables, keeping environment variables previously assigned. |
void | addMetadataExtractionPattern(String field, String pattern): Adds a metadata extraction pattern that will extract the whole text matched into the given field. |
void | addMetadataExtractionPattern(String field, String pattern, int valueGroup): Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching. |
void | addMetadataExtractionPatterns(RegexFieldExtractor... patterns): Adds metadata extraction patterns that will extract matching field names/values. |
boolean | equals(Object other) |
String | getCommand(): Gets the command to execute. |
Map<String,String> | getEnvironmentVariables(): Gets environment variables. |
List<RegexFieldExtractor> | getMetadataExtractionPatterns(): Gets metadata extraction patterns. |
String | getMetadataInputFormat(): Gets the format of the metadata input file sent to the external application. |
String | getMetadataOutputFormat(): Gets the format of the metadata output file from the external application. |
File | getTempDir(): Gets the directory where temporary files used for transformation are stored. |
int | hashCode() |
boolean | isInputDisabled(): Gets whether to send the document content or not, regardless of whether the ${INPUT} token is part of the command. |
void | loadFromXML(Reader in) |
void | saveToXML(Writer out) |
void | setCommand(String command): Sets the command to execute. |
void | setEnvironmentVariables(Map<String,String> environmentVariables): Sets the environment variables. |
void | setInputDisabled(boolean inputDisabled): Sets whether to send the document content or not, regardless of whether the ${INPUT} token is part of the command. |
void | setMetadataExtractionPatterns(RegexFieldExtractor... patterns): Sets metadata extraction patterns. |
void | setMetadataInputFormat(String metadataInputFormat): Sets the format of the metadata input file sent to the external application. |
void | setMetadataOutputFormat(String metadataOutputFormat): Sets the format of the metadata output file from the external application. |
void | setTempDir(File tempDir): Sets the directory where temporary files used for transformation are stored. |
void | tagDocument(String reference, InputStream input, ImporterMetadata metadata, boolean parsed): Tags a document with extra metadata information. |
String | toString() |
public static final String TOKEN_INPUT
public static final String TOKEN_INPUT_META
public static final String TOKEN_OUTPUT_META
public static final String TOKEN_REFERENCE
public static final String META_FORMAT_JSON
public static final String META_FORMAT_XML
public static final String META_FORMAT_PROPERTIES
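META_FORMAT_PROPERTIES names one of the three metadata file formats listed in the XML configuration (json, xml, properties). As a rough idea of what an external application could write to the ${OUTPUT_META} file in that format, the sketch below reads key/value pairs with java.util.Properties. This assumes plain Java properties syntax; the tagger's actual parser may differ, for instance in how it handles multi-valued fields, and the class and method names here are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Illustrative sketch only; not the Norconex implementation.
class PropertiesMetaSketch {

    /** Parses properties-style text into a sorted field/value map. */
    static Map<String, String> parse(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContent));
        } catch (IOException e) {
            // Reading from a String should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        Map<String, String> fields = new TreeMap<>();
        for (String name : props.stringPropertyNames()) {
            fields.put(name, props.getProperty(name));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical content of a metadata output file.
        String content = "title = Hello World\nauthor=Bob\n";
        System.out.println(parse(content));
    }
}
```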
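public boolean isInputDisabled()
Gets whether to send the document content or not, regardless of whether the ${INPUT} token is part of the command.
Returns:
true to prevent sending the input content

public void setInputDisabled(boolean inputDisabled)
Sets whether to send the document content or not, regardless of whether the ${INPUT} token is part of the command.
Parameters:
inputDisabled - true to prevent sending the input content

public String getCommand()
Gets the command to execute.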

public void setCommand(String command)
Sets the command to execute.
Parameters:
command - the command

public List<RegexFieldExtractor> getMetadataExtractionPatterns()
Gets metadata extraction patterns.
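
public void addMetadataExtractionPattern(String field, String pattern)
Adds a metadata extraction pattern that will extract the whole text matched into the given field.
Parameters:
field - target field in which to store the matched text
pattern - the pattern

public void addMetadataExtractionPattern(String field, String pattern, int valueGroup)
Adds a metadata extraction pattern, which will extract the value from the specified group index upon matching.
Parameters:
field - target field in which to store the matched text
pattern - the pattern
valueGroup - which pattern group to return

public void addMetadataExtractionPatterns(RegexFieldExtractor... patterns)
Adds metadata extraction patterns that will extract matching field names/values.
Parameters:
patterns - extraction patterns

public void setMetadataExtractionPatterns(RegexFieldExtractor... patterns)
Sets metadata extraction patterns.
Parameters:
patterns - extraction patterns

public Map<String,String> getEnvironmentVariables()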
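Gets environment variables.
Returns:
environment variables, or null if using the current process environment variables

public void setEnvironmentVariables(Map<String,String> environmentVariables)
Sets the environment variables. Set to null to use the current process environment variables (default).
Parameters:
environmentVariables - environment variables

public void addEnvironmentVariables(Map<String,String> environmentVariables)
Adds the environment variables, keeping environment variables previously assigned. To use the current process environment variables instead, pass null to setEnvironmentVariables(Map).
Parameters:
environmentVariables - environment variables

public void addEnvironmentVariable(String name, String value)
Adds an environment variable to the list of previously assigned variables (if any). A null name has no effect, while null values are converted to empty strings.
Parameters:
name - environment variable name
value - environment variable value

public String getMetadataInputFormat()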
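Gets the format of the metadata input file sent to the external application. Only applicable when the ${INPUT_META} token is part of the command.

public void setMetadataInputFormat(String metadataInputFormat)
Sets the format of the metadata input file sent to the external application. Only applicable when the ${INPUT_META} token is part of the command.
Parameters:
metadataInputFormat - format of the metadata input file

public String getMetadataOutputFormat()
Gets the format of the metadata output file from the external application. Only applicable when the ${OUTPUT_META} token is part of the command.

public void setMetadataOutputFormat(String metadataOutputFormat)
Sets the format of the metadata output file from the external application. Set to null to rely on metadata extraction patterns instead. Only applicable when the ${OUTPUT_META} token is part of the command.
Parameters:
metadataOutputFormat - format of the metadata output file

public File getTempDir()
Gets the directory where temporary files used for transformation are stored.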

public void setTempDir(File tempDir)
Sets the directory where temporary files used for transformation are stored.
Parameters:
tempDir - temporary directory

public void tagDocument(String reference, InputStream input, ImporterMetadata metadata, boolean parsed) throws ImporterHandlerException
Description copied from interface: IDocumentTagger
Tags a document with extra metadata information.
Specified by:
tagDocument in interface IDocumentTagger
Parameters:
reference - document reference (e.g. URL)
input - document
metadata - document metadata
parsed - whether the document has been parsed already or not (a parsed document should normally be text-based)
Throws:
ImporterHandlerException - problem tagging the document

public void loadFromXML(Reader in) throws IOException
Specified by:
loadFromXML in interface IXMLConfigurable
Throws:
IOException

public void saveToXML(Writer out) throws IOException
Specified by:
saveToXML in interface IXMLConfigurable
Throws:
IOException
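
The null-handling rules documented for the environment variable methods above can be summarized in a small standalone model. This is a hedged sketch of the described behavior, not the class's source: EnvironmentSketch is a hypothetical name, and details such as the backing map type are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the Norconex implementation.
class EnvironmentSketch {

    // null stands for "use the current process environment variables".
    private Map<String, String> environmentVariables;

    Map<String, String> getEnvironmentVariables() {
        return environmentVariables;
    }

    /** Replaces all variables; null reverts to the process environment. */
    void setEnvironmentVariables(Map<String, String> vars) {
        environmentVariables = (vars == null) ? null : new HashMap<>(vars);
    }

    /** Adds variables while keeping those previously assigned. */
    void addEnvironmentVariables(Map<String, String> vars) {
        if (vars != null) {
            vars.forEach(this::addEnvironmentVariable);
        }
    }

    /** A null name is ignored; a null value becomes an empty string. */
    void addEnvironmentVariable(String name, String value) {
        if (name == null) {
            return;
        }
        if (environmentVariables == null) {
            environmentVariables = new HashMap<>();
        }
        environmentVariables.put(name, value == null ? "" : value);
    }

    public static void main(String[] args) {
        EnvironmentSketch env = new EnvironmentSketch();
        env.addEnvironmentVariable(null, "ignored");
        env.addEnvironmentVariable("APP_HOME", null);
        env.addEnvironmentVariable("APP_MODE", "tag");
        System.out.println(env.getEnvironmentVariables());
    }
}
```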
Copyright © 2009–2021 Norconex Inc. All rights reserved.