Since Crate Data has sensible defaults, there is no configuration needed at all for basic operation.
Crate Data is mainly configured via a configuration file located at config/crate.yml. The vanilla configuration file distributed with the package lists all available settings as comments, along with their corresponding default values.
The location of the config file can be specified upon startup like this:
sh$ ./bin/crate -Des.config=/path/to/config.yml
Any option can be configured either in the config file or as a system property. When using system properties, the required es. prefix is not part of the setting name and is stripped before the option is applied.
For example, configuring the cluster name by using system properties will work this way:
sh$ ./bin/crate -Des.cluster.name=cluster
This is exactly the same as setting the cluster name in the config file:
cluster.name = cluster
Settings are applied in the following order, with later sources overriding earlier ones:
- internal defaults
- system properties
- options from config file
- command-line properties
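For example, if config/crate.yml sets cluster.name = cluster, a command-line property still wins because it is applied last (the name other-cluster is illustrative):

sh$ ./bin/crate -Des.cluster.name=other-cluster

The node then joins other-cluster, regardless of the value in the config file.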
The name of the Crate Data cluster the node should join.
The IP address Crate Data will bind itself to. This setting sets both the network.bind_host and network.publish_host values.
This setting determines which address Crate Data binds itself to. To bind to localhost only, set it to any local address or to _local_.
This setting is used by a Crate Data node to publish its own address to the rest of the cluster. By default it is the first non-local address.
To explicitly bind Crate Data to a specific interface, use the interface name between underscores, for example _eth0_. This resolves to the IP address of that interface. With _eth0:ipv4_ or _eth0:ipv6_ you explicitly listen on the IPv4 or IPv6 address of the interface.
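A minimal crate.yml sketch, assuming the combined bind/publish setting described above is network.host and that the machine has an eth0 interface:

# bind and publish on the IPv4 address of eth0
network.host: _eth0:ipv4_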
By default Crate Data will listen on the first free port in the range of 4200-4300 for HTTP requests. If there is no Crate Data process already running on that machine, this will usually be 4200.
For internal cluster communication Crate Data uses TCP and listens on the first free port in the range of 4300-4400.
All currently applied cluster settings can be read by querying the sys.cluster.settings column. Most cluster settings can be changed at runtime using the SET/RESET statement. Whether this is supported is documented with each setting.
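For example, a sketch using the stats settings documented below (assuming they support SET/RESET at runtime):

SELECT settings FROM sys.cluster;
SET GLOBAL TRANSIENT stats.enabled = true;
RESET GLOBAL stats.enabled;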
A boolean indicating whether or not to collect statistical information about the cluster.
The number of jobs kept in the sys.jobs_log table on each node for performance analytics. Older entries are deleted when jobs_log_size is reached. A single SQL statement results in a job being executed on the cluster. A higher value provides more data for analysis but also occupies more RAM. Setting it to 0 disables collecting job information.
The number of operations kept in the sys.operations_log table on each node for performance analytics. Older entries are deleted when operations_log_size is reached. A job consists of one or many operations. A higher value provides more data for analysis but also occupies more RAM. Setting it to 0 disables collecting operation information.
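A crate.yml sketch with illustrative sizes, assuming the settings live under the stats. prefix:

stats.enabled: true
stats.jobs_log_size: 2000
stats.operations_log_size: 4000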
none: No minimum data availability is required. The node may shut down even if records are missing after shutdown.
primaries: At least all primary shards need to be available after the node has shut down. Replicas may be missing.
full: All records and all replicas need to be available after the node has shut down. Data availability is full.
Note
This option is ignored if there is only 1 node in a cluster!
true: The graceful stop command allows shards to be reallocated before shutting down the node in order to ensure minimum data availability set with min_availability.
false: The graceful stop command will fail if the cluster would need to reallocate shards in order to ensure the minimum data availability set with min_availability.
Note
Make sure you have enough nodes and enough disk space for the reallocation.
Defines the maximum waiting time in milliseconds for the reallocation process to finish. The force setting will define the behaviour when the shutdown process runs into this timeout.
The timeout expects a time value either as a long or double, or alternatively as a string literal with a time suffix (ms, s, m, h, d, w).
Defines whether graceful stop should force stopping of the node if it runs into the timeout which is specified with the cluster.graceful_stop.timeout setting.
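Putting the graceful stop settings together, a crate.yml sketch (values are illustrative; the min_availability and reallocate key names are assumed to follow the cluster.graceful_stop. prefix named above):

cluster.graceful_stop.min_availability: primaries
cluster.graceful_stop.reallocate: true
cluster.graceful_stop.timeout: 2h
cluster.graceful_stop.force: false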
Set to ensure that a node sees N other master-eligible nodes in order to be considered operational within the cluster. It is recommended to set this higher than 1 when running more than 2 nodes in the cluster.
Set the time to wait for ping responses from other nodes when discovering. Set this option to a higher value on a slow or congested network to minimize discovery failures.
The time a node waits for responses from other nodes to a published cluster state.
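For a three-node cluster, a sketch assuming the conventional zen discovery key names (they are not spelled out above, so treat them as assumptions):

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 10s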
all allows all shard allocations; the cluster can allocate all kinds of shards.
none allows no shard allocations at all; no shard will be moved or created.
primaries allows only primary shards to be moved or created. This includes existing primary shards.
new_primaries allows allocations for new primary shards only. This means that, for example, a newly added node will not allocate any replicas. However, it is still possible to allocate new primary shards for new indices. Whenever you want to perform a zero-downtime upgrade of your cluster, you need to set this value before gracefully stopping the first node and reset it to all after starting the last updated node, as sketched below.
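A sketch of the runtime statements around such an upgrade, assuming the setting key is cluster.routing.allocation.enable and that it is runtime-settable:

SET GLOBAL TRANSIENT cluster.routing.allocation.enable = 'new_primaries';
-- gracefully stop, upgrade, and restart each node in turn, then:
RESET GLOBAL cluster.routing.allocation.enable;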
Allows control over when rebalancing happens, based on the total state of all the index shards in the cluster. Defaults to indices_all_active to reduce chatter during initial recovery.
Defines how many concurrent rebalancing tasks are allowed cluster-wide.
Defines the number of initial recoveries of primaries that are allowed per node. Since the local gateway is used most of the time, these recoveries are fast, so more of them can be handled per node without creating excessive load.
Cluster allocation awareness allows you to configure the allocation of shards and their replicas across generic attributes associated with nodes.
Defines node attributes used for awareness-based allocation of a shard and its replicas. For example, let's say we have defined an attribute rack_id and we start 2 nodes with node.rack_id set to rack_one, then deploy a single index with 5 shards and 1 replica. The index will be fully deployed on the current nodes (5 shards and 1 replica each, 10 shards in total). Now, if we start two more nodes with node.rack_id set to rack_two, shards will relocate to even out the number of shards across the nodes, but a shard and its replica will not be allocated on nodes with the same rack_id value. The awareness attributes can hold several comma-separated values.
Attributes on which shard allocation will be forced. * is a placeholder for the awareness attribute, which can be defined using the cluster.routing.allocation.awareness.attributes setting. Let's say we configured an awareness attribute zone with the values zone1 and zone2 here, start 2 nodes with node.zone set to zone1, and create an index with 5 shards and 1 replica. The index will be created, but only the 5 primary shards will be allocated (with no replicas). Only when we start more nodes with node.zone set to zone2 will the replicas be allocated.
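A crate.yml sketch combining both settings (the zone attribute and its values are illustrative):

# on each node, e.g. in the first zone:
node.zone: zone1
# cluster-wide awareness configuration:
cluster.routing.allocation.awareness.attributes: zone
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2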
All these values are relative to one another. The first three are used to compose three separate weighting functions into one. The cluster is balanced when no allowed action can bring the weights of the nodes closer together by more than the fourth setting. Actions might not be allowed, for instance, due to forced awareness or allocation filtering.
Defines the weight factor for shards allocated on a node (float). Raising this raises the tendency to equalize the number of shards across all nodes in the cluster.
Defines a weight factor for the number of shards per index allocated on a specific node (float). Increasing this value raises the tendency to equalize the number of shards per index across all nodes in the cluster.
Defines a weight factor for the number of primaries of a specific index allocated on a node (float). Increasing this value raises the tendency to equalize the number of primary shards across all nodes in the cluster.
Minimal optimization value of operations that should be performed (non-negative float). Increasing this value makes the cluster less aggressive about optimizing the shard balance.
Allows control of the allocation of all shards based on include/exclude filters. For example, this could be used to allocate all new shards on nodes with specific IP addresses or custom attributes.
Place new shards only on nodes where one of the specified values matches the attribute, e.g. cluster.routing.allocation.include.zone: "zone1,zone2"
Place new shards only on nodes where none of the specified values matches the attribute, e.g. cluster.routing.allocation.exclude.zone: "zone1"
Used to specify a number of rules, all of which MUST match for a node in order to allocate a shard on it. This is in contrast to include, which will include a node if ANY rule matches.
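For example, requiring an attribute value (the zone attribute and its value are illustrative):

cluster.routing.allocation.require.zone: "zone1"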
Prevents shard allocation on nodes depending on their disk usage.
Defines the lower disk threshold limit for shard allocations. New shards will not be allocated on nodes with disk usage greater than this value. It can also be set to an absolute bytes value (e.g. 500mb) to prevent the cluster from allocating new shards on nodes with less free disk space than this value.
Defines the higher disk threshold limit for shard allocations. The cluster will attempt to relocate existing shards to another node if the disk usage on a node rises above this value. It can also be set to an absolute bytes value (e.g. 500mb) to relocate shards away from nodes with less free disk space than this value.
By default, the cluster retrieves information about the disk usage of the nodes every 30 seconds. This interval can be changed via the cluster.info.update.interval setting.
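A sketch of these disk settings; the watermark key names are not spelled out above and are assumptions, and the values are illustrative:

cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.info.update.interval: 30s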
Limits the number of open concurrent streams when recovering a shard from a peer.
Specifies the chunk size used to copy the shard data from the source shard. It is compressed if indices.recovery.compress is set to true.
Specifies how many transaction log lines should be transferred between shards in a single request during the recovery process. If indices.recovery.translog_size is reached first, this value is ignored for the request.
Specifies how much data of the transaction log should be transferred between shards in a single request during the recovery process. If indices.recovery.translog_op is reached first, this value is ignored for the request.
Defines whether transferred data should be compressed during the recovery process. Setting it to false may lower the pressure on the CPU, but results in more data being transferred over the network.
Specifies the maximum number of bytes that can be transferred during shard recovery per second. Limiting can be disabled by setting it to 0. Similar to indices.recovery.concurrent_streams, this setting allows controlling the network usage of the recovery process. Higher values may result in higher network utilization, but also a faster recovery process.
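A crate.yml sketch of these recovery settings (values are illustrative; the max_bytes_per_sec key name is an assumption):

indices.recovery.concurrent_streams: 3
indices.recovery.compress: true
indices.recovery.max_bytes_per_sec: 100mb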
Allows to throttle merge (or all) processes of the store module.
If throttling is enabled by indices.store.throttle.type, this setting specifies the maximum bytes per second a store module process can operate with.
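For example (the max_bytes_per_sec key name is assumed to follow the indices.store.throttle. prefix named above; the value is illustrative):

indices.store.throttle.type: merge
indices.store.throttle.max_bytes_per_sec: 50mb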
The field data circuit breaker estimates the memory required for loading field data into memory. If a certain limit is reached, an exception is raised.
Specifies the JVM heap limit for the fielddata breaker.
A constant that all field data estimations are multiplied with to determine a final estimation.
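A sketch of both breaker settings (the key names are not spelled out above and are assumptions; values are illustrative):

indices.fielddata.breaker.limit: 80%
indices.fielddata.breaker.overhead: 1.03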
Every node holds several thread pools to improve how threads are managed within the node. The most important pools include:
- index: For index/delete operations, defaults to fixed
- search: For count/search operations, defaults to fixed
- bulk: For bulk operations, defaults to fixed
- refresh: For refresh operations, defaults to cache
fixed holds a fixed number of threads to handle the requests. It also has a queue for pending requests if no threads are available.
cache will spawn a thread if there are pending requests (unbounded).
If the type of a thread pool is set to fixed, there are a few optional settings.
Number of threads.
Size of the queue for pending requests. A value of -1 sets it to unbounded.
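For example, a crate.yml sketch for the index pool (the size and queue_size key names are assumed to correspond to the two settings above; values are illustrative):

threadpool.index.type: fixed
threadpool.index.size: 8
threadpool.index.queue_size: 100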
Defines how often the cluster collects metadata information (e.g. disk usage) if no concrete event triggers it.
Crate Data comes, out of the box, with Log4j 1.2.x. It tries to simplify log4j configuration by using YAML to configure it. The logging configuration file is located at config/logging.yml.
The YAML file is used to prepare a set of properties for logging configuration using the PropertyConfigurator, but without the tedious repetition of the log4j prefix. Here is a small example of a working logging configuration:
rootLogger: INFO, console
logger:
  # log action execution errors for easier debugging
  action: DEBUG

appender:
  console:
    type: console
    layout:
      type: consolePattern
      conversionPattern: "[%d{ISO8601}][%-5p][%-25c] %m%n"
And here is a snippet of the generated properties ready for use with log4j. You get the point.
log4j.rootLogger=INFO, console
log4j.logger.action=DEBUG
log4j.appender.console=org.elasticsearch.common.logging.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.conversionPattern=[%d{ISO8601}][%-5p][%-25c] %m%n
...
Specifies the home directory of the installation. It is used to find default file paths, e.g. config/crate.yml, and the default data directory location. This variable is usually defined in the start-up script shipped with the distribution. In most cases it is the parent directory of the directory containing the bin/crate executable.
CRATE_HOME: Home directory of the Crate Data installation. Used to refer to default config files, data locations, log files, etc. All configured relative paths use this directory as their parent.
This variable specifies the amount of memory that can be used by the JVM. It should be set to at least 50% of the machine's memory.
Certain operations in Crate Data require a lot of records to be held in memory at a time. If the amount of heap that can be allocated by the JVM is too low, these operations will fail with an OutOfMemory exception.
The value of the environment variable can be suffixed with g or m. For example:
CRATE_HEAP_SIZE=4g
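A typical invocation then looks like this (a sketch, assuming a POSIX shell):

sh$ CRATE_HEAP_SIZE=4g ./bin/crate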