Replica sets automate most administrative tasks associated with database replication. Nevertheless, several operations related to deployment and systems management still require administrator intervention. This document provides an overview of those tasks, along with a collection of troubleshooting suggestions for administrators of replica sets.
See also
The following tutorials provide task-oriented instructions for specific administrative tasks related to replica set operation.
All replica sets have a single primary and one or more secondaries. Replica sets allow you to configure secondary members in a variety of ways. This section describes these configurations.
Note
A replica set can have up to 12 members, but only 7 members can have votes. For configuration information regarding non-voting members, see Non-Voting Members.
Warning
The rs.reconfig() shell method can force the current primary to step down, which causes an election. When the primary steps down, the mongod closes all client connections. While this typically takes 10-20 seconds, attempt to make these changes during scheduled maintenance periods. To successfully reconfigure a replica set, a majority of the members must be accessible.
See also
The Elections section in the Replica Set Fundamental Concepts document, and the Election Internals section in the Replica Set Internals and Behaviors document.
The secondary-only configuration prevents a secondary member in a replica set from ever becoming a primary in a failover. You can set secondary-only mode for any member of the set except the current primary.
For example, you may want to configure all members of a replica set located outside of the main data centers as secondary-only to prevent these members from ever becoming primary.
To configure a member as secondary-only, set its priority value to 0. Any member with a priority equal to 0 will never seek election and cannot become primary in any situation. For more information on priority levels, see Member Priority.
Note
When updating the replica configuration object, address all members of the set using the index value in the array. The array index begins with 0. Do not confuse this index value with the value of the _id field in each document in the members array.
The _id rarely corresponds to the array index.
As an example of modifying member priorities, assume a four-member replica set. Use the following sequence of operations in the mongo shell to modify member priorities:
cfg = rs.conf()
cfg.members[0].priority = 2
cfg.members[1].priority = 1
cfg.members[2].priority = 0.5
cfg.members[3].priority = 0
rs.reconfig(cfg)
This reconfigures the set with the following priority settings: member 0 has a priority of 2, member 1 has a priority of 1, member 2 has a priority of 0.5, and member 3 has a priority of 0 and can never become primary.
Note
If your replica set has an even number of members, add an arbiter to ensure that members can quickly obtain a majority of votes in an election for primary.
Note
The current primary cannot be assigned a priority of 0. If you want to prevent the current primary from being elected primary again, you must demote it using rs.stepDown() and then set the appropriate priority with the rs.reconfig() method.
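For example, a minimal sketch of this sequence in the mongo shell, assuming the former primary sits at index 1 of the members array and using an illustrative 120-second step-down period:

rs.stepDown(120)
// reconnect to the new primary, then:
cfg = rs.conf()
cfg.members[1].priority = 0
rs.reconfig(cfg)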
Delayed members copy and apply operations from the primary’s oplog with a specified delay. If a member has a delay of one hour, then the latest entry in this member’s oplog will not be more recent than one hour old, and the state of data for the member will reflect the state of the set an hour earlier.
Example
If the current time is 09:52 and the secondary is delayed by an hour, no operation will be more recent than 08:52.
Delayed members may help recover from various kinds of human error, such as inadvertently deleted databases or botched application upgrades. When determining the amount of slave delay to apply, consider that the delay window must fit within the oplog (see the warning below) and must be long enough for you to detect and respond to the error.
Delayed members must have a priority set to 0 to prevent them from becoming primary in their replica sets. Also, these members should be hidden to prevent your application from seeing or querying them.
To configure a replica set member with a one hour delay, use the following sequence of operations in the mongo shell:
cfg = rs.conf()
cfg.members[0].priority = 0
cfg.members[0].slaveDelay = 3600
rs.reconfig(cfg)
After the replica set reconfigures, the first member of the set in the members array will have a priority of 0 and cannot become primary. The slaveDelay value delays both replication and the member's oplog by 3600 seconds (1 hour). Setting slaveDelay to a non-zero value also sets hidden to true for this replica set member so that it does not receive application queries in normal operations.
Warning
The length of the secondary slaveDelay must fit within the window of the oplog. If the oplog is shorter than the slaveDelay window, the delayed member cannot successfully replicate operations.
See also
slaveDelay, Replica Set Reconfiguration, Oplog, Changing Oplog Size in this document, and the Change the Size of the Oplog tutorial.
Arbiters are special mongod instances that do not hold a copy of the data and thus cannot become primary. Arbiters exist solely to participate in elections.
Note
Because of their minimal system requirements, you may safely deploy an arbiter on a system with another workload, such as an application server or monitoring host.
Warning
Do not run arbiter processes on a system that is an active primary or secondary of its replica set.
Arbiters never receive the contents of any collection but do have the following interactions with the rest of the replica set:
Credential exchanges that authenticate the arbiter with the replica set. All MongoDB processes within a replica set use keyfiles. These exchanges are encrypted.
MongoDB only transmits the authentication credentials in a cryptographically secure exchange, and encrypts no other exchange.
Exchanges of replica set configuration data and of votes. These are not encrypted.
If your MongoDB deployment uses SSL, then all communications between arbiters and the other members of the replica set are secure. See the documentation for Connect to MongoDB with SSL for more information. As with all MongoDB components, run arbiters on secure networks.
To add an arbiter, see Adding an Arbiter.
You may choose to change the number of votes that each member has in elections for primary. In general, all members should have only 1 vote to prevent intermittent ties, deadlock, or the wrong members from becoming primary. Use replica set priorities to control which members are more likely to become primary.
To disable a member’s ability to vote in elections, use the following command sequence in the mongo shell.
cfg = rs.conf()
cfg.members[3].votes = 0
cfg.members[4].votes = 0
cfg.members[5].votes = 0
rs.reconfig(cfg)
This sequence gives 0 votes to the fourth, fifth, and sixth members of the set according to the order of the members array in the output of rs.conf(). This setting allows the set to elect these members as primary but does not allow them to vote in elections. If you have three non-voting members, you can add three additional voting members to your set. Place voting members so that your designated primary or primaries can reach a majority of votes in the event of a network partition.
Note
In general and when possible, all members should have only 1 vote. This prevents intermittent ties, deadlocks, or the wrong members from becoming primary. Use Replica Set Priorities to control which members are more likely to become primary.
New in version 2.0.
Chained replication occurs when a secondary member replicates from another secondary member instead of from the primary. This might be the case, for example, if a secondary selects its replication target based on ping time and if the closest member is another secondary.
Chained replication can reduce load on the primary. But chained replication can also result in increased replication lag, depending on the topology of the network.
Beginning with version 2.2.2, you can use the chainingAllowed setting in Replica Set Configuration to disable chained replication for situations where chained replication is causing lag. For details, see Manage Chained Replication.
This section gives overview information on a number of replica set administration procedures. You can find documentation of additional procedures in the replica set tutorials section.
Before adding a new member to an existing replica set, do one of the following to prepare the new member’s data directory:
Make sure the new member’s data directory does not contain data. The new member will copy the data from an existing member.
If the new member is in a recovering state, it must exit and become a secondary before MongoDB can copy all data as part of the replication process. This process takes time but does not require administrator intervention.
Manually copy the data directory from an existing member. The new member becomes a secondary member and will catch up to the current state of the replica set after a short interval. Copying the data over manually shortens the amount of time for the new member to become current.
Ensure that you can copy the data directory to the new member and begin replication within the window allowed by the oplog. If the time between when the data was copied and when the new member begins replicating exceeds the window of the oplog on the existing members, then the new instance will have to perform an initial sync, which completely resynchronizes the data, as described in Resyncing a Member of a Replica Set.
Use db.printReplicationInfo() to check the current state of replica set members with regards to the oplog.
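For example, run the following in a mongo shell connected to any member of the set:

db.printReplicationInfo()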
For the procedure to add a member to a replica set, see Add Members to a Replica Set.
You may remove a member of a replica set at any time; however, for best results always shut down the mongod instance before removing it from a replica set.
Changed in version 2.2: Before 2.2, you had to shut down the mongod instance before removing it. While 2.2 removes this requirement, it remains good practice.
To remove a member, use the rs.remove() method in the mongo shell while connected to the current primary. Issue the db.isMaster() command when connected to any member of the set to determine the current primary. Use a command in either of the following forms to remove the member:
rs.remove("mongo2.example.net:27017")
rs.remove("mongo3.example.net")
This operation disconnects the shell briefly and forces a re-connection as the replica set renegotiates which member will be primary. The shell displays an error even if this command succeeds.
You can re-add a removed member to a replica set at any time using the procedure for adding replica set members. Additionally, consider using the replica set reconfiguration procedure to change the host value to rename a member in a replica set directly.
Use this procedure to replace a member of a replica set when the hostname has changed. This procedure preserves all existing configuration for a member, except its hostname/location.
You may need to replace a replica set member if you want to replace an existing system and only need to change the hostname rather than completely replace all configured options related to the previous member.
Use rs.reconfig() to change the value of the host field to reflect the new hostname or port number. rs.reconfig() will not change the value of _id.
cfg = rs.conf()
cfg.members[0].host = "mongo2.example.net:27019"
rs.reconfig(cfg)
To change the value of the priority in the replica set configuration, use the following sequence of commands in the mongo shell:
cfg = rs.conf()
cfg.members[0].priority = 0.5
cfg.members[1].priority = 2
cfg.members[2].priority = 2
rs.reconfig(cfg)
The first operation uses rs.conf() to set the local variable cfg to the contents of the current replica set configuration, which is a document. The next three operations change the priority value in the cfg document for the first three members configured in the members array. The final operation calls rs.reconfig() with the argument of cfg to initialize the new configuration.
Note
When updating the replica configuration object, address all members of the set using the index value in the array. The array index begins with 0. Do not confuse this index value with the value of the _id field in each document in the members array.
The _id rarely corresponds to the array index.
If a member has priority set to 0, it is ineligible to become primary and will not seek election. Hidden members, delayed members, and arbiters all have priority set to 0.
All members have a priority equal to 1 by default.
The value of priority can be any floating point (i.e. decimal) number between 0 and 1000. Priorities are only used to determine the preference in election. The priority value is used only in relation to other members. With the exception of members with a priority of 0, the absolute value of the priority value is irrelevant.
Replica sets will preferentially elect and maintain the primary status of the member with the highest priority setting.
Warning
Replica set reconfiguration can force the current primary to step down, leading to an election for primary in the replica set. Elections cause the current primary to close all open client connections.
Perform routine replica set reconfiguration during scheduled maintenance windows.
See also
The Replica Reconfiguration Usage example revolves around changing the priorities of the members of a replica set.
For a description of arbiters and their purpose in replica sets, see Arbiters.
To prevent tied elections, do not add an arbiter to a set if the set already has an odd number of voting members.
Because arbiters do not hold a copy of collection data, they have minimal resource requirements and do not require dedicated hardware.
Create a data directory for the arbiter. The mongod uses this directory for configuration information. It will not hold database collection data. The following example creates the /data/arb data directory:
mkdir /data/arb
Start the arbiter, making sure to specify the replica set name and the data directory. Consider the following example:
mongod --port 30000 --dbpath /data/arb --replSet rs
In a mongo shell connected to the primary, add the arbiter to the replica set by issuing the rs.addArb() method, which uses the following syntax:
rs.addArb("<hostname>:<port>")
For example, if the arbiter runs on m1.example.net:30000, you would issue this command:
rs.addArb("m1.example.net:30000")
To override the default sync target selection logic, you may manually configure a secondary member's sync target to temporarily pull oplog entries from a specific member. The following operations provide access to this functionality: the replSetSyncFrom database command and the rs.syncFrom() helper in the mongo shell.
Only modify the default sync logic as needed, and always exercise caution. rs.syncFrom() will not affect an in-progress initial sync operation. To affect the sync target for the initial sync, run the rs.syncFrom() operation before initial sync.
If you run rs.syncFrom() during initial sync, MongoDB produces no error messages, but the sync target will not change until after the initial sync operation.
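For example, to temporarily sync from a specific member, you could issue either of the following from a mongo shell connected to the secondary; the hostname and port are illustrative:

rs.syncFrom("m2.example.net:27017")
db.adminCommand( { replSetSyncFrom: "m2.example.net:27017" } )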
Note
replSetSyncFrom and rs.syncFrom() provide a temporary override of default behavior. If:
the mongod instance restarts,
the connection to the sync target closes, or
Changed in version 2.4: The sync target falls more than 30 seconds behind another member of the replica set;
then the mongod instance will revert to the default sync logic and target.
New in version 2.2.2.
MongoDB enables chained replication by default. This procedure describes how to disable it and how to re-enable it.
To disable chained replication, set the chainingAllowed field in Replica Set Configuration to false.
You can use the following sequence of commands to set chainingAllowed to false:
Copy the configuration settings into the cfg object:
cfg = rs.config()
Take note of whether the current configuration settings contain the settings sub-document. If they do, skip this step.
Warning
To avoid data loss, skip this step if the configuration settings contain the settings sub-document.
If the current configuration settings do not contain the settings sub-document, create the sub-document by issuing the following command:
cfg.settings = { }
Issue the following sequence of commands to set chainingAllowed to false:
cfg.settings.chainingAllowed = false
rs.reconfig(cfg)
To re-enable chained replication, set chainingAllowed to true. You can use the following sequence of commands:
cfg = rs.config()
cfg.settings.chainingAllowed = true
rs.reconfig(cfg)
Note
If chained replication is disabled, you still can use replSetSyncFrom to specify that a secondary replicates from another secondary. But that configuration will last only until the secondary recalculates which member to sync from.
The following is an overview of the procedure for changing the size of the oplog. For a detailed procedure, see the Change the Size of the Oplog tutorial.
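In outline, and only as a sketch (the tutorial contains the authoritative steps), resizing the oplog involves restarting the member as a standalone on a different port, preserving the last oplog entry, recreating the oplog at the new size, and then restarting the member with its usual replica set options. The shell portion of that sketch, with an illustrative new size of 2GB, resembles the following:

use local
// save the newest oplog entry in a temporary collection
db.temp.save( db.oplog.rs.find( { }, { ts: 1, h: 1 } ).sort( { $natural : -1 } ).limit(1).next() )
// drop the old oplog and recreate it as a capped collection of the new size
db.oplog.rs.drop()
db.runCommand( { create: "oplog.rs", capped: true, size: (2 * 1024 * 1024 * 1024) } )
// restore the saved entry so replication can resume from the correct point
db.oplog.rs.save( db.temp.findOne() )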
When a secondary's replication process falls so far behind that the primary overwrites oplog entries the secondary has not yet replicated, that secondary cannot catch up and becomes "stale." When that occurs, you must completely resynchronize the member by removing its data and performing an initial sync.
To do so, use one of the following approaches:
Restart the mongod with an empty data directory and let MongoDB's normal initial syncing feature restore the data. This is the simpler option, but may take longer to replace the data.
Restart the machine with a copy of a recent data directory from another member in the replica set. This procedure can replace the data more quickly but requires more manual steps.
This procedure relies on MongoDB's regular process for initial sync. This will restore the data on the stale member to reflect the current state of the set. For an overview of the MongoDB initial sync process, see the Syncing section.
To resync the stale member:
Stop the stale member's mongod instance. On Linux systems you can use mongod --shutdown. Set --dbpath to the member's data directory, as in the following:
mongod --dbpath /data/db/ --shutdown
Delete all data and sub-directories from the member's data directory. By removing the data in the dbpath, MongoDB will perform a complete resync. Consider making a backup first.
Restart the mongod instance on the member. For example:
mongod --dbpath /data/db/ --replSet rsProduction
At this point, the mongod will perform an initial sync. The length of the initial sync process depends on the size of the database and the network connection between members of the replica set.
Initial sync operations can impact the other members of the set and create additional traffic to the primary, and can only occur if another member of the set is accessible and up to date.
This approach uses a copy of the data files from an existing member of the replica set, or a backup of the data files, to "seed" the stale member.
The copy or backup of the data files must be sufficiently recent to allow the new member to catch up with the oplog, otherwise the member would need to perform an initial sync.
Note
In most cases you cannot copy data files from a running mongod instance to another instance, because the data files will change during the file copy operation. See the backup strategies documentation for several methods that you can use to capture a consistent snapshot of a running mongod instance.
After you have copied the data files from the “seed” source, start the mongod instance and allow it to apply all operations from the oplog until it reflects the current state of the replica set.
In most cases, the most effective ways to control access and to secure the connection between members of a replica set depend on network-level access control. Use your environment’s firewall and network routing to ensure that traffic only from clients and other replica set members can reach your mongod instances. If needed, use virtual private networks (VPNs) to ensure secure connections over wide area networks (WANs.)
Additionally, MongoDB provides an authentication mechanism for mongod and mongos instances connecting to replica sets. To use it, enable authentication on these instances and specify a shared key file that serves as a shared password.
New in version 1.8: Added support for authentication in replica set deployments.
Changed in version 1.9.1: Added support for authentication in sharded replica set deployments.
To enable authentication, add the following option to your configuration file:
keyFile = /srv/mongodb/keyfile
Note
You may choose to set these run-time configuration options using the mongod --keyFile (or mongos --keyFile) options on the command line.
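For example, a minimal command-line sketch for starting a member with a key file; the replica set name rs0 and the file path are illustrative:

mongod --replSet rs0 --keyFile /srv/mongodb/keyfile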
Setting keyFile enables authentication and specifies a key file for the replica set members to use when authenticating to each other. The content of the key file is arbitrary but must be the same on all members of the replica set and on all mongos instances that connect to the set.
The key file must be less than one kilobyte in size and may only contain characters in the base64 set. The key file must not have group or "world" permissions on UNIX systems. Use the following command to use the OpenSSL package to generate "random" content for use in a key file:
openssl rand -base64 753
Note
Key file permissions are not checked on Windows systems.
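As an illustration, the following commands generate a key file and restrict its permissions on a UNIX-like system; the path is an assumption and should match the keyFile setting above:

openssl rand -base64 753 > /srv/mongodb/keyfile
chmod 600 /srv/mongodb/keyfile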
This section describes common strategies for troubleshooting replica sets.
To display the current state of the replica set and the current state of each member, run the rs.status() method in a mongo shell connected to the replica set's primary. For descriptions of the information displayed by rs.status(), see the Replica Set Status Reference.
Note
The rs.status() method is a wrapper that runs the replSetGetStatus database command.
Replication lag is a delay between an operation on the primary and the application of that operation from the oplog to the secondary. Replication lag can be a significant issue and can seriously affect MongoDB replica set deployments. Excessive replication lag makes “lagged” members ineligible to quickly become primary and increases the possibility that distributed read operations will be inconsistent.
To check the current length of replication lag:
In a mongo shell connected to the primary, call the db.printSlaveReplicationInfo() method.
The returned document displays the syncedTo value for each member, which shows you when each member last read from the oplog, as shown in the following example:
source: m1.example.net:30001
syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT)
= 7475 secs ago (2.08hrs)
source: m2.example.net:30002
syncedTo: Tue Oct 02 2012 11:33:40 GMT-0400 (EDT)
= 7475 secs ago (2.08hrs)
Note
The rs.status() method is a wrapper around the replSetGetStatus database command.
Monitor the rate of replication by watching the oplog time in the “replica” graph in the MongoDB Monitoring Service. For more information see the documentation for MMS.
Possible causes of replication lag include:
Network Latency
Check the network routes between the members of your set to ensure that there is no packet loss or network routing issue.
Use tools including ping to test latency between set members and traceroute to expose the routing of packets between network endpoints.
Disk Throughput
If the file system and disk device on the secondary is unable to flush data to disk as quickly as the primary, then the secondary will have difficulty keeping state. Disk-related issues are incredibly prevalent on multi-tenant systems, including virtualized instances, and can be transient if the system accesses disk devices over an IP network (as is the case with Amazon's EBS system).
Use system-level tools to assess disk status, including iostat or vmstat.
Concurrency
In some cases, long-running operations on the primary can block replication on secondaries. For best results, configure write concern to require confirmation of replication to secondaries, as described in Write Concern. This prevents write operations from returning if replication cannot keep up with the write load.
Use the database profiler to see if there are slow queries or long-running operations that correspond to the incidences of lag.
Appropriate Write Concern
If you are performing a large data ingestion or bulk load operation that requires a large number of writes to the primary, particularly with unacknowledged write concern, the secondaries will not be able to read the oplog fast enough to keep up with changes.
To prevent this, require write acknowledgment or journaled write concern after every 100, 1,000, or another arbitrary number of operations to provide an opportunity for secondaries to catch up with the primary, as sketched below.
For more information, see the write concern documentation.
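As a sketch of this pattern in the mongo shell, the following 2.2-era loop pauses after every 1,000 inserts until a majority of members acknowledge the batch; the collection name, batch size, and timeout are illustrative:

for (var i = 1; i <= 10000; i++) {
    db.bulkload.insert( { n: i } );
    if (i % 1000 == 0) {
        // block until a majority of members have replicated the batch
        db.runCommand( { getLastError: 1, w: "majority", wtimeout: 5000 } );
    }
}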
All members of a replica set must be able to connect to every other member of the set to support replication. Always verify connections in both "directions." Networking topologies and firewall configurations can prevent normal and required connectivity, which can block replication.
Consider the following example of a bidirectional test of networking:
Example
Given a replica set with three members running on three separate hosts: m1.example.net, m2.example.net, and m3.example.net.
Test the connection from m1.example.net to the other two hosts with the following operation set from m1.example.net:
mongo --host m2.example.net --port 27017
mongo --host m3.example.net --port 27017
Test the connection from m2.example.net to the other two hosts with the following operation set from m2.example.net, as in:
mongo --host m1.example.net --port 27017
mongo --host m3.example.net --port 27017
You have now tested the connection between m2.example.net and m1.example.net in both directions.
Test the connection from m3.example.net to the other two hosts with the following operation set from the m3.example.net host, as in:
mongo --host m1.example.net --port 27017
mongo --host m2.example.net --port 27017
If any connection, in any direction, fails, check your networking and firewall configuration and reconfigure your environment to allow these connections.
A larger oplog can give a replica set a greater tolerance for lag, and make the set more resilient.
To check the size of the oplog for a given replica set member, connect to the member in a mongo shell and run the db.printReplicationInfo() method.
The output displays the size of the oplog and the date ranges of the operations contained in the oplog. In the following example, the oplog is about 10MB and is able to fit about 26 hours (94400 seconds) of operations:
configured oplog size: 10.10546875MB
log length start to end: 94400 (26.22hrs)
oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
oplog last event time: Wed Oct 03 2012 14:59:10 GMT-0400 (EDT)
now: Wed Oct 03 2012 15:00:21 GMT-0400 (EDT)
The oplog should be long enough to hold all transactions for the longest downtime you expect on a secondary. At a minimum, an oplog should be able to hold 24 hours of operations; however, many users prefer to have 72 hours or even a week's worth of operations.
For more information on how oplog size affects operations, see the Oplog, Delayed Members, and replication lag discussions in this document.
Note
You normally want the oplog to be the same size on all members. If you resize the oplog, resize it on all members.
To change oplog size, see Changing Oplog Size in this document or see the Change the Size of the Oplog tutorial.
Replica sets feature automated failover. If the primary goes offline or becomes unresponsive and a majority of the original set members can still connect to each other, the set will elect a new primary.
While failover is automatic, replica set administrators should still understand exactly how this process works. The sections below describe failover in detail.
In most cases, failover occurs without administrator intervention seconds after the primary either steps down, becomes inaccessible, or becomes otherwise ineligible to act as primary. If your MongoDB deployment does not fail over according to expectations, consider possible operational errors, such as network partitions that prevent the remaining members from forming a majority (see the discussion of network partitions below).
In many senses, rollbacks represent a graceful recovery from an impossible failover and recovery situation.
Rollbacks occur when a primary accepts writes that other members of the set do not successfully replicate before the primary steps down. When the former primary begins replicating again, it performs a "rollback." Rollbacks remove those operations from the instance that were never replicated to the set so that the data set is in a consistent state. The mongod program writes rolled back data to a BSON file that you can view using bsondump and apply manually using mongorestore.
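As an illustration, rolled back documents are written to a rollback directory under the member's dbpath; the file name below is hypothetical:

bsondump /data/db/rollback/test.sample.2012-10-03T15-00-21.0.bson
mongorestore --db test --collection sample /data/db/rollback/test.sample.2012-10-03T15-00-21.0.bson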
You can prevent rollbacks by using a replica acknowledged write concern, which requires not only the primary but also additional members of the set (up to a majority) to confirm a write operation before returning. For more information, see the write concern documentation.
See also
The Elections section in the Replica Set Fundamental Concepts document, and the Election Internals section in the Replica Set Internals and Behaviors document.
Consider the following error in mongod output and logs:
replSet error fatal couldn't query the local local.oplog.rs collection. Terminating mongod after 30 seconds.
<timestamp> [rsStart] bad replSet oplog entry?
Often, an incorrectly typed value in the ts field in the last oplog entry causes this error. The correct data type is Timestamp.
Check the type of the ts value using the following two queries against the oplog collection:
db = db.getSiblingDB("local")
db.oplog.rs.find().sort({$natural:-1}).limit(1)
db.oplog.rs.find({ts:{$type:17}}).sort({$natural:-1}).limit(1)
The first query returns the last document in the oplog, while the second returns the last document in the oplog where the ts value is a Timestamp. The $type operator allows you to select BSON type 17, which is the Timestamp data type.
If the queries don’t return the same document, then the last document in the oplog has the wrong data type in the ts field.
Example
If the first query returns this as the last oplog entry:
{ "ts" : {t: 1347982456000, i: 1},
"h" : NumberLong("8191276672478122996"),
"op" : "n",
"ns" : "",
"o" : { "msg" : "Reconfig set", "version" : 4 } }
And the second query returns this as the last entry where ts has the Timestamp type:
{ "ts" : Timestamp(1347982454000, 1),
"h" : NumberLong("6188469075153256465"),
"op" : "n",
"ns" : "",
"o" : { "msg" : "Reconfig set", "version" : 3 } }
Then the value for the ts field in the last oplog entry is of the wrong data type.
To set the proper type for this value and resolve this issue, use an update operation that resembles the following:
db.oplog.rs.update( { ts: { t:1347982456000, i:1 } },
{ $set: { ts: new Timestamp(1347982456000, 1)}})
Modify the timestamp values as needed based on your oplog entry. This operation may take some time to complete because the update must scan and pull the entire oplog into memory.
The duplicate key on local.slaves error occurs when a secondary or slave changes its hostname and the primary or master tries to update its local.slaves collection with the new name. The update fails because it contains the same _id value as the document containing the previous hostname. The error itself will resemble the following:
exception 11000 E11000 duplicate key error index: local.slaves.$_id_ dup key: { : ObjectId('<object ID>') } 0ms
This is a benign error and does not affect replication operations on the secondary or slave.
To prevent the error from appearing, drop the local.slaves collection from the primary or master, with the following sequence of operations in the mongo shell:
use local
db.slaves.drop()
The next time a secondary or slave polls the primary or master, the primary or master recreates the local.slaves collection.
Members on either side of a network partition cannot see each other when determining whether a majority is available to hold an election.
That means that if a primary steps down and neither side of the partition has a majority on its own, the set will not elect a new primary and the set will become read only. To avoid this situation, attempt to place a majority of instances in one data center with a minority of instances in a secondary facility.