Search This Blog

Thursday, October 13, 2011

Infinite Replication Loop

Infinite Replication Loop:

Last week I helped 2 different customers with infinite replication loops. I decided to write a blog post about these infinite loop of binary log statements in MySQL Replication. To explain what they are, how to identify them… and how to fix them.


An infinite replication loop is one or more transactions that are replicating over and over in a Multi Master Replication relationship. This happens because those transactions belong to a server-id which is different from the two actual masters.


Let’ s have a look first at the normal flow:


Replication flow: [1]


When you setup MySQL replication, you define the global variable server_id, this value must be unique for each host. In the illustration above, arrows represent the replication flow, so 1 is replicating to 2 and to 3, and 2 is replicating to 1.


When your active master (having more masters receiving simultaneous writes causes problems and is not beneficial in 99% of all cases) commits a transaction, the DML statements from that transaction are written to the binary log. They are called binary log events. A binary log event contains the server-id where it was executed. Then that event is replicated to the slaves (the I/O slave thread filters out only the events with a different server_id value. It is also possible to change that behaviour by enabling replicate_same_server_id but this not today’s focus[2]).



As we are in multi-master replication, the other master should also replicate events :



Closer look:



 


Infinite Replication Loop flow:


If you have master-master with slaves or if you want to backup one master, you need log_slave_updates[3] to be enabled. This is for example the case if you re using MMM with slaves.


But if for one reason that I explain later, one or more binary events with a different server-id than those of the two current masters are still in the relay log, they are just applied. Then written with their server-id on the binary log, and the slaves (included the other master) will replicate them… and write them in their binary log… and then the loop is created.


Look at the following process, notice that the server-id has changed on our server 2:


And that’s the loop ! The event for server-id 2 gets replicated over and over…


How can you create loops ?



  • You restore a backup from another machine and you forget to change the server-id in my.cnf.

  • You change the server-id interactively, this is an action you should never perform on a slave. “SET GLOBAL server_id =…”

  • You promote a slave as master to replace one of the masters but the replication was still catching up… so some events from the replaced master still had to be executed on the slaves.

  • And recently I saw a weird problem, my.cnf was correct on all the machines, but the value was not ok in show global variables… it was always 1 !! There was a hidden character in my.cnf that was unvalidating the value and then it was set to 1.


 


How to identify an infinite replication loop ?


When you have an infinite replication loop, the value of Seconds_behind_master will have many randomly looking values. The value of Seconds_behind_master is calculated form the timestamp included in the last executed sql statement. As some of them are looping, the timestamp of those statements does not increase, causing a flip of values from 0 to the time those looping statement were first inserted.


Another way to see it is by monitoring your MySQL server with cacti, you will immediately notice the problem. Look at those graphs:



You can see the CPU usage and the replication lag… once again cacti is very helpful !


 


How to fix then ?


There are several possibilities to fix this problem. They are also linked to the situation.



  • If you know that you have only one or two statements in the loop, you can try to skip the statements by stopping the slave, set global SQL_SLAVE_SKIP_COUNTER and start the slave again. This method is not really recommended because no writes should happen during the process.

  • You can change again the server_id on the passive master, but that could also create another loop in case the replication of events with the current server_id didn’t fetch up.

  • There is also the possibility to use IGNORE_SERVER_IDS from CHANGE MASTER[4]

  • The last solution, my favorite one, is to stop the replication on the active master. Then wait until the passive master has executed the looping statements and pick up a position from SHOW MASTER STATUS on the master still replicating, change the master with the write role where replication was stopped and it is fixed ! Of course you must be sure that there are no direct writes to the passive master during the process. As I mentioned before, if both your masters receive writes simultaneously (two active masters) you should review your setup and this method won’t work.


Example on MMM:


1. check the mmm mode and put it to passive to disable any automatic failover during the process because high availability tools like mmm can cause failover if you stop replication


# mmm_control @production set_passive (to avoid automatic failover)


2. stop replication on the master having writer role (mysql1)


3. wait until some statements are replicated


4. change the replication master position on the active master to a newer position from the passive master (get position from show master status\G on mysql 4)


 


Illustration of the recommended solution :


Server 1 is the active master (receive the writes)


Server 4 is the passive master


Server 3 is a normal slave


Step1 Stop the replication on the active master (server 1):


mysql> stop slave;



Step 2: there are writes on the active master, the loop statements are also replicated from the relay-log but they aren’t not replicated again.



 


Step 3: the binary event with the wrong server-id are performed on the two slaves (mysql 3 and mysql 4) and the new events (with server-id 1) are also replicated.



Step 4: the event with server-id 2 on mysql 4 (the passive master) is written in the binary log and the new events are also performed on the two slaves



Step 5: the new statements (with server-id 1) are also saved in the binary-log of mysql 4 and the master position is increased…



Step 6: the slave position on the active master (mysql 1) is changed to the position returned by SHOW MASTER STATUS on mysql 4 (this position will increase every time writes are performed on the active master, so just take a position when you are sure that the looping statements have been executed) and slave is restarted on mysql 1


 



Note: you can also check all the statements that were in the loop using mysqlbinlog.


For example what was the loop of events from server-id 2?


# mysqlbinlog mysql-relay-bin.000x | grep ‘id 2′ | sort | uniq


 


Special thanks to Kenny[5] who helped me on the cases and to have reviewed this post.


[1] http://dev.mysql.com/doc/refman/5.1/en/replication-implementation-details.html


[2] http://dev.mysql.com/doc/refman/5.1/en/replication-options-slave.html#option_mysqld_replicate-same-server-id


[3] http://dev.mysql.com/doc/refman/5.1/en/replication-options-slave.html#option_mysqld_log-slave-updates


[4] http://dev.mysql.com/doc/refman/5.1/en/change-master-to.html


[5] http://www.percona.com/about-us/our-team/kenny-gryp/

No comments:

Post a Comment