I analyzed three Big Data migration strategies performed at Twitter, Facebook and Netflix. As well as the one we did at eHarmony. And found some commonality amongst them which I distilled down to a repeatable strategy. This strategy consists of a five step process that migrates Big Data from a source to a target database:
- Replication to Staging
- Bulk Copy to Target
- Downtime at Source
- Merge Staging with Bulk Copy
- Go Live on Target
Let’s go over some key points of the RCDML process flow:
- The Replication starts right before the Bulk Copy.
- Replication stream is directed into the Staging Database.
- Bulk Copy is pushed directly into the Target Database.
- The reason for the above split is to ensure the checkpoints and validations are isolated to each data set.
- Optional Downtime begins immediately after the Bulk Copy, and it ends with Merge.
- Validation should happen before and after the Merge ensuring that record counts match.
- During the Merge, replicated data from staging overwrites the bulk copied data.
- Go Live.
My conclusion is that RCDML process can work at any scale. In other words – the downtime will be proportional to the amount of data generated by the live changes and not the actual Big Data. That’s because the Bulk Copy takes place outside of the downtime period.
Obviously – the Bulk Copy should be completed as fast as possible to keep the replicated data set small. This reduces the time it takes to Merge the two data sets together later. However, since most Big Data Platforms have a high performance Bulk Loading tools – we can parallelize them and achieve a good throughout keeping the replication window small.
There is also something else very interesting … You see, I used to think that Replication always belonged to the DBA team. And that DBAs should just pick an off the shelf replication tool and bolt it on to the source database. However, what I learned is actually quite the opposite, so let’s talk about this in the next chapter.
I prefer to develop a queue based Replication process and place it directly into the main Data Service and channel all writes to the data layer through it. This creates an embedded high availability infrastructure that automatically evolves together with the Data Service and never lags behind.
The Replication Dispatch is where you plug-in the replication queues. Each queue can be autonomously switched Off to shut down the data flow to the database during a maintenance window. Upon completion of the maintenance the queue is turned On to process the replication backlog.
An alternative method would be to bolt-on a separate replication mechanism to an existing production database outside of the main data path. However be aware of the pitfall in doing this! Your development team will most likely be distant from implementation and maintenance of this bolt-on replication. Instead, it will be handed over to the DBA/Infrastructure team. And subtle nuances will slip through the cracks potentially causing data issues down the line.
Your DBA/Infrastructure team will also be fully dependent on the third party support services for this replication tool. And this will drain your team of the invaluable problem solving skills and instead turn them into remote hands for the replication vendor.
I believe that a successful replication process should be developed in-house and placed directly in the hands of the core data services team. And it should be maintained by this development team and not the Infrastructure/DBA team.
Regardless on the Replication method you choose – it should be turned On right before the Bulk Copy begins to ensure some overlap exists in the data. This overlap will be resolved during the Merge process.
I hope that I convinced you that the replication should be handled by the core data services dev team. But, what about Bulk Copy? Who should be responsible for the Bulk Copy? DBAs or Developers? Let’s find out!
It would be impossible to cover all of the bulk copy tools here because there are so many database platforms you might be migrating from and to. Instead, I’d like to focus on the fundamental principles of getting a consistent Big Data snapshot and moving it from database A to database B.
However, I still think it’s valuable to present a concrete visual example to cover these principles. And since my Big Data Migration Strategy experience is with an Oracle RDBMS on a SAN based shared storage – let’s use that as a baseline for this chapter:
Before we go over the above Bulk Copy process flow, I’d like to emphasize that Replication has to be On before the Bulk Copy begins. This ensures that there is some overlap in the two data sets which we resolve during the Merge phase of the RCDML process.
The objective is to copy a very Big Data set from a Source Database PROD to a target database TARGET. And the fundamental principles to follow are as follows:
- Consistent Data Set
- Low Impact on Source
- Transform only on Target
- One Hop Data Transfer
- Isolate Bulk Copy and Replication
Consistent Data Set
It’s imperative to extract a fully consistent data set from the source database. This will not only make the data validation phase easier, but actually possible. Because you need to be able to trust your extraction process. And a basic verification that the number of rows between source and target match is both simple and invaluable.
Low Impact on Source
The bulk copy tools are fully capable of taking 100% I/O bandwidth at the source database. For this reason it’s important to find the right balance of throughput and I/O utilization by throttling the bulk copy process.
I prefer to monitor response times directly at the source database storage subsystem. The goal here is to not let it go above accepted baseline such as 10ms. Which brings us to the second interesting point …
I used to think that Bulk Copy belong to the ETL development team. Today, I think it really belongs to the DBA team. And it’s because DBAs are closer to the storage and database and can clearly see the impact the Bulk Copy tools have on these components.
Sure, we can have DBAs and Developers work together on this, but, in reality, unless they sit in the same room – it never works. There needs to be 100% ownership established for anything to be accomplished properly.
So instead of creating a finger pointing scenario it’s best to ensure that the fingers are always pointed back at us. If there is no one else to blame – stuff gets magically done!
Transform only on Target
It’s often necessary to transform the data during migration. After all, how often do you get a chance to touch every single row and fix the data model?
There are two places where this transformation can be accomplished – during extract or reload. I think it’s best to leave the extracted data set in exact same state as it was found in the source database. And instead – do all transformation during reload, directly on the target database.
Doing reload on the target database sets up an iterative development cycle of the transformation rules. And removes the need for a fresh source extract for each cycle. It will also make the data validation process possible because extracted data will mirror the source.
One Hop Data Transfer
Physically moving the Big Data is a monumental task. And for this reason the unload processshould deposit the data someplace where the reload process can read it directly. Setting up this One Hop Data Transfer is best through a Network Attached Storage (NAS) via a 10 Gigabit Ethernet network which is now widely available.
Isolate Bulk Copy and Replication
It’s very important to measure each data set independently using a simple count. And then, count the overlap between them. This will make the validation of the Merge process as simple as adding the two and subtracting the third (overlap).
On the other hand, if the replication is a bolt-on process reading transactional logs from the primary database, then the only option is to shutdown and do the merge during a real outage.
That’s because the writes and reads in this case go directly into the primary database and not through a switched data service. And as such, require a hard-stop, configuration change to point to a different database. That is, if the objective is to switch to the target database following the merge.
Next, it’s most often faster and easier to delete the overlap from the Bulk Copy set and then simply insert/append the Replication set into it.
Finally, a validation of the counts either gives the green light for the Go-Live or makes this run just another rehearsal.
Speaking of rehearsals – there should be at least three, with the final rehearsal one week within the Go-Live schedule.
Having the last rehearsal as close to production date as possible ensures that the data growth variance is accounted for in the final schedule. And it pins the migration process in the team’s collective memory.
Specifically, if the replication is queued, switched and built into the core data service, then going live is as simple as setting the new database as primary in the dispatch configuration. This also means there was no downtime during the merge.
On the other hand, if the replication is a bolt-on-read-transaction-logs type, then the Downtime is already active. And we need to update configuration files for each individual service pointing them to the new database.
- There are 5 phases of the RCDML Big Data migration process, and with careful planning it’s possible to make it work at any scale.
- The two most critical components of any Big Data migration strategy are Replication and Bulk Copy, and giving the responsibility for these components to the right team can directly effect the Downtime and Go Live schedule.
If you found this article helpful and would like to receive more like it as soon as I release them make sure to sign up to my newsletter below:SUBSCRIBE
Vitaliy Mogilevskiy January 31, 2016
Posted In: Operations