diff --git a/website/blog/2023-4-1-incrementals-pt1.md b/website/blog/2023-4-1-incrementals-pt1.md new file mode 100644 index 000000000..40869b6e5 --- /dev/null +++ b/website/blog/2023-4-1-incrementals-pt1.md @@ -0,0 +1,139 @@ +--- +slug: incremental-backups-pt1 +title: "Speeding up Microsoft 365 backups with delta tokens" +description: "Recent additions to Corso have reduced the duration of backups after the +first backup by taking advantage of Microsoft’s delta query API. Doing so allows +Corso to retrieve only the changes to the user’s data since the last backup +instead of having to retrieve all items with the Graph API. However, +implementing backups in this manner required us to play a few tricks with the +Corso implementation, so we thought we’d share them here." +authors: amartinez +tags: [corso, microsoft 365, backups] +date: 2023-4-1 +image: ./images/incremental-encoder.jpg +--- + +![By © Raimond Spekking / CC BY-SA 4.0 (via Wikimedia Commons), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=75914553](./images/incremental-encoder.jpg) + +Full Microsoft 365 backups can take a long time, especially since Microsoft +throttles the number of requests an application can make in a given window of +time. Recent additions to Corso have reduced the duration of backups after the +first backup by taking advantage of Microsoft’s delta query API. Doing so allows +Corso to retrieve only the changes to the user’s data since the last backup +instead of having to retrieve all items with the Graph API. However, +implementing backups in this manner required us to play a few tricks with the +Corso implementation, so we thought we’d share them here. + + + +## Background + +Before we dive into the details of how incremental backups work, it’s useful to +have some knowledge of how delta queries work in the Microsoft Graph API and how +data is laid out in Corso backups. + +### Microsoft delta queries + +Microsoft provides a delta query API that allows developers to get only the +changes to the endpoint since the last query was made. The API represents the +idea of the “last query” with an opaque token that is returned when the set of +items is done being listed. For example, if a developer wants to get a delta +token for a specific email folder, the developer would first list all the items +in the folder using the delta endpoint. On the final page of item results from +the endpoint, the Graph API would return a token that could be used to retrieve +future updates. + +All returned tokens represent a point in time and are independent from each +other. This means that getting token a1 at time t1, making some changes, and +then getting another token a2 at time t2 would give distinct tokens. Requesting +the changes from token a1 would always give the changes made after time t1 +including those after time t2. Requesting changes from token a2 would give only +the changes made after time t2. Tokens eventually expire though, so waiting a +long time between backups (e.x. weeks) may cause all items to be enumerated +again. + +## Corso full backups, incremental backups, and backup layout + +Before we get into the nuts and bolts of how Corso uses the Microsoft delta +query API, it’s important to first define what’s in a backup and the terminology +we’ll be using throughout this post. + +### Backup layout + +Internally, a single Corso backup consists of three main parts: a kopia manifest +that Corso uses as the root object of the backup (BackupModel), a kopia snapshot +of indexing information for Corso, and a kopia snapshot of the item data in the +backup. The BackupModel contains summary information about the status of the +backup (did it have errors, how many items did it backup, etc) and pointers to +the two snapshots that contain information. + +The snapshot with indexing information contains the data output during a +`corso backup details` command and is used to filter the set of restored items +during restore commands. The snapshot contains one entry for every backed up +M365 item in the backup. + +The snapshot of item data contains the raw bytes that Corso backed up from M365. +Internally, Corso uses a file hierarchy in kopia that closely mirrors the layout +of the data in M365. For example, if the user has a file in the OneDrive folder +`work/important` then Corso creates a kopia path +`/onedrive//files//root/work/important` for that +file. + +Corso also stores a few extra bits of metadata to help with incremental backups. +Most importantly, it stores the Graph API’s delta tokens retrieved during the +backup process as well as a mapping relating the current M365 folder IDs to +their paths. This information is stored with different path prefixes (ex. uses +`onedriveMetadata` instead of `onedrive`) to make it easy to separate out from +backed up item data. + +### Terminology + +*Full backups* are backups where all of the data being backed up is fetched from +M365 with the Graph API. These backups may take a long time to complete (we’ve +seen backups that run for 20+ hours) due to throttling imposed by Microsoft 365. +For the purposes of this blog, *incremental backups* are backups where Corso +fetches only a subset of items from M365. Ideally Corso would fetch only the +items that change, though there may be reasons it needs to fetch more data. + +Whether Corso does a full backup or an incremental backup, the resulting Corso +backup has a listing of all items stored in M365 (what we refer to as *indexing +information*). This means there’s no “chaining” between backups and restoring an +item from a backup requires only accessing information contained in or +referenced directly by the backup passed in to the restore command. This makes +backups independent from each other once they’ve been created, so we’ll refer to +them as *independent backups* for the rest of this post. + +Both independent backups and chained backups have the same information. Having +independent backups generally creates more complexity when making a new backup +while chained backups generally have more complexity during restore and backup +deletion. Independent backups have more complexity when creating the backup as +indexing information and item data references for deduplicated data may need to +be sourced from previous backups. Chained backups have more complex restore as +multiple backups may need to be searched for the item being restored. They also +have more complex backup deletion as an item’s data can only be deleted if no +backups in any chain refer to it. The figure below gives a high-level overview +of the differences between independent backups and chained backups. + +![an image of an independent backup](./images/independent_backups.png) +*both images below show how data would be stored if the user backed up two files on their first backup and then made a* +*new file and updated file1 before taking a second backup* +![an image of a chained backup](./images/chained_backups.png) + +Although having a full listing of all items present at the time of the backup in +each backups sounds wasteful, Corso takes advantage of the data deduplication +provided by kopia to only store one copy of the underlying data for backed up +items. What this really means is each Corso backup has a complete set of +*indexing information*. This gives Corso the best of both worlds; allowing +completed backups to have independent indexing information and life cycles from +each other while still minimizing the amount of item data stored. + +> 💡 In part 2 of our series, we’ll cover Incremental backups in action. + +--- + +## Try Corso Today + +Corso implements compression, deduplication *and* incremental backups to give +you the best backup performance. Check +[our quickstart guide](https://corsobackup.io/docs/quickstart/) to see how easy +it is to get started. diff --git a/website/blog/authors.yml b/website/blog/authors.yml index e9b5149d1..54eceab71 100644 --- a/website/blog/authors.yml +++ b/website/blog/authors.yml @@ -21,3 +21,9 @@ gmatev: title: Head of Product url: https://github.com/gmatev image_url: https://github.com/gmatev.png + +amartinez: + name: Ashlie Martinez + title: Product Engineer + url: https://github.com/ashmrtn + image_url: ./images/ashlie.png diff --git a/website/blog/images/ashlie.png b/website/blog/images/ashlie.png new file mode 100644 index 000000000..0f529a06c Binary files /dev/null and b/website/blog/images/ashlie.png differ diff --git a/website/blog/images/chained_backups.png b/website/blog/images/chained_backups.png new file mode 100644 index 000000000..3e4a7c7e8 Binary files /dev/null and b/website/blog/images/chained_backups.png differ diff --git a/website/blog/images/incremental-encoder.jpg b/website/blog/images/incremental-encoder.jpg new file mode 100644 index 000000000..59351a7e6 Binary files /dev/null and b/website/blog/images/incremental-encoder.jpg differ diff --git a/website/blog/images/independent_backups.png b/website/blog/images/independent_backups.png new file mode 100644 index 000000000..27a9a77da Binary files /dev/null and b/website/blog/images/independent_backups.png differ