---
slug: incremental-backups-pt1
title: "Speeding up Microsoft 365 backups with delta tokens"
description: "Recent additions to Corso have reduced the duration of backups after the
first backup by taking advantage of Microsoft's delta query API. Doing so allows
Corso to retrieve only the changes to the user's data since the last backup
instead of having to retrieve all items with the Graph API. However,
implementing backups in this manner required us to play a few tricks with the
Corso implementation, so we thought we'd share them here."
authors: amartinez
tags: [corso, microsoft 365, backups]
date: 2023-05-12
image: ./images/incremental-encoder.jpg
---
<!-- vale Vale.Spelling = NO -->
![By © Raimond Spekking / CC BY-SA 4.0 (via Wikimedia Commons), CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=75914553](./images/incremental-encoder.jpg)
<!-- vale Vale.Spelling = YES -->
Full Microsoft 365 backups can take a long time, especially since Microsoft
throttles the number of requests an application can make in a given window of
time. Recent additions to Corso have reduced the duration of backups after the
first backup by taking advantage of Microsoft's delta query API. Doing so allows
Corso to retrieve only the changes to the user's data since the last backup
instead of having to retrieve all items with the Graph API. However,
implementing backups in this manner required us to play a few tricks with the
Corso implementation, so we thought we'd share them here.
<!-- truncate -->
## Background
Before we dive into the details of how incremental backups work, it's useful to
have some knowledge of how delta queries work in the Microsoft Graph API and how
data is laid out in Corso backups.
### Microsoft delta queries
Microsoft provides a delta query API that allows developers to get only the
changes to the data behind an endpoint since the last query was made. The API
represents the idea of the “last query” with an opaque token that's returned
once all items have been listed. For example, if a developer wants to get a delta
token for a specific email folder, the developer would first list all the items
in the folder using the delta endpoint. On the final page of item results from
the endpoint, the Graph API would return a token that could be used to retrieve
future updates.
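
The listing loop described above can be sketched roughly as follows. Graph delta responses carry an `@odata.nextLink` on intermediate pages and an `@odata.deltaLink` on the final page. Everything else here is an assumption made for illustration: `fetch_page` stands in for an authenticated HTTP client, and the canned pages replace real API calls. This isn't Corso's implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

Page = Dict[str, Any]


def list_with_delta(
    fetch_page: Callable[[str], Page], start_url: str
) -> Tuple[List[dict], str]:
    """Follow next links until the final page, which carries the delta link."""
    items: List[dict] = []
    url = start_url
    while True:
        page = fetch_page(url)
        items.extend(page.get("value", []))
        if "@odata.deltaLink" in page:
            # Final page: the opaque token is embedded in this link.
            return items, page["@odata.deltaLink"]
        # Intermediate page: keep following the next link.
        url = page["@odata.nextLink"]


# Hypothetical canned responses standing in for real Graph API calls.
FAKE_PAGES: Dict[str, Page] = {
    "start": {"value": [{"id": "msg1"}, {"id": "msg2"}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"id": "msg3"}], "@odata.deltaLink": "delta?token=abc"},
}

items, delta_link = list_with_delta(FAKE_PAGES.__getitem__, "start")
```

Saving `delta_link` and requesting it later would return only the changes made since this listing finished.
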
All returned tokens represent a point in time and are independent from each
other. This means that getting token a1 at time t1, making some changes, and
then getting another token a2 at time t2 would give distinct tokens. Requesting
the changes from token a1 would always give the changes made after time t1,
including those after time t2. Requesting changes from token a2 would give only
the changes made after time t2. Tokens eventually expire, though, so waiting a
long time between backups (for example, a few days) may cause all items to be
enumerated again. See Nica's
[previous blog post](https://corsobackup.io/blog/how-often-should-you-run-microsoft-365-backups)
on how backup frequency can affect performance.
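
To make the token semantics concrete, here is a small in-memory model. It's a sketch under stated assumptions: `FakeDeltaFolder`, `TokenExpired`, and the age-based expiry rule are all hypothetical, and the real service decides when a token stops being valid.

```python
from typing import Dict, List, Tuple


class TokenExpired(Exception):
    """Raised when a token is too old to be honored (hypothetical)."""


class FakeDeltaFolder:
    """In-memory stand-in for one folder behind a delta endpoint."""

    def __init__(self, max_token_age: int = 100) -> None:
        self.clock = 0
        self.max_token_age = max_token_age
        self.changes: List[Tuple[int, str]] = []  # (time, item ID)
        self.tokens: Dict[str, int] = {}  # token -> time it was issued

    def change(self, item_id: str) -> None:
        """Record a change and advance the logical clock by one tick."""
        self.clock += 1
        self.changes.append((self.clock, item_id))

    def issue_token(self) -> str:
        """Return a token that represents the current point in time."""
        token = f"token-{len(self.tokens) + 1}"
        self.tokens[token] = self.clock
        return token

    def changes_since(self, token: str) -> List[str]:
        """Return every change made after the token was issued."""
        issued = self.tokens[token]
        if self.clock - issued > self.max_token_age:
            raise TokenExpired(token)
        return [item for when, item in self.changes if when > issued]


folder = FakeDeltaFolder(max_token_age=3)
folder.change("file1")
a1 = folder.issue_token()  # issued at t1
folder.change("file2")
a2 = folder.issue_token()  # issued at t2
folder.change("file3")

from_a1 = folder.changes_since(a1)  # changes after t1, including those after t2
from_a2 = folder.changes_since(a2)  # only the changes after t2
```

When `changes_since` raises `TokenExpired`, a caller has no choice but to enumerate every item again, which is the slow path the paragraph above warns about.
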
## Corso full backups, incremental backups, and backup layout
Before we get into the nuts and bolts of how Corso uses the Microsoft delta
query API, it's important to first define what's in a backup and the terminology
we'll be using throughout this post.
### Kopia
Corso makes extensive use of [Kopia](https://github.com/kopia/kopia) to
implement our Microsoft 365 backup functionality. Kopia is a fast and secure
open-source backup/restore tool that allows you to create encrypted snapshots of
your data and save the snapshots to remote or cloud storage of your choice.
### Backup layout
Internally, a single Corso backup consists of three main parts: a kopia manifest
that Corso uses as the root object of the backup (BackupModel), a kopia index
for Corso, and a kopia data backup. The BackupModel contains summary information
about the status of the backup (did it have errors, how many items did it
back up, etc.) and pointers to the two snapshots that hold the index and the data.

The kopia index contains the data shown by the
`corso backup details` command and is used to filter the set of restored items
during restore commands. The index contains one entry for every Microsoft 365
item in the backup.
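
One way to picture the root object and the index is the sketch below. The field names and the `select_for_restore` helper are hypothetical and chosen only to mirror the description above. They aren't Corso's actual types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BackupModel:
    """Root object: summary information plus pointers to the two snapshots."""

    backup_id: str
    error_count: int
    item_count: int
    index_snapshot_id: str
    data_snapshot_id: str


@dataclass
class IndexEntry:
    """One backed-up Microsoft 365 item as listed in the index."""

    item_id: str
    path: str


def select_for_restore(index: List[IndexEntry], folder: str) -> List[IndexEntry]:
    """Filter the index, not the raw data, to pick the items to restore."""
    prefix = folder.rstrip("/") + "/"
    return [entry for entry in index if entry.path.startswith(prefix)]


index = [
    IndexEntry("id-1", "work/important/report.docx"),
    IndexEntry("id-2", "work/notes.txt"),
    IndexEntry("id-3", "personal/photo.jpg"),
]
backup = BackupModel("backup-1", 0, len(index), "index-snap-1", "data-snap-1")
selected = select_for_restore(index, "work")
```
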
The data backup contains the raw bytes that Corso backed up from Microsoft 365.
Internally, Corso uses a file hierarchy in kopia that closely mirrors the layout
of the data in Microsoft 365. For example, if the user has a file in the OneDrive folder
`work/important` then Corso creates a kopia path
`<tenant ID>/onedrive/<user ID>/files/<drive ID>/root/work/important` for that
file.
Corso also stores a few extra bits of metadata in the data snapshot to help with
incremental backups. Most importantly, it stores the Graph API's delta tokens
retrieved during the backup process as well as a mapping relating the current
Microsoft 365 folder IDs to their paths. This information is stored with
different path prefixes (for example, `onedriveMetadata` instead of `onedrive`)
to make it straightforward to separate out from backed-up item data.
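
The path layout and the metadata prefix can be illustrated with a short sketch. The function names are hypothetical, and the assumption that metadata paths swap only the service segment (keeping the rest of the layout) is ours, made to keep the example small.

```python
# Service segments that hold metadata rather than item data. Only the one
# named in this post is listed; treat the set as an assumption.
METADATA_SERVICES = {"onedriveMetadata"}


def onedrive_item_path(
    tenant_id: str, user_id: str, drive_id: str, folder: str, metadata: bool = False
) -> str:
    """Build a kopia-style path that mirrors the OneDrive folder layout."""
    service = "onedriveMetadata" if metadata else "onedrive"
    segments = [tenant_id, service, user_id, "files", drive_id, "root"]
    # Drop empty parts so leading or trailing slashes don't matter.
    segments += [part for part in folder.split("/") if part]
    return "/".join(segments)


def is_metadata_path(path: str) -> bool:
    """Check the service segment to separate metadata from item data."""
    parts = path.split("/")
    return len(parts) > 1 and parts[1] in METADATA_SERVICES


item_path = onedrive_item_path("tenant-1", "user-1", "drive-1", "work/important")
meta_path = onedrive_item_path("tenant-1", "user-1", "drive-1", "", metadata=True)
```
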
### Terminology
*Full backups* are backups where all of the data being backed up is fetched from
Microsoft 365 with the Graph API. These backups may take a long time to complete
(we've seen backups of accounts with extremely large amounts of data run for 20+
hours) due to throttling imposed by Microsoft 365.
For the purposes of this blog, *incremental backups* are backups where Corso
fetches only a subset of items from Microsoft 365. Ideally, Corso would fetch only the
items that changed, though there may be reasons it needs to fetch more data.
Whether Corso does a full backup or an incremental backup, the resulting Corso
backup has a listing of all items stored in Microsoft 365 (what we refer to as *indexing
information*). This means there's no “chaining” between backups, and restoring an
item from a backup requires only accessing information contained in or
referenced directly by the backup passed in to the restore command. This makes
backups independent from each other once they've been created, so we'll refer to
them as *independent backups* for the rest of this post.
Both independent backups and chained backups hold the same information; they
differ in where the complexity lands. Independent backups are more complex to
create, as indexing information and item data references for deduplicated data
may need to be sourced from previous backups. Chained backups are more complex
to restore, as multiple backups may need to be searched for the item being
restored. They're also more complex to delete, as an item's data can only be
deleted if no backups in any chain refer to it. The figures below give a
high-level overview of the differences between independent backups and chained
backups.
![an image of an independent backup](./images/independent_backups.png)
![an image of a chained backup](./images/chained_backups.png)
*Both images above show how data would be stored if the user backed up two files on their first backup and then made a*
*new file and updated file1 before taking a second backup.*
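
The difference in restore complexity can be sketched with plain dictionaries, using the same scenario as the figures. This is a simplified model with hypothetical helper names. It ignores deletions, which a real chained scheme would also have to record.

```python
from typing import Dict, List, Optional

# Each backup maps an item name to the version of the data it references.
# Backup 0 saw file1 and file2; backup 1 saw an updated file1 and a new file3.
independent: List[Dict[str, str]] = [
    {"file1": "v1", "file2": "v1"},
    {"file1": "v2", "file2": "v1", "file3": "v1"},  # complete listing
]
chained: List[Dict[str, str]] = [
    {"file1": "v1", "file2": "v1"},
    {"file1": "v2", "file3": "v1"},  # only what changed since backup 0
]


def restore_independent(
    backups: List[Dict[str, str]], backup_idx: int, item: str
) -> Optional[str]:
    """Only the requested backup is consulted."""
    return backups[backup_idx].get(item)


def restore_chained(
    backups: List[Dict[str, str]], backup_idx: int, item: str
) -> Optional[str]:
    """Walk back through the chain until some backup mentions the item."""
    for idx in range(backup_idx, -1, -1):
        if item in backups[idx]:
            return backups[idx][item]
    return None


from_independent = restore_independent(independent, 1, "file2")
from_chained = restore_chained(chained, 1, "file2")
```

Both calls find the same version of `file2`, but the chained lookup had to fall back to the first backup to find it.
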
Although having a full listing of all items present at the time of the backup in
each backup sounds wasteful, Corso takes advantage of the data deduplication
provided by kopia to store only one copy of the underlying bulk data in the data
snapshot for backed-up items. What this really means is each Corso backup has a
complete set of *indexing information*. This gives Corso the best of both
worlds, allowing completed backups to have independent indexing information and
life cycles from each other while still minimizing the amount of item data
stored.
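
The idea of full indexes over deduplicated data can be illustrated with a toy content-addressed store. This is a deliberate simplification: it hashes whole items with SHA-256, whereas kopia's real deduplication is more involved, and `DedupStore` is a hypothetical name.

```python
import hashlib
from typing import Dict


class DedupStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store data under its hash and return the hash as a reference."""
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # no-op if the bytes already exist
        return key


store = DedupStore()

# Each backup keeps a complete index, but the index holds only references.
backup1 = {"file1": store.put(b"draft"), "file2": store.put(b"notes")}
backup2 = {
    "file1": store.put(b"final"),  # updated, so new bytes are stored
    "file2": store.put(b"notes"),  # unchanged, so the stored bytes are reused
    "file3": store.put(b"new"),
}
```

Five index entries across the two backups point at only four stored blobs, because the unchanged `file2` is shared.
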
Understanding how Microsoft provides information on item updates is a key part
of Corso's ability to provide fast, high-performance backups that still
accurately reflect all updates. If you have feedback, questions, or want more information, please join us on the [Corso Discord](https://discord.gg/63DTTSnuhT).
> 💡 In
> [part 2 of our series](2023-05-13-incrementals-pt2.md),
> we'll cover incremental backups in action, and how Corso manages state and
> merges updates to the hierarchy.
---
## Try Corso Today
Corso implements compression, deduplication *and* incremental backups to give
you the best backup performance. Check
[our quickstart guide](https://corsobackup.io/docs/quickstart/) to see how to get started.