corso/website/docs/setup/fault-tolerance.md
Niraj Tolia 1fd640a754
Add docs page for fault-tolerance (#2313)
## Description

Add docs on Coso's fault tolerance and reliability

## Does this PR need a docs update or release note?

- [x]  Yes, it's included

## Type of change

- [x] 🗺️ Documentation
2023-01-31 01:53:02 +00:00

49 lines
2.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Fault tolerance
Given the millions of objects found in a typical Microsoft 365 tenant,
Corso is optimized for high-performance processing, hardened to
tolerate transient failures and, most importantly, able to restart backups.
Corsos fault-tolerance architecture is motivated by Microsofts Graph
API variable performance and throttling. Corso follows Microsofts
recommend best practices (for example, [correctly decorating API
traffic](https://learn.microsoft.com/en-us/sharepoint/dev/general-development/how-to-avoid-getting-throttled-or-blocked-in-sharepoint-online#how-to-decorate-your-http-traffic))
and, in addition, implements a number of optimizations to improve
backup and restore reliability.
## Recovery from transient failures
Corso, at the HTTP layer, will retry requests (after a HTTP timeout,
for example) and will respect Graph APIs directives such as the
`retry-after` header to backoff when needed. This allows backups to
succeed in the face of transient or temporary failures.
## Restarting from permanent API failures
The Graph API can, for internal reasons, exhibit extended periods of
failures for particular Graph objects. In this scenario, bounded retries
will be ineffective. Unless invoked with the
fail fast option, Corso will skip over these failing objects. For
backups, it will move forward with backing up other objects belonging
to the user and, for restores, it will continue with trying to restore
any remaining objects. If a multi-user backed is in progress (via `*`
or by specifying multiple users with the `—user` argument), Corso will
also continue processing backups for the remaining users. In both
cases, Corso will exit with a non-zero exit code to reflect incomplete
backups or restores.
On subsequent backup attempts, Corso will try to
minimize the work involved. If the previous backup was successful and
Corsos stored state tokens havent expired, it will use [delta
queries](https://learn.microsoft.com/en-us/graph/delta-query-overview),
wherever supported, to perform incremental backups.
If the previous backup for a user had resulted in a failure, Corso
uses a variety of fallback mechanisms to reduce the amount of data
downloaded and reduce the number of objects enumerated. For example, with
OneDrive, Corso won't redo downloads of data from Microsoft 365 or
uploads of data to the Corso repository if it had successfully backed
up that OneDrive file as a part of a previously incomplete and failed
backup. Even if the Graph API might not allow Corso to skip
downloading data, Corso can still skip another upload it to the repository.