ashmrtn 87e41b20e0
JSON sanitizer and tests (#4925)
We've found that it's possible to get malformed JSON back from
the graph server and graph SDK. This causes errors when we try
to deserialize the affected objects.

This adds a simple JSON sanitizer and tests to make sure that even if
graph and the graph SDK happen to generate malformed JSON we can get
back to a state where we can deserialize it again. The tests also
print info when the received content differs from the original.

This PR does not change any of the logic that corso uses during
backups or restores; it just adds sanitization code and tests.

---

#### Does this PR need a docs update or release note?

- [ ]  Yes, it's included
- [ ] 🕐 Yes, but in a later PR
- [x]  No

#### Type of change

- [ ] 🌻 Feature
- [x] 🐛 Bugfix
- [ ] 🗺️ Documentation
- [x] 🤖 Supportability/Tests
- [ ] 💻 CI/Deployment
- [ ] 🧹 Tech Debt/Cleanup

#### Test Plan

- [x] 💪 Manual
- [x]  Unit test
- [ ] 💚 E2E
2023-12-23 05:24:00 +00:00

59 lines
1.9 KiB
Go

package sanitize

import (
	"bytes"
	"fmt"

	"golang.org/x/exp/slices"
)

// JSONBytes takes a []byte containing JSON as input and returns a []byte
// containing the same content, but with any unescaped character codes < 0x20
// (other than \n, \t, and \r) properly escaped.
func JSONBytes(input []byte) []byte {
	if len(input) == 0 {
		return input
	}

	// Avoid most reallocations by just getting a buffer of the right size to
	// start with.
	// TODO(ashmrtn): We may actually want to overshoot this a little so we won't
	// cause a reallocation and possible doubling in size if we only need to
	// escape a few characters.
	buf := bytes.Buffer{}
	buf.Grow(len(input))

	for _, c := range input {
		switch {
		case c == '\n' || c == '\t' || c == '\r':
			// Whitespace characters also seem to be getting transformed inside JSON
			// strings already. We shouldn't further transform them because they
			// could just be formatting around the JSON fields, so changing them
			// would result in invalid JSON.
			//
			// The set of whitespace characters was taken from RFC 8259, although
			// space is not included in this set as it's already >= 0x20.
			buf.WriteByte(c)

		case c < 0x20:
			// Escape character ranges taken from RFC 8259. This case doesn't handle
			// escape characters (0x5c) or double quotes (0x22). We're assuming
			// escape characters don't require additional processing and that double
			// quotes are properly escaped by whatever handed us the JSON.
			//
			// We need to escape the character. We could use the short transforms
			// (e.g. backspace -> \b), but it's actually easier, if a little less
			// space efficient, to just turn each one into a multi-character \uXXXX
			// sequence denoting a unicode character.
			fmt.Fprintf(&buf, `\u%04X`, c)

		default:
			buf.WriteByte(c)
		}
	}

	// Return a copy just so we don't hold a reference to internal bytes.Buffer
	// data.
	return slices.Clone(buf.Bytes())
}