CBOR – Concise Binary Object Representation

Animats · on Jan 29, 2016

It's not as bad as ASN.1. But when you need a "strict mode" for a binary format, and otherwise allow ambiguous decoding, there's something wrong. Something that probably can be exploited. The RFC admits this.

How does the density compare with JSON run through GZIP compression?

galonk · on Jan 30, 2016

It seems rather that the spec allows, but doesn't require, lenience in what data the decoder accepts. If you want to write a decoder that always errors when given invalid data (that is, one that is always "strict"), that's fine. But if you're going to be liberal in what you accept, you are required to also implement a strict mode.

brianolson · on Jan 30, 2016

I have implemented CBOR for Python and Go. In my testing, CBOR+GZIP is slightly smaller (about 10%) than JSON+GZIP. CBOR also parses faster, sometimes with as much as a 3-5x speedup.

I'm not sure what 'ambiguous decoding' you're referring to. I've seen some complaints that an int might decode to an int32 or int64 depending on the value, and so differ from the original storage size. E.g., '42' stored from an int64 might unpack to an int8, depending on the language and implementation of the decoder.
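A minimal sketch of why the storage width is not preserved, assuming only the shortest-form integer rules from RFC 7049 (the function name `encode_uint` is made up for illustration and is not taken from any library mentioned in this thread). The encoder sees only the value, so 42 takes two bytes whether it came from an int8 or an int64:

```python
import struct

def encode_uint(n: int) -> bytes:
    """Encode a non-negative integer as CBOR major type 0, shortest form."""
    if n < 0 or n >= 2**64:
        raise ValueError("out of range for major type 0")
    if n < 24:
        return bytes([n])                          # value fits in the initial byte
    if n < 2**8:
        return bytes([24]) + struct.pack(">B", n)  # 1-byte argument follows
    if n < 2**16:
        return bytes([25]) + struct.pack(">H", n)  # 2-byte argument follows
    if n < 2**32:
        return bytes([26]) + struct.pack(">I", n)  # 4-byte argument follows
    return bytes([27]) + struct.pack(">Q", n)      # 8-byte argument follows

print(encode_uint(42).hex())       # 182a
print(len(encode_uint(2**40)))     # 9
```

A decoder that maps each wire width to the narrowest native type would hand back an 8-bit integer for 42 here, which is the behaviour people complain about.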

CJefferson · on Jan 29, 2016

This format is weird -- it represents (effectively) a 65-bit signed integer, which doesn't fit on any current platform sensibly, but also puts a strict upper limit on integer sizes.

EDIT: Found bigints. Still weird having a signed 65-bit integer.

advisedwang · on Jan 29, 2016

This sounds like they were trying to make one format that would fit both "signed int64" and "unsigned int64".
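A rough sketch of where the 65 bits come from, assuming the RFC 7049 rule that major type 0 carries the value n and major type 1 carries -1 - n, each with an argument of up to 64 bits. The helper name `decode_int` is hypothetical:

```python
def decode_int(major_type: int, argument: int) -> int:
    """Map a CBOR integer head (major type 0 or 1) to its numeric value."""
    if not 0 <= argument < 2**64:
        raise ValueError("argument must fit in 64 bits")
    if major_type == 0:
        return argument          # unsigned: 0 .. 2**64 - 1
    if major_type == 1:
        return -1 - argument     # negative: -2**64 .. -1
    raise ValueError("not an integer major type")

lowest = decode_int(1, 2**64 - 1)
highest = decode_int(0, 2**64 - 1)
print(lowest == -(2**64))               # True
print(highest - lowest + 1 == 2**65)    # True
```

So the two major types together cover every uint64 and every int64, plus negative values below the int64 minimum that neither type can hold.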

ioquatix · on Jan 29, 2016

A signed 64-bit integer with 2s complement representation still only has 264 unique values. 265 for a signed representation, while not stupid, is a bit out there.

deathanatos · on Jan 30, 2016

readers: The parent poster is saying that it has 2⁶⁴ unique and 2⁶⁵ unique values; the formatting is wonky.

ioquatix: You might want to edit your post to indicate exponentiation. It took me quite a moment to realize you meant that a signed 64-bit int represented in 2s complement has two-to-the-sixty-fourth values, not two hundred and sixty four. I'm guessing you used two asterisks which HN mistook as a zero-length italics?

moron4hire · on Jan 30, 2016

Whoa, how did you do it?

deathanatos · on Jan 30, 2016

Unicode has superscript versions of all of the arabic numerals. "⁶" here is U+2076, "SUPERSCRIPT SIX"[1]; there's not any real "formatting" happening in my post. (Because HN doesn't support superscript, and two asterisks is a pain as I noted, and ^ gets confused by programmers for xor…)
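For anyone who would rather generate these than type them, here is a small sketch using Python's standard `unicodedata` module and `str.translate`. The helper name `power` is made up for this example:

```python
import unicodedata

# Superscript digits 0-9; note that 1, 2 and 3 live in the Latin-1 block.
SUPERSCRIPTS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)

def power(base: int, exponent: int) -> str:
    """Render base**exponent using Unicode superscript digits."""
    return f"{base}{str(exponent).translate(SUPERSCRIPTS)}"

print(unicodedata.name("\u2076"))   # SUPERSCRIPT SIX
print(power(2, 64))                 # 2⁶⁴
```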

If you happen to be on OS X, you can type Control+Command+Space in most contexts, and get a popup that will let you select special characters, like the above, emoji, etc. It even has a decent search box.

If you're on Linux, you can set a key as a Compose key (I use my right alt key); the default set of compose sequences includes superscript six as Compose, ^, 6. You can also input arbitrary unicode into most contexts if the app is using GTK with Ctrl+Shift+u, followed by the hexadecimal code point, followed by space.

On Windows, I have no idea.

[1]: http://www.fileformat.info/info/unicode/char/2076/index.htm

logn · on Jan 30, 2016

Unicode https://en.wikipedia.org/wiki/Unicode_subscripts_and_supersc...

ioquatix · on Jan 30, 2016

I guess I can't edit my post but yeah, it was supposed to be 2^64 and 2^65 :)

p0nce · on Jan 30, 2016

I once implemented a CBOR parser and found that to be a strange oddity.

stock_toaster · on Jan 30, 2016

Wasn't CBOR originally submitted to IETF as msgpack, against[1] the desires of the msgpack dev team? Or am I thinking of something else?

[1]: https://github.com/msgpack/msgpack/issues/129

nornagon · on Jan 29, 2016

How does this compare to other similar specifications, e.g. MsgPack and BSON?

rspeer · on Jan 29, 2016

From what I've read of the spec so far: msgpack has a confusion between text and binary data baked into it that will probably never be resolved, and CBOR deliberately fixes that.

It also seems to have a better implementation of streaming data than msgpack. In msgpack, streaming is something you have to implement outside of the format itself, possibly by concatenating together many msgpack representations. CBOR has a way to say "here comes a streaming list, I'll tell you when it's done".
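A minimal sketch of that streaming list, assuming the RFC 7049 byte values 0x9f ("array of indefinite length") and 0xff ("break"). To keep it short, the hypothetical `stream_array` helper handles only integers 0 to 23, which encode as a single byte:

```python
from typing import Iterable, Iterator

def stream_array(items: Iterable[int]) -> Iterator[bytes]:
    """Yield the chunks of an indefinite-length CBOR array of small ints."""
    yield b"\x9f"                      # start of array, length not known yet
    for n in items:
        if not 0 <= n < 24:
            raise ValueError("this sketch only handles integers 0..23")
        yield bytes([n])               # each small integer is one byte
    yield b"\xff"                      # "break": the array is finished

# The producer never needs to know how many items there will be.
encoded = b"".join(stream_array(iter([1, 2, 3])))
print(encoded.hex())   # 9f010203ff
```

A definite-length array would instead begin with 0x83, which requires knowing the count of three up front.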

BSON is a representation of MongoDB's data model and doesn't make that much sense to use without MongoDB.

I currently use msgpack for a lot of things, but if CBOR's Python library is good enough, I might switch.

Matthias247 · on Jan 29, 2016

Afaik newer versions of messagepack added an extra type to have string and binary now separated.

I read somewhere that CBOR was better designed for extensibility, but don't know anything further about it.

One difference (on the non-technical side) is that CBOR is standardized through IETF.

spc476 · on Jan 29, 2016

The base types are "unsigned int", "negative int", "binary data", "UTF-8 string", an array of said items, a map (key, value) of said items, extended types (up to 2^64) and tags (again, up to 2^64). There are only 8 "extended types" currently defined (false, true, null, undefined, half float (IEEE 16b), single float (IEEE 32b), double float (IEEE 64b), and break (used to terminate streaming data)), leaving plenty of unused values for future expansion.
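As a rough illustration of how those base types are told apart on the wire: per RFC 7049, the top three bits of an item's first byte select the major type and the low five bits carry the "additional information". The names below are illustrative and not from any particular library:

```python
MAJOR_TYPES = (
    "unsigned int", "negative int", "byte string", "text string",
    "array", "map", "tag", "simple/float",
)

def split_initial_byte(b: int) -> tuple:
    """Return (major type, additional information) for an initial byte."""
    return b >> 5, b & 0x1F

for byte in (0x18, 0x63, 0xF5, 0xFF):
    major, info = split_initial_byte(byte)
    print(hex(byte), MAJOR_TYPES[major], info)
# 0x18 unsigned int 24
# 0x63 text string 3
# 0xf5 simple/float 21
# 0xff simple/float 31
```

In major type 7, additional information 21 is `true` and 31 is the break marker.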

Tags are used to apply meta information to a piece of data. For example, you can tag a UTF-8 string as a URL or tag an array with a reference so it can be referred to elsewhere in the CBOR encoded data (an extension defined by the IETF but outside RFC-7049).
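A hedged sketch of the URL case, assuming tag number 32 (registered for URIs in RFC 7049) wrapped around a UTF-8 text string. `encode_head` and `tagged_uri` are hypothetical helper names:

```python
import struct

def encode_head(major: int, arg: int) -> bytes:
    """Encode a CBOR initial byte plus its shortest-form argument."""
    if arg < 24:
        return bytes([major << 5 | arg])
    for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if arg < 2 ** (8 * struct.calcsize(fmt)):
            return bytes([major << 5 | info]) + struct.pack(fmt, arg)
    raise ValueError("argument does not fit in 64 bits")

def tagged_uri(uri: str) -> bytes:
    """Wrap a text string (major type 3) in tag 32 (major type 6)."""
    data = uri.encode("utf-8")
    return encode_head(6, 32) + encode_head(3, len(data)) + data

print(tagged_uri("http://www.example.com")[:3].hex())   # d82076
```

The tag is just a prefix: a decoder that does not understand tag 32 can still read the string that follows it.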

rspeer · on Jan 29, 2016

> Afaik newer versions of messagepack added an extra type to have string and binary now separated.

The problem is that the 'str' type contains arbitrary binary data in an unspecified encoding, and always will, because of backward compatibility. This isn't changed by adding a 'bin' type.

Msgpack decoders in Python, for example, have to give you bytestrings unless you pass an option that promises that 'str's are all encoded in UTF-8.

prutschman · on Jan 30, 2016

From https://github.com/msgpack/msgpack/blob/master/spec.md

  Raw
    String extending Raw type represents a UTF-8 string
    Binary extending Raw type represents a byte array

rspeer · on Jan 30, 2016

Ah okay, I didn't know there was now a specific String type (and that the one I was calling 'str' is called 'raw'). Does the Python library use it?

prutschman · on Jan 31, 2016

It can: see https://github.com/msgpack/msgpack-python#string-and-binary-...

rspeer · on Jan 31, 2016

I don't even know what to believe anymore. That documentation is referring to two types, with "raw" renamed to "str" plus a new "bin", which is what I thought it was.

But the link you posted referred to three types, where "str" and "bin" subclass "raw", which sounded like it provided a non-backward-compatible "str" that's guaranteed to be text.

tracker1 · on Jan 29, 2016

They should just add a UTF8 type... I don't know why that wasn't the default for strings all along.

brianolson · on Jan 30, 2016

I wrote the Python CBOR implementation you get if you `pip install cbor`. Source project here: https://bitbucket.org/bodhisnarkva/cbor It's getting pretty extensive use at a few places I know, so I hope it's good enough!

rspeer · on Feb 1, 2016

Good to know! I'll try it as soon as it supports Python 3.5, which I use on a couple of machines. The issues page tells me that you're working on that already.

brianolson · on Feb 2, 2016

Fixed that last night. Good to go now.

Zash · on Jan 29, 2016

There's a comparison at the end of the specification: http://tools.ietf.org/html/rfc7049#appendix-E

Having written an implementation (the Lua one), I quite like this format: it's flexible, not too complicated, and still quite compact.

edwintorok · on Jan 30, 2016

How does it compare to UBJSON?

calibraxis · on Jan 29, 2016

Comparisons to Fressian and Transit would be interesting too.

vbit · on Jan 30, 2016

See also ubjson (http://ubjson.org/) which is quite nice because it is a simple format that manages to be quite concise and extensible.

An interesting feature is support for strongly typed arrays and objects. This lets you embed binary data as-is by specifying it as an array of uint8.

floatboth · on Jan 30, 2016

I wrote a Swift encoder/decoder some time ago: https://github.com/myfreeweb/SwiftCBOR

greydius · on Jan 30, 2016

> ...and a few values such as false, true, and null.

When are we going to learn from our past mistakes?

deathanatos · on Jan 30, 2016

I don't think null is necessarily bad. Sometimes, you either have a type or nothing at all. It's when nullable is the default that undoes us, I think. It should be an explicit choice that your type is <thing|null>, not an implicit one. That's the problem with Java references, or C pointers: it's on by default and you can't opt-out; since null is always a valid value for those types, I can't inform the type checker that null isn't a valid input (or a possible output) from a function, so if someone mistakenly passes one, you won't find out until runtime.

Contrast to Rust's Option<Foo> type; you know when null (None in Rust) is a possibility, because it's in the type. (And in Rust, you're forced to deal with it, too.)

paulddraper · on Jan 30, 2016

You are right that null can be a problem when encoded wrong in a statically typed language.

But they cause problems even without static types.

https://www.lucidchart.com/techblog/2015/08/31/the-worst-mis...

For example, Rust can have Option<Option<T>>, which can be None, Some(None), or Some(Some(...)). But you can't represent that with null, because nulls don't "stack".
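A small Python sketch of the same point, since Python's None behaves like null here. A missing key and a key explicitly set to null collapse into one value unless you add a wrapper yourself; `lookup` and the "some" tuple are made up for illustration, standing in for Some(...):

```python
import json

absent = json.loads('{}')
explicit_null = json.loads('{"nickname": null}')

# With a bare null, "no such key" and "key is null" look identical.
print(absent.get("nickname"), explicit_null.get("nickname"))   # None None

def lookup(d: dict, key: str):
    """Return None if the key is missing, else ("some", value)."""
    return ("some", d[key]) if key in d else None

print(lookup(absent, "nickname"))          # None
print(lookup(explicit_null, "nickname"))   # ('some', None)
```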

TazeTSchnitzel · on Jan 30, 2016

It must contain null if only for JSON compatibility.

bascule · on Jan 30, 2016

I can't speak to the people who designed JOSE/COSE or their motivations, but to me every design decision they made is either ignorant of, or actively opposed to, the lessons of past problems with similar standards.

ASN.1 would perhaps be the main motivation here. Despite being an abstract syntax, signatures across BER/DER/PER were all distinct. With a little bit of work, signature algorithms as expressed in standards like CMS could be abstract across the encoded representations.

But They Didn't Do That.

Flash forward almost 30 years into the future, and we're literally dealing with the same problems:

https://www.ietf.org/proceedings/94/cose.html

"The resulting formats will not be cryptographically convertible from or to JOSE formats."

WHAT REALLY? A longstanding problem with these formats for 30 years, and they literally did the exact same thing? Yes, yes they did.

It's hard to argue with one specific JOSE/COSE thing, since this cancer literally has its fingers in every single honeypot it can get its tentacles into. But hey, know what format does this (at least for bearer credentials, if you think JWT/CWT are cool)?

Macaroons:

http://macaroons.io

In addition to not just being JSON-for-CMS/SAML, Macaroons are actually designed to be bearer credentials, rather than being a case of "we took an old idea and added JSON" slapped onto old concepts, but...

Macaroons are provably secure in their own dialect of Abadi's authorization logic. Check the last page of the paper:

http://research.google.com/pubs/pub41892.html

CWT (and vicariously JWT) try to slap JSON/JOSE syntax/standards onto a bunch of existing concepts, but fail to actually fix the fundamental problems, like provable security. Yes, provable security: Macaroons are predicated on proofs. JWTs are predicated on broken promises and ad hoc design.

All that said, the authors of the JOSE/COSE standards couldn't have tried harder to repeat every single one of the mistakes of the past. These standards are nonsense. Unless you're switching from CMS they offer no practical benefits, and can potentially introduce new vulnerabilities due to their ludicrous complexity.

Unless you can name the exact CBOR-encoded standard you want to use and why you should use it, and the alternative you're considering is practically anything but CMS, AVOID AVOID. Stick with anything that's more standard. Slapping JSON on things doesn't help, but just introduces new problems in a space where older standards are at least better understood.

There are problems in this space: ASN.1 is old, overcomplicated, hard to use, and the source of many vulnerabilities in the way it's described. So we have a serialization format with bad semantics used in security-critical contexts. Should we replace it? Yes. Is JOSE/COSE the solution?

JSON has many odd/ambiguous semantics, a shitty type system, and there is no direct mapping between ASN.1's type system and JSON/JOSE's, because it is intentionally restricted by design.

JOSE/COSE's solution is to improve a security critical format by getting rid of types.

Wait what? Improving security by getting rid of types? Yes that's exactly what JOSE/COSE are doing, and it's the wrong direction IMO.

I would much prefer some sort of modern typed serialization format which has packed and unpacked representations. Protobufs or capnproto come to mind.