About Data, Files and Protocols

This post is currently a draft, it’s not yet finished and will be reworked.

Today I want to write a bit about a theoretical approach to file, formats, packets, data and memory. This is the second time, I start writing this post, since the thing i’m trying to present is still evolving in my head.

This post will be about data. Not any specific data, just data in general. What I mean by data is nothing else than information, or more theoretically entropy, chaos. It’s numbers, texts, statistics, programs, music, video, webpages, and much more.

One single piece of data is pretty boring. The number 2374552 for example, is a piece of data, but there is not much you can do with that on its own. Data becomes interesting when you have multiple related pieces of information.

Data-Types

This is where we need to introduce the term data-type: A data-type describes what kind of information you are dealing with. There are some “prime” data-types you can find in most programming languages, for example an integer-data-type or a boolean-data-type.

Now in addition to these prime data-types, we have constructed data-types. These data-types can be described as data-structures, and consist of different members, each for a specific purpose. One example would be a date. It is composed of three numbers for the year, the month and the day.

In C++ and other programming languages, we know them as structs or classes, although classes are a principle of object-oriented programming and used with different intensions, even though they are technically the same.

Data-Collections

There is a third kind of data-type: The data-collection, which may sound very simple, but is actually responsable for a ton of difficulties in programming, caused by its requirements and implications. There are two aspects which define what kind of data-collection we deal with. The first is the way the collection is accessed, and the second is the data-types it accepts as its items.

Collections have a fixed or variable size. I can not think of a sensible application of fixed-size collection-types, which do not use a numeric index. Variable size collections are different.

It is perfectly fine to have a variable size collection where only one element can be read at any time. An example would be a stack (FILO) or queue (FIFO). There is also the possibility of linked or double-linked lists, which do not have indices.

The next step up would be numerically-indexed variable-sized collections, such as a vector (basically a stack where every item can be accessed), binary trees, and whatever you can come up with to use a number as an identifier. These make inserting more difficult, since you might be forced to updated many indices when inserting an item in the middle. There is also a difference between continuously-indexed collections (e.g. c++ arrays) and collections that allow gaps in their indexing.

Next, we have non-numerically-indexed variable-sized collections, where for example a string is used as a key to identify the object. This may be optimised using hashing functions and advanced search mechanics, but it makes the whole process more difficult.

Describing the types of items a collection accepts is pretty simple. The way to define that programatically may be different depending on the system you use. Generally, a collection will accept a specific data-type, and every data-type that inherits from it. In more complicated cases like DTD, it might accept more than one data type.

Now it is also possible, that a data-collection is a data-structure too. This is called for when a structure has one primary set of items, and some other members attributes. In cases where there is more than one collection of items in a structure, that is most likely a data-structure owning multiple data-collections.

A HTML container is one example of a data-type that has both items and members (attributes). The html element itself has a head and a body member which are collections, leaving the html element itself as a plain data-structure.

Ownership

When we define data-types following the rules above, and then create a value of one data-type, we have (in most cases) a tree of nested values, where each leaf is a prime data-type. In this case, every node is the owner of its children. But in some cases, this might be less than ideal, creating data-redundancy, which makes the data hard to manipulate, and otherwise work with.

I’m talking about a situation, where one small value is used in several places in the data-tree of a more complex value. One simple example would be an issue tracker in a development group. Let me define some data-types:

person(string name, integer age)
issue(string title, string description, person assignee)
tracker(list<issue> issues)

When we create an issue, we need to create an assignee for it too. This means that any team member without assigned issue is left unmentioned, and people with more than one assigned issue have their data duplicated. This is obviously a very bad data-structure, but it’s the only one which can not have invalid states.

Addind a list<person> team to the tracker data type solves the unmentione members problem, but not the data duplication one. It also introduces the first invalid state, when an assignee value doesn’t correspond to any item in team.

What we need to do is have each issue reference the person that has been assigned. This has the advantage of removing the data-redundancy, and it also allows for a NULL reference, meaning there is nobody assigned.

A reference in a general definition is a key, which uniquely identifies a value in a collection of multiple values. One example would be the pointer in C, where a memory address is the key, and the scope is the entire memory.

In our example, the reference could be the name in the scope of a list of team members, or better yet, a ID for each team member, that is used as a key in the team list, and as the reference key in the issue.

As we can see, references are necessary but dangerous. When the list is manipulated in a way that breaks the references, we end up with dead links, which results in a corrupted value, which may even be harmful in the case of memory pointers.

The problem is, that the value containing the reference doesn’t own the value it’s referencing. The referenced value can change without the reference noticing it, and in collections, it can even disappear entirely. There are more and more smart references, which handle this, foreign keys in SQL or unique_ptr/shared_ptr in C++, but this is rarely used in files for example.

Additionally, these smart references mostly prevent corruption and not invalidity, e.g. there is no way to tell what is the appropriate action when removing an item from a list that is referenced elsewhere. Do you cancel the removal, or destroy all referencing values, that have become invalid?

References

When we talking about keys for references, there is actually a slight difference between different types. Basically anything may be used as a key, but some keys are safer than others. As I’ve said before, a key is something that identifies a value within a scope or collection.

Let’s say we have a numbered list of people working in a company, giving every person a unique ID. Then, this number can be used as a key to reference that specific person. But this key only carries meaning as an index in the list of employees. In other words, it is owned by the list.

In the early days of computing, many researchers had a three character-name, identifying them. This name was based on their real name, and was pretty much globally unique. This means, that if it was used as a key in a list of employees, it fullfilled they requirements of a reference, while still retaining meaning when the referenced value goes out of scope. The key is owned by the referenced object.

This is what I would call identity-reference as opposed to address-reference. The adavantage of an address-reference is, that the list doesn’t need to know anything about what it contains.

A reference should only ever refer to a single value. When we need to be able to select multiple values at once, we need to use a selector. Selectors are a superset of references, meaning you can choose values by address in lists and their identity. In addition to that, you have a common-identity-selector or type-selector. Examples for this would be CSS element and class selectors.

Scope

Let us have a closer look at the concept of a data-scope. Whenever we have a reference to some data, we need to have rules on how to find the data that reference is pointing to. In the case of a memory-pointer, that is relatively easy. It’s an address-pointer on a single collection, the RAM of the computer running the program. Its scope is the RAM.

Uniform Resource Identifiers are probably the most known, and most diverse references that we find even outside of programming. Let us have a look at some URI schemes and find out how they fit into the definitions we defined here.

URLs are a subset of URIs and define a protocol, a host and a file reference. The protocol only becomes relevant when you want to retrieve the data, but the host and file compose the actual reference.

The host is an unabmiguous reference to a specific computer (or a network of mirrors providing the same resources). It’s resolved via DNS, which makes it an address-reference due to its top-down hierarchical structure. IP address-spaces and Top-Level-Domains are managed be the IANA, which forms a global root scope. Each authority is delegated a part of the hierarchy and may delegate responsibility for derived parts of the hierarchy.

A file reference is an address-reference too, although its authority is the web-server on the host machine. It does not have to represent the file system of that machine, but it does return a single resource (or none) depending on the filename.

TODO: Address pointers unambiguous, identity not

Data-Representation

Up until this point, this entire text on data has been purely theoretical, meaning it is independent on the actual implementation. And I’d argue that all data should be independant of its representation, so that it can be independant of the system it works on.

Text-based data-representations tend to be more flexible in representing any kind of data (JSON, XML) while binary representations are far more useful while processing data.

What I mean by data-representation is a way of encoding a data-type onto a more basic data-type. Using a stack of these data-representations, we are able to save any data-type onto physical memory. In most cases these stacks are rather small, but pulling them apart might change the way we think about saved data.

Usually, a stack of data-representations is called a file-format. Defining a file format is sufficient when you want to read and write data of this format, but is not sufficient, when you want to store this data in any other way.

Let us have a look at a HTML file: The memory on your harddrive provides you with a list of 8-bit values. Since they have a clear order, they can be addressed as individual bytes.

On top of that, we have a character-encoding, for example UTF-8. UTF-8 requires a list of 8-bit value, and provides list of unicode code points. This means, the next level can use all Unicode Characters, and does not need to care about how it is stored.

The next level is a bit of a special one. When we are talking about a HTML 4.0 file, it is actually a SGML (Standardized General Markup Language) generic-data-representation, used to represent a HTML data-type. Generic-data-representations can represent any data-type, provided they have access to the data-type definitions used.

Next Update: Example for normal data-representation & differences between files & packets (less inheritance/collections, exception: AMF), Protocols

Xiphosia