Regex Configurations

Yesterday, I implemented a first version of some code that parses a config file using regex. We needed a simple config format, readable by both humans and machines, that would be easy to implement, need basically no error handling, and support single-line as well as multi-line config files. I was already used to regex, having used it to validate various inputs before feeding them into SQL statements, and also to filter text.

So the idea came up to create a basic regular expression that matches config values in a key:value format. A basic expression to parse this would be:

(\w+):(\w*)

This matches any two strings of alphanumeric characters and underscores that are joined by a colon; the second string may be empty. The key can then be extracted as sub match #1 and the value as sub match #2. Since the regex implementation handles the extraction for you, malformed entries don’t need special handling: anything that doesn’t match the pattern is simply skipped and gives no results.
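
If you want to try this in C++, a minimal sketch using std::regex might look like the following. The function name parsePairs and the choice of a std::map are assumptions made for this example, not part of the original code.

```cpp
#include <map>
#include <regex>
#include <string>

// Collects every key:value pair found in `text`.
// `parsePairs` is an illustrative name, not part of any library.
std::map<std::string, std::string> parsePairs(const std::string& text) {
  // Raw string literal, so the backslashes need no extra escaping.
  static const std::regex pattern(R"((\w+):(\w*))");
  std::map<std::string, std::string> result;
  for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
    // Sub match #1 is the key, sub match #2 the (possibly empty) value.
    result[(*it)[1].str()] = (*it)[2].str();
  }
  return result;
}
```

Calling parsePairs("name:Example\nport:8080\nempty:\n") should give three entries, with an empty string stored for the key “empty”. A text without any key:value pair just gives an empty map.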

At this point, I’m going to introduce you to a very handy tool that I use to develop regular expressions, and which includes a handy cheatsheet: regexr.com. If your config regex doesn’t return anything, it’s a good place to check the file you are using against your expression. Keep in mind that in a C++ string literal, for example, every ‘\’ character has to be escaped as ‘\\’, so you need to turn those doubled backslashes back into single ones when pasting the expression into regexr.
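
One way to avoid the escaping step in C++ is a raw string literal (available since C++11), which can be copied to and from regexr unchanged. The sketch below writes the same expression both ways; the names escapedForm, rawForm and hasPair are made up for this example.

```cpp
#include <regex>
#include <string>

// The same expression written twice: once with escaped backslashes,
// once as a raw string literal that can be pasted into regexr unchanged.
const std::string escapedForm = "(\\w+):(\\w*)";
const std::string rawForm = R"((\w+):(\w*))";

// Returns true if `line` contains a key:value pair according to `expression`.
bool hasPair(const std::string& expression, const std::string& line) {
  return std::regex_search(line, std::regex(expression));
}
```

Both strings hold exactly the same characters, so it doesn’t matter which one you hand to std::regex.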

The expression above is already capable of parsing several layouts you may want, and it even handles empty values. But we can improve it to make the syntax more specific and, while doing that, gain more freedom in how we lay out the file.

Look at the second group in the expression. It allows us to capture any sequence of 0 or more alphanumeric or underscore characters. But what about a URL? URLs use colons, slashes, bars and dots, and we want to be able to parse them too. Imagine the following config file:

name:Example
homepage:http://www.example.com

It will result in a “name” key with value “Example” and a “homepage” key with value “http”, because the match stops at the second colon. So we need a way to tell the regex that we want to accept more types of characters. The most intuitive way is probably to add the required characters to the expression, so that the match continues until it encounters a character that isn’t listed. So you could use

(\w+):([\w:/.?=&%]*)

to allow URLs as valid values. The “?=&%” is needed to support a full URL-encoded string with a query. But when you start needing more and more characters, e.g. for JSON strings, different kinds of brackets, mathematical expressions, or even regular expressions, you should seriously think about a different way to encode that. And this is where we come to delimiters.
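
To compare the two expressions on the homepage line, a small helper like the one below can be used. It is only a sketch, and firstValue is a name invented for this example.

```cpp
#include <regex>
#include <string>

// Returns sub match #2 (the value) of the first match of `expression` in `line`,
// or an empty string if nothing matches. `firstValue` is an illustrative helper.
std::string firstValue(const std::string& expression, const std::string& line) {
  std::smatch match;
  if (std::regex_search(line, match, std::regex(expression))) {
    return match[2].str();
  }
  return "";
}
```

With the first expression, firstValue returns “http” for the line homepage:http://www.example.com, while the extended character class returns the full URL. Note that a hyphen is not in the list, so a host name containing one would still be cut off.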

Let’s say you want a system that can handle any kind of data and still hold multiple values. If you simply accept every character, then all following key-value pairs will be swallowed as the data of the first key. So if we define our config format as key:value; we can use the following expression to match it:

(\w+):([^;]*)

This does exactly what we want: the value may contain anything but a semicolon, and each entry is terminated by a semicolon. A side effect of this is that a value can even span multiple lines.
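
The multi-line behaviour is easy to check with a short sketch. The function name parseEntries and the vector of pairs (used here to keep the entries in file order) are assumptions for this example.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns all key:value; entries in the order they appear in `text`.
// `parseEntries` is an illustrative name, not part of any library.
std::vector<std::pair<std::string, std::string>> parseEntries(const std::string& text) {
  // The value is everything up to, but not including, the next semicolon.
  static const std::regex pattern(R"((\w+):([^;]*))");
  std::vector<std::pair<std::string, std::string>> entries;
  for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
    entries.emplace_back((*it)[1].str(), (*it)[2].str());
  }
  return entries;
}
```

A value such as “Hello,” followed by a line break and “world!” comes back as one string with the line break still inside, and the URL is no longer cut off at its colon.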

Now consider one last problem that our syntax has had since the very beginning. When you try to format your config file in a way that makes it look good, you always end up with whitespace in your values where you don’t want it. Take this example:

name:		Example;
homepage:	http://www.example.com;

Here we use tabs to align the data in our config file. If you give this to the expression above, you’ll notice that the resulting sub matches still contain those tab characters. That makes sense when you think about how the match works: it takes any character except our delimiter and interprets it as data.

What we can do to prevent that is to let the expression match any spaces or tabs before the data even starts. Of course, this means that a value can no longer start with a space or tab, so you cannot use a single tab as a configuration value anymore. Take a look at this expression:

(\w+):[ \t]*([^;]*)

We allow any number of space or tab characters after the colon and before the data block. So if you need a config file that can be parsed by almost any language out there without external libraries, just use this regular expression. Feel free to adapt it to fit your needs and to change anything you don’t like about it. The regex provided can still be improved: you may want to allow spaces in front of the colon, or you may want to be able to use semicolons in values, which is possible using escape characters, but this is a story for another time.
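
Put together, a loader built on this expression could look roughly like the sketch below. It reads from any std::istream, so a std::ifstream works as well as a string stream. The name loadConfig is an assumption for this example.

```cpp
#include <istream>
#include <iterator>
#include <map>
#include <regex>
#include <sstream>
#include <string>

// Reads a whole stream and returns its key:value; entries.
// `loadConfig` is an illustrative name, not an existing API.
std::map<std::string, std::string> loadConfig(std::istream& input) {
  // Read everything at once, since values may span several lines.
  const std::string text((std::istreambuf_iterator<char>(input)),
                         std::istreambuf_iterator<char>());
  // Spaces and tabs after the colon are matched outside of sub match #2.
  static const std::regex pattern(R"((\w+):[ \t]*([^;]*))");
  std::map<std::string, std::string> config;
  for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
    config[(*it)[1].str()] = (*it)[2].str();
  }
  return config;
}
```

Only the whitespace directly after the colon is skipped. Spaces or tabs between the end of the value and the semicolon stay part of the value, so you would have to trim those yourself if they bother you.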

INI Files

But regex can also be useful for more complex syntax. With the right handling in your programming language of choice, you can process INI files. This regex is capable of reading the required data from a file that follows strict INI syntax rules:

REGEX EXPRESSION:
\[(\w+)\]|(\w+)=([^\n]*)|;[^\n]*
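
The expression has three alternatives: a section header (sub match #1), a key=value pair (sub matches #2 and #3), and a comment without any sub match. To see which alternative fired for each match, a sketch like this one can help. The helper name classify and the label strings are invented for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Labels each match of the INI expression as section, pair or comment,
// depending on which alternative matched. `classify` is an illustrative helper.
std::vector<std::string> classify(const std::string& text) {
  static const std::regex pattern(R"(\[(\w+)\]|(\w+)=([^\n]*)|;[^\n]*)");
  std::vector<std::string> kinds;
  for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
    if ((*it)[1].matched) {
      kinds.push_back("section " + (*it)[1].str());
    } else if ((*it)[2].matched) {
      kinds.push_back("pair " + (*it)[2].str() + " -> " + (*it)[3].str());
    } else {
      // Neither group matched, so this was the comment alternative.
      kinds.push_back("comment");
    }
  }
  return kinds;
}
```

Checking the matched flag of each sub match is what lets the calling code tell the three cases apart.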

The following code is all you need to load this into an accessible data structure when you are using C++. Of course, the same structure applies to any other language:

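C++ CODE:
#include <regex>
#include <string>
#include <unordered_map>

using namespace std;

unordered_map<string, unordered_map<string, string>> getINI(const string& file) {
  // Sub match 1: section name, sub matches 2 and 3: key and value.
  // The last alternative consumes comments without capturing anything.
  const regex pattern("\\[(\\w+)\\]|(\\w+)=([^\\n]*)|;[^\\n]*");
  unordered_map<string, unordered_map<string, string>> config;
  unordered_map<string, string> group;
  string name = "GENERAL";  // keys before the first section header end up here
  sregex_iterator rit(file.begin(), file.end(), pattern);
  sregex_iterator rend;  // a default-constructed iterator marks the end
  while (rit != rend) {
    if ((*rit)[1].matched) {
      // New section header: store the finished group and start a new one
      if (!group.empty()) {
        config.insert(make_pair(name, group));
        group.clear();
      }
      name = (*rit)[1].str();
    } else if ((*rit)[2].matched) {
      // key=value pair; comment matches have no sub matches and are skipped
      group.insert(make_pair((*rit)[2].str(), (*rit)[3].str()));
    }
    ++rit;
  }
  // Store the last group, which is not followed by another section header
  if (!group.empty()) {
    config.insert(make_pair(name, group));
  }
  return config;
}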

I hope this post will be useful for some people and help you understand the power of regex for such situations.

Xiphosia