I've been looking for a while now for a database system with a few particular characteristics:
1. Simple to admin, *very* low resource usage, won't store much data (under a megabyte probably).
2. Distributed, so I can have a local copy of the data on each of a bunch of machines for local access.
3. Able to do eventual-consistency so it can be read and written by local processes even when partitioned from every other node, and will catch up when it regains connectivity.
4. Would be nice if it could use multiple redundant datapaths between nodes.
5. Likely won't need conflict resolution as I plan to just append most of the time, and occasionally delete or consolidate parts.
6. Pretty much any data model is likely fine, happy with either relational or key-value store, simplicity preferred though.
Apache Cassandra is interesting, does eventual consistency (and several other levels of consistency) but looks like it'll be a huge heavy lump of Java, and while I haven't tried it yet it seems likely to need lots of resources, both machine resources and my time. I *really* don't want to have to install a JRE on every machine if I don't need to. (I like the Java language, but I hate how resource-intensive running almost anything written in it is.)
etcd seems interesting as it's apparently built for exactly the kind of data I want to put in it (config stuff for a bunch of servers), and is nice and light. rqlite is interesting too, as it's apparently *very* simple to admin. Both are written in Go, so they should be simple to install and run as they'll be a single binary each. But both seem to only support full consistency (both use the Raft algorithm, I think), and I'm guessing they won't even allow reads on a node without quorum.
I did spend a bit of time starting to write something myself (in bash, of course; probably ought to consider it a prototype), which I might end up finishing off if I can't find anything that does quite what I want. Basically it keeps a log file per node, to which change records (create/update/delete operations) are appended. Each log gets replicated (over a long-lived ssh connection) to every other node in the cluster, so each node has a view of all nodes' logs.

You get the latest state of an object by searching all the logs you can see for changes to the object you're interested in, sorting them by timestamp, and applying them in order, starting from a default state. Alternatively, each record might be a complete new state for an object, in which case you just look for the most recent one and use that. I'm not yet sure which would be best for the stuff I want to put in it.

The logs don't grow without bound because every so often (daily/weekly/monthly? it'll be very low traffic) each log is retired and replaced by a new one, which starts with records giving the current state of each object, i.e. the final state of all previous logs. Old logs can then be discarded. There's no need for anything fancy, because each log can only be written by its owner node and read by all other nodes.

Changes to objects *can* be made by any node, so technically there's conflict resolution needed, but my plan is to just use whatever is there at any given time. Each node could write a confirmation record into its own log for each change record it sees from other nodes, so a node can tell how far a change has propagated and treat a change as "unconfirmed" if it doesn't yet see confirmation from every other node (or maybe a quorum of nodes, or a specific set of nodes).
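To make the idea concrete, here's roughly what I have in mind for the append and read side, using the "each record is a complete new state" variant. The record format, the `put`/`get` names, and the `LOGDIR`/`NODE` variables are just placeholders, and replication isn't shown:

```shell
#!/bin/sh
# Placeholder record format, one record per line, tab-separated:
#   <UTC timestamp>  <node>  <object>  <op>  <value>
LOGDIR=${LOGDIR:-./logs}      # one log file per node, e.g. logs/node1.log
NODE=${NODE:-$(hostname)}
mkdir -p "$LOGDIR"

# Append a change record -- only ever to this node's own log; the other
# nodes' logs arrive read-only via replication (ssh, not shown here).
put() {  # usage: put <object> <value>
    printf '%s\t%s\t%s\tset\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        "$NODE" "$1" "$2" >>"$LOGDIR/$NODE.log"
}

# Latest state of an object: scan every log we can see, keep records for
# that object, sort by timestamp (ISO-8601 sorts correctly as text), and
# take the value of the most recent record -- i.e. last-write-wins.
get() {  # usage: get <object>
    cat "$LOGDIR"/*.log 2>/dev/null \
        | awk -F'\t' -v obj="$1" '$3 == obj' \
        | sort | tail -n 1 | cut -f5
}
```

Log rotation would just mean writing one `set` record per live object into a fresh log and deleting the old files; `get` doesn't change at all, since it already merges whatever logs exist.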
Anyone happen to know of something on this kind of level of simplicity already out there?