

Reclaiming Cyberspace

13 January 2019
Tags: theseus

Cyberspace is the “place” where a telephone conversation appears to occur. Not inside your actual phone, the plastic device on your desk. Not inside the other person’s phone, in some other city. The place between the phones. The indefinite place out there, where the two of you, two human beings, actually meet and communicate.

Bruce Sterling, The Hacker Crackdown (1992)

At first, no one really knew what the internet was for. Some people knew what it was, and they were coming up with all sorts of uses for it – bulletin boards, instant messaging, mailing lists – but there was this general sense that the internet’s potential went far beyond any of that. Bigger things were coming; it’s just that no one quite knew what they would look like. To capture this sense of excited uncertainty, sci-fi authors of the ’80s and ’90s leaned heavily on the idea of cyberspace.1

As a metaphor, cyberspace caught on because it seemed to capture the sheer breadth of possibilities that exist with networked technology. By analogy to physical space, which of course is the venue for all analog human experience, cyberspace seemed to suggest a limitless new domain: new experiences, new possibilities, new potentialities. It’s an exciting thought. Fast-forward to 2019, and we’ve fallen a bit short of expectations. Many people’s experience of the modern internet consists of four social media sites, each largely made up of screenshots from the other three.

OK, maybe that’s a bit much. But grant me that bit of hyperbole and, in return, I’ll let you in on a secret: cyberspace hasn’t gone away. We’re still in it, we’re just spending our time in all the wrong parts of it.

More specifically: we’re spending our time in (cyber-)spaces that are privately owned. They’re private property, controlled and run by large companies that stay in business by treating their userbase as a product. Their business models are fundamentally abusive: their goal is to take as much advantage of you as they can without driving you away. In fact, so much of the modern internet consists of these spaces that their downsides have come to appear as downsides of the internet itself. This is an easy mindset to fall into, but it is nonetheless mistaken.

The issues of private ownership can be avoided by building public spaces on the internet – distributed systems with shared ownership, mediating shared experiences. Networks of volunteers coming together to create entire new categories of common goods. Public cyberspace.

Previous attempts at building such spaces (albeit under different names) have encountered seemingly insurmountable challenges. However, recent research2 has offered new hope for overcoming these. If successful, the benefits of the resulting public cyberspace over our current systems would be significant. Let me explain.

Architecturally, the modern web is built around a client-server model: data requesters (like you, loading up Twitter) and data providers (like Twitter, sending you fresh tweets). This data lives on servers; the servers live in large air-conditioned rooms with armed guards; and these servers dole out data to clients like yourself at will.

What sort of cyberspace have we architected here? Who, if anyone, owns it? Does the service’s owner have power over you? Certainly they do, and certainly they own the space – though most companies go to great pains to hide this fact.

How do they hide it? With the same tricks that have worked for generations. Early on in The Hacker Crackdown, while Bruce Sterling is chronicling the development of the phone network, he recounts how, in the late 1800s, telephone technology was first normalized:

Contemporary press reports of the stage debut of the telephone showed pleased astonishment mixed with considerable dread. Bell’s stage telephone was a large wooden box with a crude speaker-nozzle, the whole contraption about the size and shape of an overgrown Brownie camera. Its buzzing steel soundplate, pumped up by powerful electromagnets, was loud enough to fill an auditorium. Bell’s assistant Mr. Watson, who could manage on the keyboards fairly well, kicked in by playing the organ from distant rooms, and, later, distant cities. This feat was considered marvellous, but very eerie indeed.

… After a year or so, Alexander Graham Bell and his capitalist backers concluded that eerie music piped from nineteenth-century cyberspace was not the real selling point of his invention. Instead, the telephone was about speech – individual, personal speech, the human voice, human conversation and human interaction. The telephone was not to be managed from any centralized broadcast center. It was to be a personal, intimate technology.

When you picked up a telephone, you were not absorbing the cold output of a machine – you were speaking to another human being. Once people realized this, their instinctive dread of the telephone as an eerie, unnatural device, swiftly vanished. A “telephone call” was not a “call” from a “telephone” itself, but a call from another human being, someone you would generally know and recognize. The real point was not what the machine could do for you (or to you), but what you yourself, a person and citizen, could do through the machine.

Modern social media sites have learned this lesson well: while the individual interactions taking place over the web are always between you and various servers, the perceived interactions are between you and your friends3 (or enemies, if you’re that particular sad sort of person). It’s easy to forget that everything you do is being proxied through these third-party servers. This humanization of the technology obscures an important fact: social media services’ interposition into your interactions with your friends effectively places these interactions on private property belonging not to you or your friends but to the service you are using.

Now, there are some big advantages to building web sites with this centralized, business-run, private-property model. We’ve spent a few decades doing things this way, and by now we’ve gotten pretty good at it. There are lots of centralized web sites that we don’t yet know any other way of building.

That said, there are also significant drawbacks to this model. Servers aren’t cheap. Most web sites have needed to make money to stay online, which turns out to be pretty difficult. Hence the advent of online advertising, which has since been described (by people who would know) as the original sin of the internet.

Decades down the road, we’ve discovered that the digital ad ecosystem is just as bad as any other funding model we’ve tried – it just has a less obvious failure state. However, the lifeblood of this ecosystem – tracking data, the more personal the better – has turned out to be tremendously valuable to a lot of companies for a lot of reasons.4

And so, even as ad-based revenue models are looking more and more like a dead-end, the tracking methods used to deliver targeted ads have turned out to stand on their own as a revenue source. They are goldmines of detailed personal information. This information is readily bought and sold, and – needless to say – the groups buying information about you rarely have your best interests in mind.

For starters, this information is used to slash credit limits and deny loans, to raise interest rates, to enable hate speech, to single out psychologically vulnerable individuals for manipulation,5 to enable domestic violence… The list goes on and on. When handled sloppily, this data can out – and has outed – pseudonymous performers, members of anonymous recovery groups, recipients of mental health care services, and more. The chilling effect of pervasive surveillance on civil society and on civil dissent in particular6 is also well-documented, and has been for quite some time. This surfeit of surveillance causes immeasurable actual harm to actual people.

All the conduct described above is repulsive beyond words. It constitutes fundamental breaches of human trust, committed casually and constantly. But, of course, the market doesn’t care about that. Companies have a profit incentive to collect and monetize this data, so collect and monetize they do. And as long as we’re spending all our time in their private corners of cyberspace, there’s nothing we can do about it. It’s like we’re carrying out our entire social lives in a mall’s highly surveilled food court, with the mall transcribing all our conversations and selling them to the highest bidder.7

One response to these unintended consequences of centralized systems has been to try to decentralize by building new, federated8 systems for web sites (with special focus on social media sites, since that’s where most of us spend most of our time online – and also where some of the most openly user-hostile companies are).

The idea is this: instead of having a service provided by a big, monolithic server (or set of servers) owned and operated by one company, you have a number of smaller, volunteer-run servers, likely (though not necessarily) run out of different physical locations. These servers collaborate to provide a similar experience to centralized web sites. Distributing responsibility in this way also distributes ownership (somewhat!), limiting the amount of data that can be collected on users.9

The move from centralized to federated sites brings some welcome changes. One can have a somewhat greater degree of confidence that the platform is respecting one’s privacy. This is like swearing off the food court and hanging out at someone’s house instead: It’s chill, as long as they’re chill. But you’re still on someone’s private property, so you’re still placing trust in someone, and things can still go very wrong.

Perhaps the most popular federated social media site in 2019 is Mastodon, which is actually doing pretty well as these things go but still has far to go before it can claim serious adoption (and it might want to find a better word than ‘toot’ for its version of tweets – but then again, what do I know).

With all federated services, you make a number of compromises. You lose the certainty of a monolithic corporation data-mining everything you do or say through them, but you pick up the risk of certain individual strangers trying to do the same. You also place your trust in these strangers’ abilities as sysadmins, which is a lot of responsibility to leave on an unpaid stranger’s shoulders, especially long-term.

You also lose the skilled, well-paid security teams of these large companies – teams who work hard every day to protect your data by making sure it gets leaked only to those who are paying for it and no one else. In their place you have some poor specific individual sucker, and you just really hope they’re doing their best, even though they’re definitely not getting paid enough for this shit.10

Overall, federation represents an improvement on centralization, but it carries over a number of centralization’s problems, a few of which it makes much worse. The story of federated systems is in many ways the story of centralized systems, just writ small and heavily repeated.

If we want to get away from this set of issues, perhaps it makes sense to get away from centralization entirely. The polar opposite of a centralized system is a distributed system: a peer-to-peer network that provides services through the power of mutual aid. Every peer shares a small amount of responsibility for the system’s integrity. These networks are a little bit like employee-run companies that anyone can join – and incredibly enough, they actually work.

When built well,11 these systems are surprisingly robust and can create reliable services that somehow exist between the peers involved in maintaining them – carving out what might reasonably be considered public cyberspace.

What does public cyberspace look like? Well, perhaps the most famous example of a distributed, peer-to-peer system is BitTorrent, which was recently reported to constitute 22% of all internet traffic globally (for scale: Netflix claims to be responsible for 15%). BitTorrent is perhaps best known for its popularity among the pirating community (that 22% stat has been known to spike by up to ten percentage points during Game of Thrones season) but it is also used to speed up downloads for video game installations and updates, for open source software’s installers, for quickly mirroring data from vital services like The Internet Archive, and much more. Rumor has it that Facebook also uses BitTorrent internally for moving large files around.

From a design perspective, the premise of BitTorrent is this: you break large files up into small chunks, and you download a bunch of these chunks at a time in parallel from a bunch of different people. Once you’ve downloaded part of a file, you can start sending those parts to other downloaders, and perhaps they will reciprocate (or perhaps they won’t – and that’s ok, because there are enough generous uploaders that a few unhelpful peers don’t spoil the thing for everyone). In the end, as long as the honest peers in the network collectively possess all parts of the file, everyone will be able to download the whole thing – and the whole process is usually a lot faster and more fault-tolerant than it would be if you were downloading from a single source (like, say, a web server).
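The chunk-and-verify mechanics can be sketched in a few lines of Python. This is an illustrative simplification rather than real BitTorrent wire code – actual clients speak a binary peer protocol, and piece sizes vary per torrent – but the core idea of fixed-size pieces, each independently checkable against a SHA-1 hash recorded in the torrent metadata, looks like this:

```python
import hashlib

PIECE_SIZE = 256 * 1024  # a common BitTorrent piece size

def split_into_pieces(data: bytes, piece_size: int = PIECE_SIZE):
    """Break a file into fixed-size pieces, as a torrent client would."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]

def piece_hashes(pieces):
    """The .torrent metadata records a SHA-1 hash for every piece."""
    return [hashlib.sha1(p).digest() for p in pieces]

def verify_piece(piece: bytes, expected_hash: bytes) -> bool:
    """A downloader checks each received piece against the known hash,
    so a dishonest peer cannot slip in corrupted data."""
    return hashlib.sha1(piece).digest() == expected_hash

# Pieces can now be fetched from different peers in any order; each one
# is independently verifiable the moment it arrives.
file_data = b"x" * (3 * PIECE_SIZE + 100)
pieces = split_into_pieces(file_data)
hashes = piece_hashes(pieces)
assert all(verify_piece(p, h) for p, h in zip(pieces, hashes))
```

Because each piece verifies on its own, pieces can arrive from different peers in any order – which is exactly what makes the parallel, fault-tolerant download possible.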

To echo an earlier question: what sort of cyberspace are you taking part in when you join a BitTorrent peer swarm? Does anyone own this space? The network itself is totally ad-hoc (although your introduction point to it might not be). You have no client-server relationships, only peer-to-peer. There are no mediators; no social-media giants are interposing themselves into your interactions. This is not to say you’re completely secure. You still place some small amount of trust in your peers. However, the range of misbehavior that peers are capable of is much, much more limited than what a corporation12 can and will do.

Of course, BitTorrent is a pretty limited application. It does one thing, and does it well – but it only does one thing. What else are these peer-to-peer systems capable of? In spite of decades of research, that is still something of an open question in 2019, largely because there doesn’t seem to be a lot of money to be made by answering it. Nevertheless, it’s an important question and one that I’d very much like to see more exploration of. Much of the untapped potential implied by the idea of “cyberspace” lies in peer-to-peer systems. We’ll go into more depth on this later.

Returning for a moment to the example of BitTorrent: I mentioned earlier that one sticking point with BitTorrent is getting an introduction to a peer swarm. This turns out to be a challenge shared by almost all peer-to-peer systems: Once you’ve met some peers, you’re set, but getting to that point can be tricky. How is it done today?

The “traditional” way to get started is via a peer tracker, a centralized service that peers can register themselves with and which in turn serves up a list of registered peers on demand. Not only is this a central point of surveillance, it is a central point of failure.
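To make the tracker's dual nature concrete, here is a toy version. This is a hypothetical sketch – real trackers speak an HTTP or UDP announce protocol – but the essential data structure really is just a map from torrents to peer lists:

```python
from collections import defaultdict

class Tracker:
    """Toy sketch of the centralized peer-registration service described
    above (not the real tracker protocol)."""

    def __init__(self):
        # infohash -> set of (host, port) peer addresses
        self._swarms = defaultdict(set)

    def announce(self, infohash: str, peer_addr: tuple):
        """A peer registers itself with a torrent's swarm..."""
        self._swarms[infohash].add(peer_addr)
        # ...and receives the other registered peers in return.
        return [p for p in self._swarms[infohash] if p != peer_addr]

tracker = Tracker()
assert tracker.announce("abc123", ("10.0.0.1", 6881)) == []
peers = tracker.announce("abc123", ("10.0.0.2", 6881))
assert peers == [("10.0.0.1", 6881)]
```

Every peer's address passes through the single `tracker` object: that one object is both the central point of surveillance and the central point of failure.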

A step beyond running a peer tracker is to offer mirrors, which are essentially a federation of the tracker service. In this strategy, a bunch of folks grab the same content index and start serving up their own trackers from it, essentially forcing copyright regulators and their ilk to play whack-a-mole trying to take everyone down. This is rarely a permanent solution, but at the very least it can buy some time.

A step further is to use magnet links. The way these work, frankly, borders on magic. A magnet link is a small string, short enough to easily copy-paste or even transcribe by hand (if you’re a good typist). This string defines an address where one can look to get details of a torrent, and a list of peers for that torrent. Where on the internet do you look for this address? Not on a web site, but rather in something called a distributed hash table or DHT.
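Concretely, a magnet link is a URI whose query string carries the torrent's infohash – the address your client will look up – along with optional metadata such as a display name and fallback trackers. A minimal parser (the hash below is an arbitrary example value):

```python
from urllib.parse import urlparse, parse_qs

def parse_magnet(link: str) -> dict:
    """Extract the infohash, display name, and tracker URLs from a magnet link."""
    parsed = urlparse(link)
    assert parsed.scheme == "magnet"
    params = parse_qs(parsed.query)
    # "xt" (exact topic) carries the infohash as a urn:btih: string.
    infohash = params["xt"][0].split(":")[-1]
    return {
        "infohash": infohash,                  # the DHT lookup address
        "name": params.get("dn", [""])[0],     # optional display name
        "trackers": params.get("tr", []),      # optional fallback trackers
    }

info = parse_magnet(
    "magnet:?xt=urn:btih:c12fe1c06bba254a9dc9f519b335aa7c1367a88a&dn=example"
)
assert info["infohash"] == "c12fe1c06bba254a9dc9f519b335aa7c1367a88a"
```

That 40-character hex string is the entire payload – everything else about the torrent gets resolved by asking the network.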

A DHT is just a very simple database that a bunch of peers collectively maintain. It uses a peer-to-peer protocol, like BitTorrent does, but this protocol offers something different. It lets you store small chunks of data (like, say, the contact info other peers need to connect to you) at addresses (like the one you get from your magnet link), and it lets you query any address to see all the data everyone has stored there.

Anyone can connect to it, anyone can store data on it, and anyone can get data from it. Some people maintain very long-lived presences in the network, serving as reliable points of reintroduction, and trackers for DHT peer swarms exist as well. This is where your computer goes to resolve magnet links, and incredibly enough, it works.
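The store-and-query interface a DHT exposes is genuinely minimal. Here is a toy in-memory stand-in: in a real DHT the keyspace is partitioned across thousands of peers and every operation is a network exchange, but the interface looks roughly like this:

```python
import hashlib
from collections import defaultdict

class ToyDHT:
    """In-memory stand-in for the distributed hash table described above;
    a single dict plays the role of all the peers."""

    def __init__(self):
        self._store = defaultdict(list)  # address -> all values stored there

    def put(self, address: bytes, value: bytes):
        """Store a small chunk of data (e.g. your contact info) at an address."""
        self._store[address].append(value)

    def get(self, address: bytes) -> list:
        """Query an address and see everything everyone has stored there."""
        return list(self._store[address])

dht = ToyDHT()
addr = hashlib.sha1(b"some-magnet-infohash").digest()
dht.put(addr, b"peer-A 203.0.113.5:6881")
dht.put(addr, b"peer-B 198.51.100.7:51413")
assert len(dht.get(addr)) == 2
```

Resolving a magnet link amounts to calling `get` on the address derived from the infohash and connecting to whichever peers have `put` their contact details there.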

To return to our earlier question: what sort of (cyber-)space is this? Like I said, I’m inclined to look at it as public cyberspace. It’s like a public park or a library: something which exists because a small number of people care enough to look after it, and which therefore is available for a huge number of people to use and enjoy, free of cost. No one owns it – we just coexist within it. No one needs to own it.

There’s a reason this seems almost too good to be true: it almost is. Malicious interference in peer-to-peer systems turns out to often be very easy, and defending against this is often nontrivial. BitTorrent protects itself by using lots of integrity checks – as long as you have a torrent file describing the file you’re trying to download, you’ll be able to validate every chunk of it and keep your peers honest – but with a construct like a DHT, the contents of which are constantly changing, there is no such single source of truth. As a result, it turns out that most real-world DHTs are very easy to mess with.

Here’s an example: The Sybil attack consists of deploying a huge number of peers on the network – peers which you control, but which look like distinct entities to anyone else. This gives you a lot of leverage, and while each DHT protocol fails differently, they all do fail sooner or later. It has been shown that perfect defense against Sybil attacks is practically impossible (PDF link). Of course, if we’re keeping score, it could be argued that perfect defense against standard threats for a large corporation is practically impossible too. The DHT that most magnet link resolutions go through, Mainline DHT, makes no attempt at defending against Sybil attacks, and so it experiences many of them each day (PDF link). These attacks are easy to carry out and have the potential to completely and arbitrarily censor any data from Mainline DHT.
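To make the mechanics concrete, here is a small simulation – a hypothetical sketch, not an attack on any real network. In Kademlia-style DHTs like Mainline, the peers whose IDs are XOR-closest to an address are responsible for the data stored there, so an attacker who can mint identities for free simply mints until its identities dominate the neighborhoods it cares about:

```python
import hashlib

def node_id(seed: str) -> int:
    """Derive a 160-bit ID, the way DHT nodes hash keys and identities."""
    return int.from_bytes(hashlib.sha1(seed.encode()).digest(), "big")

def k_closest(ids, target, k=8):
    """Kademlia-style lookup: the k nodes XOR-closest to a target address
    are the ones responsible for data stored at that address."""
    return sorted(ids, key=lambda i: i ^ target)[:k]

target = node_id("some-magnet-infohash")
honest = {node_id(f"honest-{i}"): "honest" for i in range(1_000)}

# Identities cost nothing, so the attacker mints twenty times as many;
# whichever land near the target crowd honest nodes out of the
# responsible set -- and with it, control what "exists" at that address.
sybils = {node_id(f"sybil-{i}"): "sybil" for i in range(20_000)}

everyone = {**honest, **sybils}
winners = k_closest(list(everyone), target)
print(sum(everyone[w] == "sybil" for w in winners), "of the 8 closest are Sybils")
```

Once the attacker's identities hold a neighborhood, they can drop, alter, or fabricate any data stored there – which is exactly the censorship capability described above.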

While “perfect” defense against Sybil attacks may not be possible, effective mitigations do exist. The Theseus DHT project, which I have been working on since 2016, exists to explore these. The approach taken is much more analytic than most previous work, and the results this project has achieved so far are very encouraging.13

A construct like Theseus DHT, if successfully realized, would provide a tangibly useful service. It would be able to support many sorts of ephemeral data sharing and storage, serving as a data layer for entirely new app architectures. As for what types of apps Theseus DHT could support, the sky’s the limit. Here are some examples.

The project’s namesake and original motivation is Theseus, a planned service for sharing, storing, and searching research, the design of which is described in some depth here.14 From that post: Imagine a system with the reference features and massive scope of the modern scientific paper repositories – arXiv, Embase, et al. – but completely decentralized and peer-to-peer. The processing power required for all of the system’s features, including searching for and downloading papers or data, would be shared across all the network’s participants. Anyone with an internet connection could join and use a friendly web interface to immediately access anything stored in the system.

How about a secure chat app with no central list of users, where identities are established via public keys and all messages are encrypted? The message transport could be built using existing protocols like OTR or Signal, with peer introductions handled using the ideas laid out here (or in any number of other ways).

As for social media, one common complaint about Twitter is that old tweets are rarely more than a liability, and many services exist for automatically deleting them. Newer social media and chat apps typically build in ephemeral messaging by default (think Snapchat, or Signal’s expiring texts). In the same spirit, a Twitter-like app with ephemeral tweets stored directly on the DHT would be straightforward to build.
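As a hypothetical sketch of how little machinery such an app would need: here a dict stands in for the DHT, posts live at an address keyed by their author, and ephemerality falls out of simply never returning stale posts (in a real deployment, peers would likewise age entries out of their stores):

```python
import time
from collections import defaultdict

class EphemeralFeed:
    """Hypothetical sketch of a Twitter-like feed with expiring posts;
    the dict stands in for DHT storage."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = defaultdict(list)  # author -> [(expiry_time, post)]

    def post(self, author: str, text: str, now: float = None):
        now = time.time() if now is None else now
        self._store[author].append((now + self.ttl, text))

    def read(self, author: str, now: float = None) -> list:
        now = time.time() if now is None else now
        # Expired posts are simply never returned.
        return [text for expiry, text in self._store[author] if expiry > now]

feed = EphemeralFeed(ttl_seconds=60)
feed.post("alice", "hello, cyberspace", now=0)
assert feed.read("alice", now=30) == ["hello, cyberspace"]
assert feed.read("alice", now=90) == []
```

Expiry here is a convention enforced by readers; a production design would also need signatures so that only "alice" can post as alice, which is where the public-key identities from the chat example come back in.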

Other, more complex forms of social media could be supported as well, although these would be significant projects in their own right. If this sounds interesting to you, hit me up for an invitation to the Theseus DHT Slack channel15 and we can brainstorm.

It would be straightforward to build a distributed mirror of the Internet Archive which dynamically retrieves content based on the torrents they publish. Of course, for the system to be fully decentralized, an index of these torrents would have to be stored in the DHT as well, but that shouldn’t be too difficult.

As for more ambitious ideas: how about automatic mirroring for sensitive data (e.g. with people volunteering to mirror encrypted copies of data leaks, to ensure journalists can maintain access to them even under extreme circumstances) – a service that would use largely the same protocols as Theseus, and which could conceivably even be integrated with systems like SecureDrop to automatically back up encrypted copies of anything submitted to a newsroom as soon as it arrives.16

Theseus DHT is also a natural fit for making introductions to peer swarms. The fact that Theseus DHT is designed for resilience against Sybil attacks makes it preferable over MLDHT for this purpose, as does the fact that it uses strong, carefully-engineered encryption to help protect peers’ privacy even under pervasive surveillance.

What this means is that a system like Theseus DHT would be not only a powerful construct in its own right but also an ideal bootstrapping point for the next generation of peer-to-peer apps – a layer on which they can build, a substrate, a common good commonly used and maintained, with common benefit. Of course, all this depends on developers recognizing the potential that exists here and deciding to help develop this corner of cyberspace. That’s why I’m not necessarily saying this is the future. What I’m saying is: We should be so lucky as for this to be the future.

  1. Style note: Throughout this post, “cyberspace” refers to the word cyberspace when italicized, and to the concept of cyberspace otherwise. 

  2. Some of which has been carried out through the Theseus project, which you’ll hear more about later. 

  3. For instance, suppose you post a tweet, and a friend of yours retweets it. Technically, the interactions here are: you contact Twitter’s servers and ask them to store a new tweet from your account on your behalf; Twitter accepts your request; your friend contacts Twitter’s servers, asking for fresh tweets; Twitter accepts your friend’s request and shows them your tweet; your friend contacts Twitter’s servers again, asking to retweet your tweet; Twitter accepts this request as well; soon after, your Twitter app contacts Twitter’s servers of its own volition on your behalf, asking for new notifications; Twitter responds with the news that you have been retweeted. Note how you and your friend do not at any point interact directly over the network, even though the perceived interaction (i.e.: you post a tweet; your friend retweets it; you see that your friend has retweeted it) is between you and them. 

  4. For more on this, I cannot recommend Bruce Schneier’s book Data and Goliath highly enough. I’ve written about that book previously; you can see what I had to say here. Long story short: it covers an incredibly important subject, makes explosive claims from start to finish, and backs them up at the end with more than 100 pages of citations. You can probably find or request a copy at your local library. 

  5. From that article: Facebook isn’t a mind-control ray. It’s a tool for finding people who possess uncommon, hard-to-locate traits, whether that’s “person thinking of buying a new refrigerator,” “person with the same rare disease as you,” or “person who might participate in a genocidal pogrom,” and then pitching them on a nice side-by-side or some tiki torches, while showing them social proof of the desirability of their course of action, in the form of other people (or bots) that are doing the same thing, so they feel like they’re part of a crowd. 

  6. From Data and Goliath (Schneier, 2015, page 115): These chilling effects are especially damaging to political discourse. There is value in dissent. And, perversely, there can be value in lawbreaking. These are both ways we improve as a society. Ubiquitous mass surveillance is the enemy of democracy, liberty, freedom, and progress. Defending this assertion requires a subtle argument – something I wrote about in my previous book Liars and Outliers – but it’s vitally important to society. Think about it this way. Across the US, states are on the verge of reversing decades-old laws about homosexual relationships and marijuana use. If the old laws could have been perfectly enforced through surveillance, society would never have reached the point where the majority of citizens thought those things were okay. There has to be a period where they are still illegal yet increasingly tolerated, so that people can look around and say, “You know, that wasn’t so bad.” Yes, the process can take decades, but it’s a process that can’t happen without lawbreaking. Frank Zappa said something similar in 1971: “Without deviation from the norm, progress is not possible.” 

  7. Or, in Facebook’s case, selling them to just about every bidder. 

  8. The term derives from the concept of federation in political science, where it signifies the formation of a political unity, with a central government, by a number of separate states, each of which retains control of its own internal affairs. (source) 

  9. It is assumed, also, that altruistic volunteers are less likely to indulge in invasive tracking, as well as perhaps less technically capable of implementing it. There is, of course, no good way of validating this assumption. 

  10. In fact, usually they’re not getting paid at all. 

  11. The details vary from system to system, but usually “built well” entails building in thorough error-checking, lots of redundancies, or both. 

  12. (or even, in some cases, a federated system node’s maintainer) 

  13. In brief, a combination of tactics is used to make Sybil attacks expensive to perform, easy to detect, and easy to compensate for once detected. The network is also designed to admit straightforward mathematical analysis, which allows for formal validation of each of the aforementioned tactics’ effectiveness. Technical write-ups are forthcoming. Hit me up on Twitter if you’d like a peek at the drafts :) 

  14. This project, which places an unusually strong emphasis on high availability and resistance to censorship, was originally motivated by explicit threats of censorship of climate research made by the Trump EPA. Ideologically, it has much in common with existing systems like Sci-Hub. 

  15. Anyone is welcome, but I’m not sharing the link publicly quite yet, because (at least for now) I want to keep the group small enough that everyone can know everyone. 

  16. In terms of security, this would be something of a trade-off, as it could allow interested parties to monitor SecureDrop activity indirectly, and could result in ongoing data disclosure from one-time key compromise. Proper implementation would require regular key rotation, frequent “junk” uploads to obfuscate legitimate traffic, extremely careful key management, and more. That said, if such a system were properly implemented, it would offer tremendous data resilience even against powerful adversaries.