Zerocash: Decentralized anonymous payments from Bitcoin Ben-Sasson et al., 2014

Yesterday we saw that de-anonymising techniques can learn a lot about the true identities of participants in Bitcoin transactions. Ben-Sasson et al. point out that given this, Bitcoin could be considered significantly less private than traditional schemes:

While users may employ many identities (or pseudonyms) to enhance their privacy, an increasing body of research shows that anyone can de-anonymize Bitcoin by using information in the blockchain, such as the structure of the transaction graph as well as the value and dates of transactions. As a result, Bitcoin fails to offer even a modicum of the privacy provided by traditional payment systems, let alone the robust privacy of anonymous e-cash schemes.

(Think about it, once de-anonymised, the complete record of all your transactions – amounts, dates, recipients and so on – becomes public record).

One possible solution is to use mixes (aka laundries or tumblers) that pool and mix coins using a trusted central party. This is not for the average user, the authors claim. Besides, having anonymity depend on a trusted central party seems at odds with a decentralised payment system. So Ben-Sasson et al. set out to design something better, and the result is Zerocash:

To protect their privacy, users thus need an instant, risk-free, and, most importantly, automatic guarantee that data revealing their spending habits and account balances is not publicly accessible by their neighbors, co-workers, and the merchants with whom they do business. Anonymous transactions also ensure that the market value of a coin is independent of its history, thus ensuring that legitimate users’ coins remain fungible.

This is the second of Narayanan and Miller’s ACM Queue Research for Practice selections on ‘Cryptocurrencies, Blockchains, and Smart Contracts.’ Zerocash, they say, is the most radical of the alternative cryptocurrency designs. We also get a warning that “the paper is long and intricate, and the underlying mathematical assumptions are fairly new by cryptographic standards.” And long it is, at 56 pages. Thankfully the paper is more approachable than I feared, although many of the cryptographic assertions I just have to take on faith. My goal for today’s write-up is simply to give you a high-level sense of how Zerocash works.

A decentralised anonymous payment (DAP) scheme

There are two nice constructions in the paper that help to tame some of the complexity. Firstly, we get an abstract definition of a decentralised anonymous payment (DAP) scheme (section 3), which allows us to reason about the operations without being burdened by particular cryptographic scheme: the DAP scheme is then instantiated using a particular set of cryptographic primitives to form the foundation for Zerocash. Secondly, we get a 6-step gradual build up (section 1.3) to the full complexity of the scheme that helps to build intuition for the roles of the various pieces.

A DAP is built on top of an underlying append-only ledger-based currency such as Bitcoin, call it the Basecoin. The ledger includes Basecoin transactions, as well as two new types of transactions: mint transactions and pour transactions. Users of the scheme generate (at least one) address key pair with a public key enabling others to direct payments to the user, and a secret key used to send payments. Coins are of course just data objects. A coin c has the following attributes:

• A coin commitment, $cm(c)$, which is a string that appears on the ledger once the coin is minted.
• A coin value, $v(c)$, measured in basecoins. This is an integer between 0 and some system maximum.
• A coin serial number, $sn(c)$, a unique string associated with the coin used to prevent double spending
• A coin address, $addr_{pk}(c)$, an address public key, representing the owner of c

Coins may have other attributes, but these are implementation details of particular DAP instantiations.

Mint transactions are tuples $(commitment, value, *)$ (where ‘*’ denotes implementation dependent information. A $tx_{mint}$ transaction records that a coin with a given commitment and value has been minted.

Pour transactions record the ‘pouring’ of two old (and now consumed) coins into two new output coins as well a possible public output (removing value from the DAP network). More on these later.

Given this setup, a DAP scheme comprises 6 abstract operations:

1. Setup is a one-time operation executed by a trusted party to initialise the system and publish its public parameters. After this setup no trusted party is needed and no global secrets or trapdoors are kept.
3. Mint generates a coin of a given value and a mint transaction
4. Pour transfers value from input coins to new output coins, marking the input coins as consumed. “Pouring allows users to subdivide coins into smaller denominations, merge coins, and transfer ownership of anonymous coins, or make public payments.
5. VerifyTransaction checks the validity of a transaction: both mint and pour transactions must be verified before being considered well-formed.
6. Receive scans the ledger and retrieves unspent coins paid to a particular address.

A DAP guarantees a number of security properties (see section 3.4) called ledger indistinguishability, transaction non-malleability, and balance. Ledger indistinguishability is the guarantee that the ledger reveals no new information to an adversary beyond what is public in the ledger, non-malleability prevents attackers from modifying other’s transactions before they are added to the ledger, and balance guarantees that no adversary can own more money than he or she minted or received via payments from others.

Instantiating a DAP using zk-SNARKS!

(Here be dragons.)

A SNARK is a Succint Non-interactive ARgument of Knowledge, and a zk-SNARK is a zero-knowledge proof of knowledge. The succint property means that proofs are short and easy to verify. Research on zk-SNARKs goes back as far as 1989.

There are many zk-SNARK constructions in the literature. We are interested in zk-SNARKs for arithmetic circuit satisfiability, and the most efficient ones for this language are based on quadratic arithmetic programs; such constructions provide a linear-time KeyGen, quasilinear-time Prove, and linear-time Verify.

Let’s just take it on faith that such things exist. Informally, they allow Alice to make verifiable statements such as “I own 30 bitcoins” without revealing any information beyond the fact that she knows some secret keys that control 30 coins.

Crucially, such proofs can be obtained for any statement that can be verified to be true by use of an efficient computation involving auxiliary inputs such as trapdoors and passwords (such statements are called ‘NP statements’.

Zerocash is an instantiation of a DAP that makes judicious use of zk-SNARK (since it is a relatively heavy cryptographic construct, we don’t want to use it everywhere). zk-SNARK are used to prove/verify Pour transactions. Other parts of the system also make use of collision-resistant hashing and pseudorandom functions based on SHA256, digital signatures (ECDSA), and public-key encryption (Elliptic-Curve Integrated Encryption Scheme, ECIES). The zk-SNARK instantiation itself is also based on SHA256:

This, along with judicious parameter choices, and a state-of-the-art implementation of a zk-SNARK for arithmetic circuits, results in a zk-SNARK prover running time of a few minutes, and zk-SNARK verifier running time of a few milliseconds. This allows the DAP scheme implementation to be practical for deployment, as our experiments show.

Building up an intuition

Section 1.3 of the paper does a good job of slowly building up to the full Zerocash solution over six steps. To do it proper justice would double the length of this write-up, but here’s a flavour of what’s involved:

1. The simplest base system provides for user anonymity using fixed value (e.g. 1 BTC) coins. Coins are minted by sampling a random serial number and trapdoor r, and from these computing a coin commitment. This first phase depends on an ever growing ledger of all coin commitments. Coins can be spent by generating zk-SNARK proofs of the form “I know r such that a coin commitment with this serial number sn appears in the ledger.” (sn must not already appear on the ledger as part of a past spend transaction, this requirement prevents double spending).
2. Keeping a linear list of all coin commitments is inefficient, instead lets keep the ledger using an efficiently updateable, append-only, collision-resistant hash-based Merkle Tree. This reduces time and space complexity from linear to logarithmic. Using Merkle trees of depth 64, Zerocash can support $2^{64}$ coins.
3. The concept of addresses are introduced to provide for direct anonymous payments. Without this step, every previous owner of a coin can track its future spending because those owners know its serial number (sn). The pour operation is also introduced at this step for spending coins.
4. Sending the new coins generated by a pour to another user requires that the recipient know the secret values for that key. Rather than requiring these to be passed out of band, a key-private encryption scheme is introduced to enable them to be encrypted for the recipient’s eyes only.
5. At this juncture we can mint, merge, and split coins, but there is no way to redeem a coin an convert it back into the Basecoin currency (e.g., Bitcoin). The pour operation is modified to include a public output that can be used to specify the destination of redeemed funds (e.g. a Bitcoin wallet public key).
6. To prevent embezzlement by re-targeting the public output of a pour transaction, digital signatures are introduced so that any tampering can be detected.

This ends the outline of the construction, which is summarized in part in Figure 1 (below).”

Zerocash today

You can find the Zerocash project online at http://zerocash-project.org. The protocol is now being developed in a full digital currency, called Zcash:

If Bitcoin is like http for money, Zcash is https – a secure transport layer.

A fistful of bitcoins: characterizing payments among men with no names Meiklejohn et al., USENIX ;login: 2013

This week we’re going to be looking at the five papers from the ACM Queue Research for Practice selections on ‘Cryptocurrencies, Blockchains, and Smart Contracts.’ These papers are chosen by Arvind Narayanan and Andrew Miller, co-authors of the ‘Bitcoin and cryptocurrency technologies‘ book which I also highly recommend.

Their first choice looks at the question of anonymity in the Bitcoin economy. By exploiting information leaked when spending bitcoins and receiving change, Meiklejohn et al. were able to cluster bitcoin addresses and from there map significant flows of bitcoins through the network. I worked from the USENIX ;login: article that Narayanan and Miller link in their ACM Queue piece. There’s also a full paper version with some nice additional details from IMC ’13.

2013 is a long time ago in the world of cryptocurrencies. The point forcefully made in ‘A Fistful of Bitcoins’ remains true today – Bitcoin offers little in the way of anonymity. In fact, we get better and better at techniques for de-anonymising block chain transactions over time. See for example ‘Data-driven de-anonymization in Bitcoin.’ The Mt Gox and Silk Road communities discussed in today’s paper both spectacularly disappeared though. Mt Gox ceased trading in February 2014, after losing about $350M ($473M?) worth of Bitcoin (‘Mt Gox: The History of a Failed Bitcoin Exchange‘). The aftermath of Mt Gox and one of the ‘greatest digital heists of all time’ is still ongoing (‘Hedge funds gamble on Mt Gox bitcoin payout‘, FT Feb 13 2017). The Silk Road marketplace was shut down by the FBI in October 2013.

Let’s take a look at the de-anonymisation process…

Obtaining a seed set of bitcoin addresses

The first step on the journey was to obtain a small set of de-anonymised addresses (832). To do this, the authors simply engaged in transactions with a wide variety of bitcoin services (26 exchanges, 10 wallet services, and 25 different vendors – 9 of which used the BitPay payment gateway).

For example, by depositing bitcoins into an account at the biggest Bitcoin exchange, Mt. Gox, we are able to tag one address as definitely belonging to that service, and by later withdrawing those bitcoins we are able to identify another.

The full list of services used is shown below:

We additionally scraped various publicly claimed addresses that we found, such as user’s signatures in Bitcoin forums, although we were careful to use only tags for which we could perform some manual due diligence.

Knowing one or two Mt. Gox addresses though (for example) is just a drop in the ocean compared to the total number of addresses a service such as Mt. Gox will use (ultimately the authors identified over 500,000 addresses as controlled by Mt. Gox). Starting from this seed set of addresses therefore, the next step is to exploit properties of transaction inputs and outputs to group addresses into clusters owned by the same entities. If we know the identity of any one address in the cluster, then we can know them all, and by propagating this through the transaction graph it is possible to ultimately learn the identity behind large numbers of addresses.

In theory, the use of pseudonyms within Bitcoin provides a property called unlinkability, which says that users’ transactions using one set of pseudonyms should not be linked to their transactions using a different set of pseudonyms. In practice, however, certain properties of Bitcoin usage erodes this anonymity.

Suppose you want to spend (send) some bitcoins – for example to make a purchase costing 10 BTC with a merchant. You have bitcoins spread across multiple addresses – one containing 6 BTC and another containing 4 BTC…

One potential way to pay the merchant would be to create a new address, send the 4 BTC and 6 BTC to this new address, and then send the 10 BTC now contained in this new address to the merchant. (In fact, this is the method that preserves the most anonymity).

There’s a simpler and more efficient way though – transactions are allowed to have arbitrarily many inputs, so it’s possible to just use the 4 BTC and 6 BTC addresses as input to the same transaction.

(mining fees ignored).

This provides the first means by which we can cluster (link) addresses – if two addresses have been used as input to the same transaction, they are controlled by the same user. Since the sender must know the private keys corresponding to all input addresses, this assumption is strong.

When sending bitcoins the user creates a message containing the intended receiver of the bitcoins (identified by public key), and the transaction in which his pseudonym received the bitcoins. When the transaction is broadcast through the peer-to-peer network, peers verify that no other transaction has already referenced the same previous transaction. This mechanism prevents double-spending. A consequence is that all transactions are zero balance – the same number of bitcoins flow out as flow in.

So what do you do if you have an address containing 4 BTC, and you want to send a merchant only 3 BTC? The answer is to create a transaction with multiple outputs.

In this case, we create a transaction with two outputs: 3 BTC going to the merchant address, and 1 BTC in ‘change’ going to a change address. Change addresses therefore can be identified as belonging to the sender. How do we know that a given address is a change address though?

As a first step, we observe that in the standard Bitcoin client, a change address is created internally and is not even known to the user (although he can always learn it by examining the block chain manually). Furthermore, these change addresses are used only twice: once to receive the change in a transaction, and once to spend their contents fully as the input in another transaction (in which the client will create a fresh address to receive any change). By examining transactions and identifying the outputs that meet this pattern of one-time usage, we identify the change addresses. If more than one output meets this pattern, then we err on the side of safety and do not tag anything as a change address.

Using this technique, the authors identified 3.5 million change addresses, with an estimated false positive rate of 0.17%. As of April 2013 when this study was carried out, the block chain contained more than 16 million transactions involving 12 million distinct public keys. Using the change address heuristic, these 12 million keys were collapsed into 3.3 million clusters.

Results

By layering our clustering analysis on top of our ground-truth data (and thus transitively tagging entire clusters that contain previously tagged addresses), we were able to identify 1.9 million public keys with some real-world service or identity, although in many cases the identity was not a real name, but rather (for example) a username on a forum. Although this is a somewhat small fraction (about 16%) of all public keys, it nevertheless allows us to de-anonymize significant flows of bitcoins throughout the network.

The authors uncovered 500,000 addresses as controlled by Mt. Gox, and 250,000 addresses as controlled by Silk Road, and thus were able to observe interactions with these services. This does not de-anonymise the users participating in the transactions, but it does de-anonymise flows of bitcoins into and out of such services.

We followed the flow of bitcoins out of Silk Road (in particular, from one notorious address)) and from a number of highly publicized thefts to see whether we could track the bitcoins to known services. Although some of the thieves attempted to use sophisticated mixing techniques to obscure the flow of bitcoins, for the most part tracking the bitcoins was quite straightforward and we ultimately saw large quantities of bitcoins flow to a variety of exchanges directly from the point of theft (or the withdrawal from Silk Road).

Because you really need to go through an exchange to cash out of the Bitcoin economy, the authors concluded that using Bitcoin for money laundering does not seem to be particularly attractive!

My how you’ve grown

It’s interesting to see how Bitcoin and the block chain have grown since this 2013 article was first written. Firstly, looking at XBT (BTC) : USB exchange rate over time, you can see a big spike in the value of XBT in early 2014, but taking that out we’re left with what looks like a pretty good hockey stick…

Bitcoinblockhalf.com has good statistics on the Bitcoin block chain itself. Here’s a snapshot from Sunday 19th February, 2017:

You can go to a lot of effort to build social networking features or support into your app or website. If the goal is engagement directly within the app then at least you have something you can measure, even if cause and effect can be hard to untangle . But if the goal is to drive user behaviour outside of the app this becomes a much harder problem. An e-commerce example would be the desire to drive foot traffic into retail stores (see e.g., Facebook’s dynamic ads for retail). A more direct example is fitness apps want to drive physical activity outside of the app. In today’s paper, Althoff et al. analyse a dataset uniquely qualified to help us get a handle on the benefits of social networking features as they relate to changing real world (‘offline’) behaviour. The headline results suggest the social features are worth it: creation of new social connections leads to a 30% increase in in-app activity, 17% improvement in user retention, and 7% increase in real-world activity.

Many successful websites and apps use social networking features to appeal to their users, allowing them to interact, form social connections, post updates, spread content, and comment on other’s posts. Social networking features are ubiquitous and are not only used by online social networks, such as Facebook and Twitter. For example, news reading, online education, music listening, book reading, diet and weight loss, physical activity tracking, and many other types of modern computing applications all heavily rely on social networking… However the impact of the online social networks on user behavior remains elusive.

The dataset they study contains data on 791 million online and offline actions of 6 million users in a physical activity tracking app over the course of 5 years. This is four orders of magnitude more users, and six orders of magnitude more activity than datasets used in previous comparable studies on social network effects on physical activity. The following features also make the dataset ideally suited for study:

• There is no need to rely on self-reporting of activity (which is known to be unreliable) as the app tracks actual steps. Thus it becomes possible to reliable compare activity posts as a measure of online engagement, and steps taken as a measure of offline physical activity behaviour.
• The social networking features were first introduced 2 years after measurements began, so before and after social network introduction comparisons are possible
• The network contains both friend and follow actions. Friend requests require approval – the delay between a friend request being made and it being accepted proves especially useful in teasing out social effects. After a friend request is accepted the new friend’s activities appear in the user’s timeline and commenting is allowed.

The key idea of our approach is to identify a natural experiment where we control for the amount of intrinsic motivation and vary the effect of social edge formation. We achieve this by relying on the fact that social network edges only become active (i.e., expose users to notifications and posts), after being accepted by the target person. This means we can compare the population of users who all sent edge requests (i.e., they were all intrinsically motivated) but some of these requests get accepted immediately, while others get accepted with significant delay.

The study compares the difference in activity levels between 7 days before a friend request, and 7 days after. We can see how this varies between users whose friend requests were accepted immediately (within one day), and those whose friend requests were not accepted within seven days (8% of all friend requests, with a median time to acceptance of 25 days).

Untangling intrinsic motivation and social influence

Even when a friend request is not accepted, users sending such requests increase their average daily activity by 148 steps ($p < 10^{-3}$), suggesting that users sending a friend request are intrinsically motivated at the time.

When the friend request is immediately accepted though, average daily activity increases by 328 steps ($p < 10^{-15}$).

This means that the difference of 180 average daily steps can be attributed to an influence effect I from the social connection. This influence effect I is statistically significant (95% confidence interval between 74 and 236 steps; p < 10-3; Mann-Whitney U test).

Exposure to the new friend’s activity and status updates leads to higher physical activity.

How does joining a social network impact user behaviour?

What impact does the very first edge of a user have? I.e., what is the effect of joining the network for the first time.

Since the social network was introduced only after two years into our observation period, we can measure the effect its introduction had on user behavior. We show that joining a social network makes users more physically active for a period of several months, increases their number of posts within the app, and makes them more likely to continue to use the app compared to users in a matched control group.

The study compares paired users showing very similar activity levels for 4 weeks before one of them joined the social network, and the other one didn’t…

Joining the network gives an immediate 7% boost in activity, and the effect then gradually diminishes over a 20 week period – after which the two groups are statistically indistinguishable again.

Do social network features increase online engagement and retention?

A similar matching technique based on number of activity posts per day is used to compare users who started using the app around the same time (within 7 days of each other), but where one user started using the social network and the other didn’t.

Users who do join the network are 17% more likely to be retained after one year. Joining the network also gives a short term boost in in-app (as opposed to physical) activity, but this tails off over the year until the effect is indistinguishable.

What impact does adding each new friend have?

Perhaps unsurprisingly, the more friends you already have, the less the marginal benefit of adding an additional friend. The authors studied this effect for the first 5 friends:

It’s all quite predictable really

Finally, the authors show that it is relatively easy to predict behaviour change in the dataset using a combination of features including activity levels, type of social edge created, and user demographics (age, gender, BMI). The model can predict which users will be influenced to increase their activity after each creation (classifier) with high accuracy (78% AUC).

Beyond the words: predicting user personality from heterogeneous information Wei et al., WSDM 2017

Here’s a very topical paper! You may have seen the recent Motherboard piece, “The data that turned the world upside down,” describing how personality profiling was used to provide tailored messages to voters in the recent American elections. In the interest of balance, here’s the less widely read counterpoint: “The myth that British data scientists won the election for Trump.” In today’s paper, Wei et al. look at the how to generate more accurate personality profiles (underpinned by the same OCEAN model) by combining data from a variety of sources: words used in social media posts, use of emoticons, the avatars that users chose for themselves, and how users respond to posts/tweets in their timelines. The good news is that they’re not trying to use their improved method as a Weapon of Mass Manipulation to influence who gets to hold one of most powerful offices in the world. Instead they suggest personality knowledge can be exploited in good old e-commerce too!

Taking insurance companies as an example, it is easier to persuade users positive in agreeableness to buy their insurance. By comparison, users negative in agreeableness are less likely to accept the promotion.

The authors also build a chatbot, DiPsy, that uses background analysis of Weibo and WeChat account data (the user gives permission during the onboarding process) combined with conversational interactions to refine a personality profile of the user.

Currently, DiPsy serves as a personalized chatbot to evaluate user personality. Due to the wide use of personality in mental health examination, we hope to build DiPsy as a digital psychologist to diagnose, and treat users’ mental process through digital footprints and natural conversations in the near future. By determining users’mental status, DiPsy is able to conduct cognitive behavioral therapy (CBT) or early intervention to help at-risk users alleviate/manage their problems by changing the way they think and behave in a variety of therapeutic contexts.

So, used in an individual context and with explicit consent, automated personality profiling may be a useful tool. Used at large scale to automatically profile people based on public information (i.e. without explicit consent) and target messaging for them feels quite different. If you can’t buy personality-profile enriched consumer datasets in the data market yet (I didn’t go looking), I’m confident it’s only a matter of time…

Anyway, let’s get back to looking at the technology, which is what you probably came here for! There’s prior art in looking at the words people use to try and predict personality, but not much exploration of what else in our ‘digital traces’ much be useful to that end.

With the rise of web and social media, individuals are generating considerable digital traces besides users’ language features. Currently, the sequential text data in users’ social media contents are utilized for predicting user preferences and characteristics. However, to the best of our knowledge, there has been little exploration of how to predict user personality by leveraging the heterogeneous information embedded in these digital traces.

For personality profiling the “Big Five” OCEAN model is used, which scores people on attributes of Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. Using data collected from thousands of volunteers alongside ‘ground truth’ personality profiles of them collected using traditional questionnaire techniques, the authors discovered that there is indeed personality signal not just in the words we use in our tweets, but also in our use of emoticons, our selection of avatars, and the way we respond to messages in our timelines. For example, users that score highly in Agreeableness tend to use more smileys, and users that score highly in Neuroticism use more theatrical emoticons to exaggerate their emotions:

And here’s what your choice of avatar may be revealing about you:

The personality prediction model, which uses an ensemble of different methods, is designed to classify users that are biased towards either extreme positivity or negativity in one or more of the Big Five attributes, since this information and these users are most valuable for targeting.

Use too many 😀😀😀 smileys in your tweets and you’ll be pushing up your Agreeableness score, which could increase your chances of being targeted by advertising 🤔!

Each of the four basic inputs (tweet text, avatars, emoticons, and response patterns) are fed into their own classifiers which are then combined using a stacked generalization-based ensemble to give the final Big Five personality prediction for a user. The resulting system is called HIE, for Heterogeneous Information Ensemble.

Predicting personality from tweets

Tweet features for input to the classifier are engineered in four ways:

1. Classic LIWC (Linguistic Inquiry and Word Count), the state-of-the-art lexicon based method
2. Pearson correlation. For each of the OCEAN dimensions, the Pearson correlation between each word and the targeted personality is determined, and the resulting top 2000 words are used to assist personality prediction.
3. A bag-of-words approach which compensates for “LIWC’s limited capability to represent user’s linguistic patterns in short and informal texts” (i.e., tweets). The top 1,500 (Chinese) words and all punctuation are put in the bag. K-means clustering is then used to cluster the bag-of-words formatted data, and the number of items within each cluster (one cluster per trait) is used as the representation of the user.
4. A deep CNN model tuned for text:

Predicting personality from emoticons

The two emoticon strategies are Pearson correlation and emotion mapping. For Pearson correlation, the analysis is based on the top 50 emoticons that are strongly correlated with user personality. For emotion mapping, the 495 different types of emoticons are mapped into 8-dimension vectors, representing 8 different emotions drawn from Ekman’s emotion atlas.

Predicting personality from avatars

Avatar features are engineered through two model routes: a deep fully connected network and k-means clustering. Both make use of ResNets (Residual Networks, not the networks you find on college campuses!) to encode avatar images into 256-dimensional vectors.

… there are some interesting findings in the usage of avatars and emoticons. For instance, introverts tend to cover their face or show side face. Users high in Openness are more likely to use avatars with their friends, while users low in Openness prefer avatars with themselves only.

Predicting personality from response patterns

We emphasize great importance on the responsive pattern. In psychology studies, it is believed that different reactions to the
same scenarios during interaction reflect differences in user personality.

Each tweet is embedded into a vector by taking an average over word vectors for the words in the tweets (this simple scheme outperformed more complex approaches such as using RNNs). This is then combined with question-answer (tweet/retweet) vectors encoding the following interaction scenarios:

• A user tweets a message and his/her fans or followers comment on it
• A user retweets someone else’s message
• A user comments on someone else’s message

Overall HIE prediction accuracy

Here’s a look at how well HIE does (when using just tweets and avatars) compared to the state-of-art IBM Watson Personality Insights algorithm and the Mypersonality Facebook application algorithm:

Here EAR stands for Extreme Accuracy Rate, and is the precision score for predicted extreme users (positive or negative). EPR@P is the precision score for predicted positive extreme users, and EPR@N is the precision score for predicted negative extreme users.

Due to the demands for personalized services, reliability of predicted results is more important than detecting all potential individuals in a specific personality category. HIE perfectly meets this need of real-world usage. Depicted in Table 7.3.3 (below), we notice that users with high probability scores are aggregated in the two ends of the confidence interval. Moreover, users with extremely low scores are on the left end while users with extremely high scores are on the right end. It shows the ability of HIE to accurately distinguish the targeted extreme users from the others.

(I think the ‘neural’ users are also ‘neutral’ 😉 ).

RedQueen: An online algorithm for smart broadcasting in social networks Zarezade et al., WSDM 2017

Ssshh, don’t tell the folks in marketing ;). This paper starts out with a simple question “when’s the best time to tweet if you want to get noticed?,” detours through some math involving “solving a novel optimal control problem for a system of jump stochastic differential equations (SDEs),” and pops out again on the other side with a simple online algorithm called RedQueen.

The problem of being noticed is quickly translated into the proxy problem “how to stay at the top of followers’ feeds for as long as possible.”

… recent empirical studies have shown that stories at the top of their followers’ feed are more likely to be likely to be noticed and consequently liked or shared. Can we find an algorithm that helps a user decide when to post to increase her chances to stay at the top?

Testing against previous recommendations for optimal posting times and historical Twitter datasets shows tweets posted according to the RedQueen algorithm last up to 3.5x longer towards the top of follower’s feeds.

Modelling the problem

Consider a tweeting user (a broadcaster in the terminology used in the paper). Such a user can be modelled by a temporal point process, which is a fancy way of saying a stochastic process generating a sequence of discrete events (tweets) over time. If we just care about the number of tweets a user makes, we can simplify further and use a counting process representation: let $N(t)$ be the number of tweets up to time $t$. The number of tweets we’ll see in that time depends of course on how actively the user tweets, which can in turn be modelled by an intensity function $\lambda^{*}(t)$.

The intensity function used in the paper has two terms: the first term $\lambda_{0}(t)$ is a time-varying function modelling the publication of tweets by users on their own initiative; the second α-weighted term models the publication of additional messages (e.g., replies, retweets), by users due to the influence of previous messages.

The second term makes the intensity dependent on history and a stochastic process by itself.

The combination of a counting process $N(t)$ and an intensity function $\lambda^{*}(t)$ results in ‘a doubly stochastic Markov process, whole dynamics can be defined by a jump SDE.’

A collection of broadcasting users is a vector $\mathbf{N}(t)$ in which the ith dimension is the number of tweets broadcast by user i up to time t. $\mathbf{M}_{j\\i}(t)$ represents the times of the stories user j receives due to other twitter users she follows .

Given a broadcaster i and one of her followers j, we define the visibility function rij(t) as the position or rank of the most recent story posted by i in j’s feed by time t, which clearly depends on the feed ranking mechanism in the corresponding social network. Here, for simplicity, we assume each user’s feed ranks stories in inverse chronological order.

Given this simplifying assumption, the rank of a tweet is given simply by the number of tweets from other broadcasters that appeared in the feed after it.

Given a broadcaster i and her followers N(i), our goal is to find the optimal conditional intensity μi(t) = μ(t) that minimizes the expected value of a particular nondecreasing convex loss function of the broadcaster’s visibility on each of her follower’s feeds and the intensity itself, μ(t)…

Still following?

At this point section 4 in the paper shows how to break this problem down into smaller sub-problems using Bellman’s principle of optimality and the HJB equation. It all gets a bit hairy (for my abilities), but the authors end up showing that: “The optimal intensity for the when-to-post problem defined by Eq. 6 with quadratic loss and penalty functions is given by $u^{*}(t) = \sqrt{s(t)/q} r(t)$. (Trust me, that expression is a lot simpler than the equations building up to it!),

When to post – RedQueen

What the equation we just saw tells us is that the optimal intensity depends only on the position of the most recent post by user i in her follower’s feed. This means determining the best time to post can be done via sampling.

Algorithm 1 summarizes our (sampling) method, which we name RedQueen. Within the algorithm, othersNextPost() returns the time of the next event by other broadcasters in the followers’ feeds, once the event happens. In practice, we only need to know if the event happens before we post. Remarkably, it only needs to sample M(tf) times from a (exponential) distribution (if significance is constant) and requires O(1) space.

In the above, $s(t)$ is a time significance function $s(t) \geq 0$, which favours some periods of times (e.g., times in which the follower is online), and $q$ is a given parameter which trades-off between visibility and the number of broadcast posts.

The algorithm as defined above only deals with a single follower. To deal with n followers simply sum the $\sqrt{s/q}$ term over all n.

Experiments

An evaluation was conducted on a dataset of 1.7B tweets by 52 million users, with 1.9B directed follow links between them. For each broadcaster, equal significance was assigned to each follower, and q was tuned to match the number of tweets made by the broadcaster (with tolerance of 10%) during the two month period the dataset covers.

Figure 4 summarizes the results by means of box plots, where position over time and time at the top are normalized with respect to the value achieved by the broadcasters’ actual true posts during the two month period. That means, if y = 1, the optimized intensity achieves the same position over time or time at the top as the broadcaster’s true posts. In terms of position over time and time at the top, RedQueen consistently outperforms competing methods by large margins and achieves 0.28× lower average position and 3.5× higher time at the top, in average, than the broadcasters’ true posts – in fact, it achieves lower position over time (higher time at the top) for 100% (99.1%) of the users.

The authors also picked an individual user, and looked at the position over time of their most recent ‘real’ tweet against the most recent one posted by a simulation run of RedQueen.

without significance information, RedQueen posts at nearly an even pace. However, when we supply empirically estimated significance, RedQueen avoids tweeting at times the followers are unlikely to be active, i.e., the weekends, denoted by the shaded areas in panel (c) of Figure 5. Due to this, the average position (maximum position) falls from 389.45 (1085.17) to 425.25 (1431.0), but is still lower than 698.04 (2597.9) obtained by the user’s original posting schedule.

(Click on image for larger view).

Reducing controversy by connecting opposing views Garimella et al., WSDM 2017

Society is often polarized by controversial issues that split the population into groups with opposing views. When such issues emerge on social media, we often observe the creation of ‘echo chambers’, i.e., situations where like-minded people reinforce each other’s opinion, but do not get exposed to the views of the opposing side. In this paper we study algorithmic techniques for bridging these chambers, and thus reduce controversy.

This is the follow-up to the paper on detecting controversial topics that we looked at yesterday. Suppose we find a controversial topic, and have the partitioned graph showing the two sides of the discussion, what can we do to bring them together? Adding edges (bridges) between users on opposite sides of the debate would reduce the controversy, and we can calculate (by looking at the impact on the RWC score) which edges we can add to have the greatest effect. Recall however that the graph is based on retweets and follows, so to ‘add an edge’ we need not only to make recommendations (e.g., to a recommendation to follow someone on the other side of the debate), but we also need users to act on those recommendations…

Clearly, some bridges are more likely to materialize than others. For instance, people in the ‘center’ might be easier to convince that people on the two extreme ends of the political spectrum. we take this issue into account by modeling an acceptance probability for a bridge as a separate component of the model.

Let’s ignore the acceptance probability issue temporarily so that we can focus on the simpler problem to start with. Graphs of controversial topics often have a structure that resembles star-graphs – a small number of highly popular vertices receiving incoming.

Connecting the centres of the two stars provides the maximum reduction in RWC score. More generally, the controversy reduction algorithm considers only edges between high-degree vertices on each side. For each of these candidate edges, the RWC score is calculated with the edge added. The k edges that lead to the lowest score when added to the graph individually then become the recommendations.

Adding back in the notion of acceptance probability, we need to calculate the expected RWC score for adding an edge, under some probabilistic model providing the probability the edge will be accepted once recommended.

We build such an acceptance model on the feature of user polarity [that we looked at yesterday]. Intuitively, this polarity score of a user, which takes values in the interval [-1,1], captures how much the user belongs to either side of the controversy… We employ user polarity as feature for our acceptance model because, intuitively, we expect users from each side to accept content from different sides with different probabilities, and wes assume these probabilities are encoded in, and can be learned from, the graph structure itself.

Let $P_u$ be the polarity score of user u (we’ll come back to how that is calculated in a moment). Then the probability $p(u,v)$ that user u accepts a recommendation to connect to user v is given by:

$N_{endorsed}(P_u,P_v)/N_{exposed}(P_u,P_v)$

That is, the fraction of times a user with polarity ~$P_v$ actually endorsed content generated by a user with polarity $P_u$ after being exposed to it. An underlying assumption is that if v follows u then v is exposed to all content from u. The expected decrease in RWC for adding an edge $(u,v)$ is then simply $p(u,v)$ times the decrease in RWC that will be seen if the edge actually is added.

So how do we calculate the polarity $P_u$ of user u? Starting with the vertex for u, we determine the expected number of time steps for a random walk to arrive at a high degree vertex in partition X ($l_{u}^{X}$), and again for partition Y ($l_{u}^{Y}$). Do this for all other users $u'$ as well.

Let $\rho^{X}(u)$ be the fraction of other vertices $u'$ for which $l_{u}^{X} < l_{u'}^{X}$, and $\rho^{Y}(u)$ be the fraction of other vertices $u'$ for which $l_{u}^{Y} < l_{u'}^{Y}$.

Now $P_u = \rho^{X}(u) - \rho{Y}(u) \in [-1,1]$.

We can compute the top-k edges under this model using Fagin’s algorithm: create two ranked lists of edges (u,v), one ranked by decreasing RWC as per the simplified algorithm, and one ranked by decreasing probability of acceptance. “Fagin’s algorithm parses the two lists in parallel to find the edges that optimize the expected decrease E(u,v).

Instead of having to recompute RWC from scratch each time we want to calculate how much it will decrease when adding an edge, the authors show a way to incrementally compute it – this provides an order of magnitude speed-up in the algorithm runtime.

Case study

In order to provide qualitative evidence on the functioning of our algorithms on real-world datasets, we conduct a case study on three datasets… We can verify that the recommendations we obtain are meaningful and agree with our intuition for the proposed methods…

In the table below, the ROV row represents the recommendations without acceptance probability taken into account, and ROV-AP the recommendations that do consider acceptance probability.

For example, for obamacare ROV recommends edges from mitromney to barackobama, and from barackobama to paulryanvp. Even though these edges indeed connect opposing sides, they might be hard to materialize in the real world. This issue is mitigated by ROV-AP which recommends edges between less popular users, yet connects opposing viewpoints.

You have it in your power to truly differentiate your platform and make the world a better place by implementing controversial topic and filter bubble detection (per the paper we looked at yesterday), together with letting users see their polarity score (per today’s paper) and making controversy reducing / filter-busting follower recommendations (also per today’s paper). This would be something new and unique in the world of mass media consumption, and could help to make Twitter great again.

Quantifying controversy in social media Garimella et al. WSDM 2016

Last week I had the pleasure of dropping in on WSDM (Web Search & Data Mining) 2017 for the conference start-up day in Cambridge (UK). This week I’ll be bringing you a selection of papers from the conference, but to kick things off I’m going back to WSDM 2016…

A couple of weeks ago I realised that since Trump’s election, my twitter feed had been composed entirely of negative tweets about Trump. Asking on twitter “how bad is my filter bubble?” led to some recommendations for additional people to follow to add some balance. Even if you don’t agree with the opposing position, the first basis for moving forward is to acknowledge and attempt to understand it – and you can’t do that if you never even see it! In such a blatant case, it was easy to spot there must be a filter bubble effect, but how often does the news bias go unnoticed? In “Quantifying controversy in social media,” Garimella et al. give us a method for identifying topics around which there is controversy, which is the first step in realising there might be another side to the story.

… we assume that controversial and polarized topics induce graphs with clustered structure, representing different opinions and points of view. This assumption relies on the concept of “echo chambers” which states that opinions or beliefs stay inside communities created by like-minded people, who reinforce and endorse the opinions of each other. This phenomenon has been quantified in many recent studies.

The research is carried out in the context of Twitter, but in theory can be applied to any social graph structure. A topic is simply defined as a query, often a hashtag. Given a query, we can build a conversation graph with vertices representing users, and edges representing activity and interactions between users. Using a graph partitioning algorithm, we can then try to partition the graph in two. If the partitions separate cleanly, then we have a good indication that the topic is controversial and has polarised opinions.

Our findings on controversy have many potential applications on news-reading and public-debate scenarios. For instance, quantifying controversy can provide a basis for analyzing the “news diet” or readers, offering the chance for better information by providing recommendations of contrarian views, deliberating debates, and connecting people with opposing opinions.

There are three stages in the process: (i) building the conversation graph so that it includes features useful for detecting controversy, (ii) partitioning the graph, and (iii) determining a controversy measure from the resulting partitions. In the paper, the authors explore several alternatives for graph construction and controversy measures – I’m going to focus just on the ones that proved to work the best!

Graph building

Twitter hashtags are used as seed queries (e.g. #beefban), but we can’t define a topic just by a single hashtag…

For instance, the opposing sides of a controversy might use different hashtags, as the hashtag itself is loaded with meaning and used as a means to express their opinion. Using a single hashtag may thus miss part of the relevant posts… Given a seed hashtag, we define a topic as a set of related hashtags, which co-occur with the seed hashtag.

Unweighted co-occurrence will tend to pick up popular (but neutral to the topic) hashtags such as #ff. A topic is therefore defined by the seed hashtag and the top-k co-occurring hashtags as scored by a weighted similarity measure (k=20 in the experiments).

Specifically, we compute the document frequency of all hashtags on a random 1% sample of the Twitter stream, and normalize the original similarity score between two hashtags by the inverse document frequency.

Here are the top-20 hashtag topic clusters for #baltimoreriots and #netanyahuspeech:

Even inspecting these hashtag clusters gives a sense of the controversy within those topics.

From the tweets containing these hashtags we build a graph where each tweeting user is represented by a vertex. Edges are created between users based on two criteria:

• If user A retweets user B, then this is seen as signalling an endorsement of the opinion expressed in the original tweet by propagating it further. Note that user A does not need to be following B in order to retweet a tweet from user B.
• If user A and user B both employ a given hashtag (from the topic cluster), and A follows B, then an edge is created from A to B. “We stress that the graph G built with this approach is topic-specific, as the edges in G are constrained to connections between users who discuss the topic that is specified as input to the pipeline.

The authors also experimented with created edges based on content similarity (e.g. posting links to the same URL), but these were found not to be reliable predictors of controversy in the evaluation.

Graph partitioning

No need to introduce any novelty here. The authors use the standard METIS algorithm and ask it to create two partitions. Here are visualisations of the results for: #beefban, #russia_march (both controversial) and #sxsw and #germanwings (not controversial). The top row shows these partitioned on retweets, and the bottom row by followers, indicating that both of these contribute to the predictive power.

Controversy scores

Of several evaluated measures to determine an overall controversy score, random walk based measures were found to work the best. In a controversial discussion, there are likely to be authoritative users on both sides, as evidenced by a large degree in the graph. A large (high) degree means that a user has received a large number of endorsements on the topic. The random walk controversy measure captures how likely it is that a given user will be exposed to authoritative content from the other side. Let the two partitions created by METIS be called X and Y.

We define the Random Walk Controversy (RWC) measure as follows. Consider two random walks, one ending in partition X and one ending in partition Y, RWC is the difference of the probabilities of two events: (i) both random walks started from the partition they ended in and (ii) both random walks started in a partition other than the one they ended in.

$RWC = P_{XX}P_{YY} - P_{XY}P_{YX}$

RWC will be close to one when the probability of crossing sides is low, and close to zero when the probability of crossing sides is comparable to that of staying on the same side. We can therefore view it as the probability that the topic is controversial (polarising).

RWC can be calculated using Monte Carlo sampling, but collecting a large number of samples (e.g., 10,000) is computationally expensive. A variant of RWC called random walk with restart was found to have the same predictive power as RWC, while being much cheaper to compute. RWC with restart (RWR) avoids the problem of getting stuck in dangling vertices (which are common in the star-like graphs created when a few authoritative users generate information that spreads through the graph). A walk that starts in X will be restarted from a random vertex in X when it reaches a dangling vertex (a vertex with no outgoing edges). Moreover, high-degree vertices have all of their outgoing edges removed so that walks restart upon reaching them.

In tests, RWR and RWC gave almost identical results, but RWR is up to 200 times faster.

It is also possible to define a controversy score for an individual user, in the range -1 to 1. 0 represents a balanced user, and scores of +/- 1 represent the extremes for both sides. Two measures for this are proposed – both based around the idea of starting random walks at the user and seeing either which partition the walk ends up in, or how long it takes to hit high-degree vertices on either side.

Quantifying controversy in the wild

There is a good evaluation in the paper on previously available datasets which enables comparisons with alternative methods (RWC/R does better). The authors also use the system on live Twitter data:

To check whether our system works in a real-world setting, we deploy it in the wild to explore actual topics of discussion on Twitter and detect the ones that are controversial. More specifically, we obtain daily trending hashtags (both US and worldwide) on the platform for a period of three months (June 25 – September 19, 2015). Then, we obtain all tweets that use these hashtags, and create retweet graphs. Finally, we apply the RWC measure on these conversation graphs to identify controversial hashtags. The results can be explored in our online demo.

Most hashtags are not controversial. Controversial hashtags identified by the system include #whoisburningblackchurches, #communityshield, and #nationalfriedchickenday (the latter being a debate between meat lovers and vegetarians).

Application

Now that we know how to detect controversy, wouldn’t it be great if tweets in your timeline with a controversy score over some threshold were automatically flagged as such? And furthermore, if your feed is then dominated by tweets only from one partition of a topic over some time window, a filter bubble warning is shown? This would handle the awareness part of the problem. The second part of the problem is introducing balance back into your feed. Tomorrow we’ll look at some follow-on work from WSDM 2017 that can help with that…