The Rise of Unconventional Data
One of the lesser understood aspects of what you can do with massive stockpiles of data is the ability to use data that would traditionally have been overlooked or in some cases even considered rubbish. This whole new category of data is known as "exhaust" data—data generated as a by-product of some other process.
Much financial market data is a result of two parties agreeing on a price for the sale of an asset. The record of the price of the sale at that instant becomes a form of exhaust data. Not that long ago, this kind of data wasn’t of much interest, except to economic historians and regulators.
A massive, moment-by-moment archive of the sale prices of shares and other securities is now key to many major banks and hedge funds as a "training ground" for their machine-learning algorithms. Their trading engines "learn" from that history, and this learning now powers much of the world's trading.
Traditional transactions such as house price sales history or share trading archives are one form of time-series data, but many other less conventional measures are being collected and traded too.
There are also other categories of unconventional data that are not time-series-based. For example, network data outlines relationships and other signals from social networks, geospatial data lends itself to mapping, and survey data concerns itself with people’s viewpoints. Time series or longitudinal data is, however, the most common form and the easiest to integrate with other time-series data.
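The reason time-series data integrates so readily is that any two series, however unrelated their subject matter, share a common key: the timestamp. A minimal sketch of that idea, using invented foot-traffic and sales figures (all numbers and dates here are hypothetical):

```python
from datetime import date

# Two unrelated daily series become easy to combine once both are
# keyed by date (values below are invented for illustration).
foot_traffic = {date(2024, 1, 1): 120, date(2024, 1, 2): 95, date(2024, 1, 3): 140}
daily_sales = {date(2024, 1, 2): 4300.0, date(2024, 1, 3): 6100.0, date(2024, 1, 4): 5800.0}

def align(series_a, series_b):
    """Inner-join two time series on their shared timestamps."""
    common = sorted(set(series_a) & set(series_b))
    return [(t, series_a[t], series_b[t]) for t in common]

aligned = align(foot_traffic, daily_sales)
# Only Jan 2 and Jan 3 appear in both series, so only those rows survive.
```

Real pipelines handle mismatched sampling rates and missing values, but the shared time index is what makes the join possible at all.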
Location data from mobile phones means many companies now have people-movement data. [Photo: via The Conversation, Flickr user Andrew Hyde]
Consistent Longitudinal Unconventional Exhaust Data or CLUE data sets, as I’m calling them, are many, varied and growing. They include:
foot traffic data
consumer spending data
satellite imaging data
biometrics
ecommerce parcel flow data
technology usage data
employee satisfaction data.
Visualisation of footfall data from the past nine years at Glasgow’s Tramway arts venue. Photo: via The Conversation, Flickr user Kyle Macquarrie
Say, for example, you are interested in the seasonal profitability of supermarkets over time. Foot traffic data may not be the cause of profitability, as more store visitors doesn't necessarily translate directly into profit or even sales. But it may be statistically related to sales volume, and so may be one useful clue, just as body temperature is one signal of a person's overall well-being. And when combined with massive amounts of other signals using data analytics techniques, this can provide valuable new insights.
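A first test of whether a clue like foot traffic is worth anything is simply to measure how strongly it moves with the quantity you care about. A sketch using Pearson correlation on invented weekly numbers (the figures below are hypothetical, chosen only to illustrate the calculation):

```python
import math

# Hypothetical weekly foot traffic and sales (in $'000) for one store.
traffic = [210, 180, 250, 300, 270, 190, 320]
sales = [5.1, 4.4, 6.0, 7.2, 6.3, 4.9, 7.8]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(traffic, sales)
# A high r says traffic is a useful clue about sales -- it says nothing
# about causation, which is exactly the caveat in the text above.
```

In practice a fund would fold such a signal into a model alongside many others, rather than trade on one correlation.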
RISE OF "QUANTAMENTAL" INVESTMENT FUNDS
Leading asset manager BlackRock, for example, is using satellite images of China taken every five minutes to better understand industrial activity and to give it an independent reading on reported data.
Traditionally, there have been two main types of actors in the financial world—traders (including high-frequency traders), who look to make money from massive volumes on many small transactions, and investors, who look to make money from a smaller number of larger bets over a longer time. Investors tend to care more about the underlying assets involved. In the case of company stocks, that usually means trying to understand the underlying or fundamental value of the company and future prospects based on its sales, costs, assets, and liabilities and so on.
Aerial photography from drones and new low-cost satellites is one key new source of unconventional data. [Photo: Flickr user BxHxTxCx]
A new type of fund is emerging that combines the speed and computational power of computer-based quants with the fundamental analysis used by investors: Quantamental. These funds use advanced machine learning combined with a huge variety of conventional and unconventional data sources to predict the fundamental value of assets and mismatches in the market.
Some of these new-style funds, including Two Sigma in New York and Winton Capital in London, have been spectacularly successful. Winton was founded in 1997 by David Harding, a physics graduate from Cambridge University. After less than two decades it ranks in the top 10 hedge funds worldwide, with US$33 billion in assets under advice and more than 400 people—many with PhDs in physics, math, and computer science. Not far behind, with US$30 billion in assets, Two Sigma also glistens with top tech talent.
New ones are emerging too, including Taaffeite Capital Management, run by computational biologist and University of Melbourne alumnus Professor Desmond Lun. Understanding the complex data dynamics of many areas of natural science, including biology and ecology, is turning out to be excellent training for understanding financial market dynamics.
WEIRD DATA FOR ALL
But it's not only the world's top hedge funds that can use, or are using, alternative data. A number of startups are on a mission to democratize access to new sources. Michael Babineau, cofounder and CEO of Bay Area startup Second Measure, aims to offer a Bloomberg-terminal-like service for consumer purchase data. This will transform massive amounts of inscrutable text in card statements into more structured data, making it accessible and useful to a wide business and investor audience.
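The core of that transformation is parsing semi-structured statement text into records a database or model can use. A minimal sketch with invented statement lines (the line format and merchant strings below are hypothetical; real card-statement formats vary widely and are far messier):

```python
import re

# Hypothetical raw card-statement lines (invented for illustration).
raw = [
    "03/14 AMZN MKTP US*2K4 SEATTLE WA        27.99",
    "03/15 STARBUCKS #1234 PORTLAND OR        5.40",
]

# date, then a description, then a run of spaces, then the amount.
LINE = re.compile(r"^(\d{2}/\d{2})\s+(.+?)\s{2,}(\d+\.\d{2})$")

def parse(line):
    """Turn one inscrutable statement line into a structured record."""
    m = LINE.match(line)
    if not m:
        return None  # unparseable lines are flagged, not guessed at
    date_str, desc, amount = m.groups()
    return {"date": date_str, "merchant": desc.strip(), "amount": float(amount)}

records = [parse(line) for line in raw]
```

Production systems add merchant-name normalization and entity matching on top (mapping "AMZN MKTP US*2K4" to a single company), which is where much of the real value lies.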
Other companies, like Mattermark in San Francisco and CB Insights in New York, are intelligence services that provide fascinating and valuable data insights into company "signals." These can be indicators and potential predictors of success—especially in the high-stakes game of technology venture capital investment.
Akin to Adrian Holovaty's pioneering work a decade ago mapping crime and many other statistics in Chicago online, Microburbs in Sydney provides a granular array of detailed data points on residential locations around Australia. It allows potential residents and investors to compare schooling, restaurants, and many other amenities in very specific neighborhoods within suburbs.
We Feel, designed by CSIRO researcher Cecile Paris, is an extraordinary data project that explores whether social media—specifically Twitter—can provide an accurate, real-time signal of the world’s emotional state.
We Feel is a research tool that creates "signals" data about the emotional mood of people around the world via their tweets.[Photo: via The Conversation, CSIRO]
WEIRD SMALL DATA HAS ITS BENEFITS AND ITS RISKS
More than simply pop-economics, Freakonomics (2005) showed how unusual yet good-quality data sources can be valuable in creating insights. Assiduous record-keeping for an honesty-system cookie jar in an office revealed that people stole most during certain holidays (perhaps due to increased financial and mental stress at those times); access to a drug gang's bookkeeping accounts explained why many drug dealers live with their grandparents (they are too poor to move out); and massive public school records from Chicago showed parental attention to be a key factor in students' academic success.
Many of the examples in Freakonomics were based on small quirky data samples. However, as many academics are aware, studies with small samples can present several problems. There’s the question of sampling—whether it’s large enough to represent a robust sample and whether it’s a random selection of the population the study aims to understand.
Then there's the problem of errors. While one might expect smaller data sets to contain fewer errors, a recent meta-study of academic psychology papers found half the papers tested showed significant data inconsistencies and errors. In a small number of cases this may be due to authors fudging the results; in others, to transcription or other simple mistakes.
WEIRD DATA IS GETTING EASIER TO FIND
More and more large-scale unconventional data collections are becoming readily available. There are three blast furnaces driving their proliferation:
the interaction furnace: our own growing interactions with the web and web services (e-commerce, webmail, social media, and so on).
the transaction furnace: the increasingly online ledger of commerce.
the automation furnace: an explosion of web-connected sensors.
While large data collections can’t help with avoiding fabrication, they can sometimes help with sample size and representation issues. When combined with machine learning they can:
provide accurate insights from incomplete, noisy, and even partially erroneous data.
offer associations, patterns and connections—blindly with no a priori assumptions.
help eliminate bias—by invoking multiple perspectives.
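The first point in the list above—accurate insights from noisy, partially erroneous data—is a direct consequence of scale. A synthetic sketch: a true linear relationship is buried under noise far larger than the signal at any single point, yet ordinary least squares over many observations recovers it. (All numbers here are simulated, not real market data.)

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

# A true linear signal, drowned in noise at the level of each observation.
true_slope, true_intercept = 2.0, 5.0
xs = [i / 100 for i in range(10_000)]
ys = [true_slope * x + true_intercept + random.gauss(0, 10) for x in xs]

# Ordinary least squares: averaging over 10,000 noisy points
# cancels most of the noise and exposes the underlying trend.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
# slope lands near 2 and intercept near 5, despite per-point noise
# with a standard deviation of 10.
```

The same averaging logic is why funds tolerate individually unreliable "clue" data sets: errors that are random wash out; only systematic errors (and outright fabrication) survive the aggregation.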