How We Are Exploring Mountains of Linked Data at BBC News Labs

First in a series on BBC News Labs’ data experiments

I was asked to join BBC News Labs a couple of weeks ago to work on a project that, when it was first briefly explained to me by email, left me clueless about what it involved. (Imagine my discomfort before the job interview with Matt Shearer, Innovation Manager at the Lab.)

The project is called #newsVane—and yes, we refer to it with the hash sign every time, don’t ask me why.

Making Use of Our Linked Data

Figure: the tagging interface. Here is what the tagging looks like on the Tanzanian ivory story.

The idea behind this project is that there are certain undiscovered opportunities in the data we have at our fingertips. Indeed, we have access to an extraordinary amount of data: the BBC-produced content, the BBC News archives, all the content produced by other publishers in the UK and around the world, and of course, all the data that we can use thanks to the internet (open government sources, FOI aggregators, custom feeds…).

But so far, it’s just news stories, like the hundreds we produce every day, and every news organisation has advanced monitoring tools of its own. The difference is that the BBC has spent the past few months working on a way to link its data together and make it more accessible and meaningful. We now have tools that let us explore, via simple APIs, a large proportion of the English-language news content published by the Beeb every day. And here is where the big fun begins: this content is tagged and referenced semantically.

Let me give you an example. As I write this, the article “Tanzania will not sell ivory stockpile, says minister” has just appeared on my monitor. It was published by BBC World News Africa and, as part of the semantic tagging scheme, was associated with appropriate tags.

As you can see, more information can still be added to the tags, but they give us a pretty decent idea of how this content relates to, or can be associated with, other pieces of content. Now, being able to explore the data across so many sources of content is gold, but gold we cannot use at the moment. And here’s my job in a nutshell: to find the useful patterns in this vast amount of data and connections, to turn over rocks and see what’s under them.

Tools of the Trade

Although we have many tools available, I am focusing at the moment on Juicer, our linked data prototyping platform. Juicer provides automatic semantic annotation of published BBC News articles, drawing on a database of 650,000+ articles and 150,000+ tags. The job is elegantly done in four steps: extracting the named entities from the raw articles, matching them with DBpedia concepts (DBpedia is a tool worth exploring in its own right), making the tags available for SPARQL querying via a RESTful API, and finally providing a UI where users can add their own tags.
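To make the first two steps more concrete, here is a minimal sketch in Python. It is not how Juicer is implemented: real named-entity extraction is far more sophisticated than this word lookup, and the `CONCEPTS` table and the `tag_article` function are hypothetical stand-ins for a proper DBpedia matching service. The two DBpedia URIs are only there for illustration.

```python
import re

# Hypothetical lookup table standing in for a DBpedia matching service.
# Keys are lowercase surface forms; values are (label, DBpedia URI) pairs.
CONCEPTS = {
    "tanzania": ("Tanzania", "http://dbpedia.org/resource/Tanzania"),
    "ivory": ("Ivory", "http://dbpedia.org/resource/Ivory"),
}


def tag_article(text):
    """Return (label, uri) pairs for every known concept found in the text.

    Assumption: a naive word-by-word lookup, used here in place of real
    named-entity extraction.
    """
    tags = []
    for word in re.findall(r"[a-z]+", text.lower()):
        concept = CONCEPTS.get(word)
        if concept is not None and concept not in tags:
            tags.append(concept)
    return tags


headline = "Tanzania will not sell ivory stockpile, says minister"
tags = tag_article(headline)
for label, uri in tags:
    print(label, uri)
```

Once tags are attached to articles as URIs like these, the articles can be queried by concept rather than by keyword, which is what the SPARQL step makes possible.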

I can then ask Juicer to return, for example, 30 articles mentioning places within 25 miles of Chester, or articles about Conservative politicians published in the UK Politics section, or articles about companies in the aerospace industry.
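The "within 25 miles of Chester" kind of query comes down to a distance filter over tagged places. How Juicer actually does this is not something I can show here, so the sketch below is only an assumption: it uses the standard haversine formula, approximate coordinates, and a hypothetical `places_within` helper.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Approximate coordinates, for illustration only.
CHESTER = (53.19, -2.89)
PLACES = {
    "Liverpool": (53.41, -2.99),
    "Manchester": (53.48, -2.24),
    "Wrexham": (53.05, -3.00),
    "London": (51.51, -0.13),
}


def places_within(centre, places, max_miles):
    """Hypothetical helper: names of places no further than max_miles from centre."""
    return sorted(
        name
        for name, (lat, lon) in places.items()
        if haversine_miles(*centre, lat, lon) <= max_miles
    )


nearby = places_within(CHESTER, PLACES, 25)
print(nearby)
```

Articles tagged with any of the returned places would then be candidates for the result set.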

Figure: Juicer process flow. Everything you ever wanted to know about Juicer, from our blog post about it.

Here’s a sample of the API request that will return a nice JSON response for the last example.


Quite simple, eh?
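For readers who want to see the general shape of such a request, here is a rough Python sketch. Everything specific in it is a placeholder: the host, the path, and the parameter names (`industry`, `limit`, `format`) are assumptions made for illustration and should not be read as the real Juicer API.

```python
from urllib.parse import urlencode

# Placeholder endpoint: NOT the real Juicer URL.
BASE_URL = "https://juicer.example.invalid/articles"


def build_request_url(filters, limit=30, fmt="json"):
    """Build a query URL from a dict of hypothetical filter parameters."""
    params = dict(filters)
    params["limit"] = limit
    params["format"] = fmt
    # Sort the parameters so the URL is stable and easy to compare.
    return BASE_URL + "?" + urlencode(sorted(params.items()))


# Articles about companies in the aerospace industry (hypothetical filter name).
url = build_request_url({"industry": "Aerospace"})
print(url)
```

The point is only that one filtered GET request, with the filter expressed as a tag or concept, is enough to get structured results back.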

Eventually, my research and calls to friends led me to another smash-my-head-against-keyboard moment: machine learning. Don’t misunderstand me, I do find this field fascinating and I admire the people who are literate in it. But, you know, I failed to teach myself C++ when I was a kid, so I have a hard time dealing with actual clever programming.

We were thinking about having a list of trending topics (in the strict sense of the term), but the question is: how do we group similar articles together? Matt’s words from my first days echoed in my head, and I looked into “pattern recognition,” and in particular k-nearest neighbour and k-means clustering.
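As a rough idea of what k-means could look like on tagged articles, here is a small sketch in plain Python. It is an illustration under stated assumptions, not our prototype: the articles and their tags are made up, each article is turned into a 0/1 vector over the tag vocabulary, and the starting centroids are picked by hand so the result is repeatable (a real run would usually pick them at random).

```python
def to_vector(tags, vocabulary):
    """Encode a set of tags as a 0/1 vector over the vocabulary."""
    return [1.0 if term in tags else 0.0 for term in vocabulary]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors, initial_centroids, max_iterations=20):
    """Return a cluster index for each vector (plain Lloyd's algorithm)."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda i: squared_distance(v, centroids[i]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:
                centroids[i] = mean(members)
    return assignments


# Made-up articles and tags, for illustration only.
articles = [
    ("ivory-1", {"Tanzania", "Ivory", "Wildlife"}),
    ("ivory-2", {"Kenya", "Ivory", "Wildlife"}),
    ("politics-1", {"Conservative Party", "UK Politics"}),
    ("politics-2", {"Conservative Party", "UK Politics", "Budget"}),
]
vocabulary = sorted(set().union(*(tags for _, tags in articles)))
vectors = [to_vector(tags, vocabulary) for _, tags in articles]

# Hand-picked starting centroids (articles 0 and 2) to keep the run repeatable.
labels = k_means(vectors, [vectors[0], vectors[2]])
print(list(zip((name for name, _ in articles), labels)))
```

Articles sharing most of their tags end up in the same cluster, which is the grouping a trending-topics list would need. The hard part in practice is choosing the number of clusters and a sensible representation of each article.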

Given what I already knew, the ideal for me would be to do everything directly in JavaScript. A good reference for that is Burak Kanber’s blog articles about machine learning in JavaScript. Also worth a look are KNIME, an open-source analytics workbench, and jStat, a JavaScript statistical library. And I do need to dip my toes into Carrot2, an open-source search results clustering engine, which looks really interesting. All of these are open-source projects, partly because of my beliefs and partly because I don’t want to bother my boss with money problems.

I can name a couple of reading resources that put my experiments on track:

The first one is an academic paper from the University of Konstanz, Germany, called Incremental Visual Text Analytics of News Story Development (PDF link). It is a nice read, and not that difficult to get your head around, which explores how news stories develop over a given period of time.

Figure: UI for analyzing the evolution of a news story, discussed in Incremental Visual Text Analytics of News Story Development (PDF link)

Experiment, Refine, Repeat

In the upcoming months, we will be prototyping and experimenting with various data sources.

My first priority, as I am a journalist first and a hacker second, is to prototype a tool useful to journos: the kind of app they’d launch first thing in the morning. As of today, the prototype is a dashboard gathering several kinds of information, from trending topics in the news at a given time to live analytics of our websites. I’m putting together a demo with Bootstrap to use for a one-man alpha test. Hopefully, the project will be refined many times and people will punch holes in the idea so we can move towards an increasingly attractive concept to present to our newsroom.

This project is incredibly exciting for me, as it is pure exploration of unknown territory. I am trying to follow my instincts and find ways to bring data sources together to see if we can surface patterns or useful information, and to probe ways of making the information more meaningful by delving into its connections and co-occurrences. I can’t overstate how hard this project is pushing my own knowledge, from my skills in front-end development to the frontiers of machine learning and data mining.

So far, I am focused on these main concerns and ideas:

  • What if the tool could show what is “hot news” at the moment and suggest to the journos relevant data sources to work on it?
  • What if we could better monitor other news publishers and follow the original angles they develop once a story has broken?
  • What if we could use the vast semantic engine to do some meta-journalism and observe the patterns in the news coverage itself?

In the upcoming weeks, we will present the first results of these experiments and seek feedback from the community, as well as talented people to get involved in the project.

You can follow our latest news at @BBC_News_Lab, and find me at @basilesimon.




