Scoring news stories is hard

Monday Note, Oct 30, 2017

by Frederic Filloux

A crucial part of the News Quality Scoring Project is labeling the dataset, i.e., thousands of news articles. The process will be partly automated and will also rely on collaborative filtering. (Part of a series about NQS.)

In a machine learning process, the more data, the better. By data, I mean labeled data: “this is a cat, a dog, a lamp,” etc. If you build an image recognition model, go to ImageNet. It’s a repository of more than 14 million images covering a lot of things and arranged in nearly 22,000 “synonym sets,” or “synsets.” It is mostly used for research and funded by Stanford, Princeton, Google, and A9 (Amazon's engine for its product and visual search, as well as advertising). For the sake of this column, I entered the term “coffeepot,” and ImageNet returned a grid of labeled coffeepot images.
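
To make “labeled data” concrete, here is a minimal Python sketch of what an image-classification training set boils down to once the labeling work is done; the synset IDs, file names, and labels below are placeholders for illustration, not ImageNet’s actual identifiers or API.

```python
from collections import Counter

# Hypothetical synsets: an ID grouping synonymous labels, the way ImageNet
# organizes its images (IDs here are made up for illustration).
synsets = {
    "coffeepot-01": ["coffeepot", "coffee pot"],
    "dog-01": ["dog", "domestic dog"],
    "cat-01": ["cat", "domestic cat"],
}

# Labeled examples: the raw pixels are useless to a classifier until a human
# (or a crowd) has attached a label to each file.
labeled_examples = [
    ("img_0001.jpg", "coffeepot-01"),
    ("img_0002.jpg", "dog-01"),
    ("img_0003.jpg", "coffeepot-01"),
]

# A model only ever sees (input, label) pairs; counting examples per class
# is the first sanity check before training.
print(Counter(label for _, label in labeled_examples))
```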

Whether you build a classifier for animals, physical objects, or medical imaging, the number of labeled images required is usually in the hundreds of thousands (an emerging field involves creating “synthetic” or “artificial” datasets to drastically reduce the volume theoretically required; we'll put it aside for now).

My interest lies in data that are inherently complex to classify because of their fuzzy and unstable nature, such as news.

Hence my research on Netflix recommendation engines (a subject touched on last week). When, in 2006, Netflix launched a competition to improve the performance of its recommendation system, it provided a training set of 100 million ratings, given by 481,000 users on 18,000 movies. That was a pretty robust dataset: each movie was rated by… 5,000 users on average. The dataset, however, presented a high variance: while the average user rated 200 movies, some gave as few as 3 ratings, and one single user rated 17,000 movies!
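
As a quick back-of-the-envelope check, those averages follow directly from the headline figures; a tiny Python sketch, using the rounded numbers quoted above:

```python
# Rounded Netflix Prize figures as quoted above.
ratings = 100_000_000
users = 481_000
movies = 18_000

print(round(ratings / movies))  # ~5,556 ratings per movie, in line with the ~5,000 average quoted
print(round(ratings / users))   # ~208 ratings per user, the "average user rated 200 movies"

# The variance is the point: a mean of ~200 ratings per user hides a spread
# from 3 ratings for some users to 17,000 for a single one.
```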

With such an extreme example in mind, let’s see how we can realistically label news stories.

First, while the news inventory is nearly infinite, there is no way we can get even close to Netflix's numbers. In 2006, when it made the dataset available to the Prize contestants, Netflix had 6 million subscribers, all of them in the United States (it has 109 million worldwide today). This means the scoring was performed by 8% of the users at the time, a fairly high number. But:

  • One, for Netflix, audience structure and engagement can’t be compared to what digital news media can expect. People who pay 9 dollars a month for their video streaming service are heavier users and therefore more inclined to leave a rating and/or a comment.
  • Two, it is intuitively more acceptable to rate a movie right after viewing it, while the end credits are rolling, than to ask readers who pore over their news pages at breakfast time or on a bus to score articles on the fly.
  • Three, a score collected that way would not make much sense anyway. In the case of the News Quality Scoring project, the ratings are not meant to be publicly visible. One reason is that journalists might deem it unacceptable, especially since most are rarely given the opportunity to write in-depth pieces that are true differentiators in a stream dominated by commodity news.
    The NQS scoring is aimed at assessing the value added by a media outlet to a given piece of news coverage in terms of resources, expertise, thoroughness of process, and ethical guarantees. This score is to be used by the media to improve their search and recommendation engines (see the previous Monday Note) or to improve revenue (more on this later).
  • Four, rating a film is a single-criterion process (users give zero to five stars; they don’t evaluate the components of a film); classifying news involves multiple criteria, at least from the perspective of building a training set for a machine learning algorithm.

This leads to several additional questions:

  • Who should test stories? Random people, Mechanical Turk workers, or a targeted set of people somehow connected to the industry?
  • What should be the subjective criteria — assuming that each time you add one, you lose a sizable number of testers?
  • How should stories be presented: in their context or in a barebone version (see below)?

Already, preliminary tests reveal multiple biases that significantly affect the overall perception of a story. For instance:

  • Visual context: whether the page is neat or cluttered, or the text is perceived as light or very dense (line and paragraph spacing, line length, typefaces).
  • Hyperlink density on the page and the placement of additional reading suggestions (and their origin: third-party or publisher recommendations).
  • Subject proximity to the reader (which is different from affinities such as politics, science, sport); in many cases, editing impacts the proximity/appeal of a piece.
  • Headline structure (length, readability, tonality) is a hugely important indicator of the status of a piece.
  • How authors are displayed is also critical. A bare byline sends the wrong signal compared to one linked to the author’s biography and access to her body of work. That is why The Trust Project’s upcoming standard is so important for the qualitative evaluation of the news flow.

Obviously, this needs to be confirmed on a larger scale. For the NQS project, we are working on a testing interface that will be released in the coming weeks. It will mostly be directed at people involved in the news business, as we do not intend to offer an incentive to those who will volunteer to participate. There should be no more than five subjective criteria (a larger set of quantifiable signals is processed elsewhere). And we decided that stories will be presented in a stripped-down version to avoid distractions and visual bias.
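
To make this concrete, here is a speculative Python sketch of what a single collected evaluation could look like under a cap of five subjective criteria; the criteria names and the naive averaging are placeholder assumptions, not the actual NQS methodology.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class StoryEvaluation:
    """One volunteer's evaluation of one story, shown in stripped-down form."""
    story_id: str
    scores: dict = field(default_factory=dict)  # criterion name -> 1-to-5 rating

    def overall(self) -> float:
        # Naive unweighted aggregate, purely for sanity-checking collected data.
        return mean(self.scores.values())

# Hypothetical criteria names; the real NQS list is not specified here.
evaluation = StoryEvaluation(
    story_id="story-42",
    scores={
        "thoroughness": 4,
        "expertise": 5,
        "sourcing": 4,
        "originality": 3,
        "clarity": 4,
    },
)
print(evaluation.overall())  # 4.0
```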

Be part of NQS’s upcoming evaluation tests: register here ◀︎

frederic.filloux@mondaynote.com
