Five data lies that need to die … now streaming on Netflix

Let’s rewind the clock 25 years. Back then, the trendy company was Walmart and the trendy topic was supply chain management. You couldn’t throw a rock in the business section of the Wall Street Journal without hitting a journalist waxing philosophical about how the company was “reinventing retail” through ruthless supply chain efficiency. But it didn’t take long before those articles turned negative. By the early 2000s, Walmart was “destroying Main Street” and bullying suppliers.

Leaders who followed the pundits’ whipsawing advice – that supply chain would solve all their problems, or that ruthless supply chain management led to unsustainable relationships – largely wasted time and money. What could your small business take from Walmart’s strategy? Probably very little, but it made for a good story.

Trendy companies and fashionable opinions come and go, but the pattern remains the same: The articles are meant to tell good stories to drive increased readership. They rarely provide sound and actionable advice.

“Netflix” is simply the latest trendy company and “data” is simply the latest fashionable topic. The innumerable stories about the transformative power of the Netflix algorithm may make for good reading, but they aren’t necessarily good advice about how to use data.

Let’s have a look at the recent punditry and unmask the storytelling masquerading as advice.

Data Lie #1: Our company (or strategy, or marketing, or product) is data driven.

In this column on the website of Neil Patel (who should know better), the author explains the multiple ways Netflix uses the data it gathers from its 130 million subscribers: to refine suggestions for other content you might like to watch, to measure viewer engagement (when you watch, and for how long), and even to predict attrition rates. What’s more, Netflix can use detailed viewer history (including stop points) to improve content development by providing valuable, real-time feedback to content creators.

That’s all fine. Here’s where it goes wrong. The column then quotes a few Netflix data geeks who – no surprise – were willing to highlight their successes: “Orange is the New Black” and “House of Cards” as Netflix content investments, and “The Dark Knight” as a licensing coup. At the time, with only a $7.99 per month subscriber fee, Netflix was “smart about their decisions” and “took full advantage of their analytics.”

The implication, of course, is that other companies simply ignore their data and make decisions based on gut instinct. If you would only be “data driven” like Netflix, then you also would have that success. By that logic, if I were to wear Michael Jordan’s shoes, I would be able to dunk like Mike. (Trust me, new shoes would not be enough.)

Data is simply another asset. With 130 million subscribers and data on billions of television hours watched, if Netflix were not using its assets appropriately, its shareholders should fire its management team. It is no different than a farmer leaving the tractor in the barn and plowing the field with a shovel. Additionally, most businesses (especially small businesses) do not have access to the rich stores of “big data” that would allow for that scale of sophisticated analysis. A concrete contractor with a dozen active customers isn’t likely to see many benefits from even a historical statistical analysis. The key asset for that type of business is its relationships, not its data.

If you read the comments from Netflix officials carefully, even they understand the limits of their own data, huge though it may be. Data helps make content decisions; it does not drive them.

Netflix is not a data driven company; Netflix is an entertainment driven company.

Data Lie #2: Data can predict the future.

In this article in Forbes, the author falls victim to classic hindsight bias. He fawns over the decision by Netflix executives to invest $100 million in “House of Cards” with no script, no pilot, and no plan – relying on its “algorithms” that predicted success based on Kevin Spacey’s appeal, remnant fans of the British series, and the “subject matter.” He then decides to spin the wheel of cognitive biases again, this time landing on “cherry picking,” with a similar process used to select Sandra Bullock’s thriller “Bird Box.”

The implication is that an “algorithm” can make terrific decisions just as well as Hollywood executives could. Creativity isn’t necessary. All you need is enough data.

When you look in the rearview mirror, of course you can find examples of success. And because Netflix is notoriously secretive with its viewership data, of course you will see only the successful experiments. But without seeing the projects the algorithm backed that went on to fail, you cannot claim that the algorithm can predict the future. A more thoughtful article would have asked to compare the results of multiple investments chosen by the same algorithm. It would then compare those results to expert analysis, as well as to random guessing.
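The comparison described above can be sketched in a few lines. This is only an illustration: the titles, the picks, and the hit-or-flop outcomes are made-up placeholders, not Netflix data, and the function names are hypothetical. The point is that an algorithm's track record means little until it is set beside an expert's track record and the success rate you would get by picking at random.

```python
# All titles, picks, and outcomes here are made-up placeholders for illustration.
# True means the project was a hit, False means it flopped.
outcomes = {
    "show_a": True, "show_b": False, "show_c": True, "show_d": False,
    "show_e": False, "show_f": True, "show_g": False, "show_h": False,
}
algorithm_picks = ["show_a", "show_b", "show_c", "show_d"]  # hypothetical
expert_picks = ["show_a", "show_f", "show_e", "show_g"]     # hypothetical


def hit_rate(picks, outcomes):
    """Fraction of the picked titles that turned out to be hits."""
    return sum(outcomes[title] for title in picks) / len(picks)


def random_baseline(outcomes):
    """Expected hit rate when picking at random: the overall share of hits."""
    return sum(outcomes.values()) / len(outcomes)


algorithm_rate = hit_rate(algorithm_picks, outcomes)
expert_rate = hit_rate(expert_picks, outcomes)
baseline_rate = random_baseline(outcomes)

print(f"algorithm: {algorithm_rate:.3f}")
print(f"expert:    {expert_rate:.3f}")
print(f"random:    {baseline_rate:.3f}")
```

Note that the calculation needs the flops as well as the hits. If only the successes are ever published, none of these three numbers can be computed.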

Bluntly, betting on star power, fans of a genre, and timely subject matter is not magic. Hollywood executives have been doing it for 100 years. If you read Netflix executives’ actual statements, they say as much.

No, Netflix algorithms cannot predict a show’s success. Netflix uses its data as an input to executive decisions, as any smart company would.

Data Lie #3: Data should make my decisions.

Finally, a real statistician to help us chew through this one! Roger Peng does a better job explaining the Wall Street Journal article than its authors do. Peng describes the situation Netflix faced when advertising “Grace and Frankie” starring Jane Fonda and Lily Tomlin. In its testing, Netflix discovered that more people clicked on the promotional image when it included only Tomlin, and not Fonda.

Apparently, the data team argued its case, but Netflix executives decided to use the poorer-performing image because they did not want to damage the company’s relationship with a big star. And if Fonda felt miffed, data be damned, she could go to Disney’s upcoming streaming service instead.

The implication is that “data” and not “egos” should make the decisions because egos are flawed, subjective measures.

But here is where WSJ falls flat and Peng shines. Peng calls out the flaws in the “data makes the decision” assumption that permeates the WSJ article. The data is unlikely to be able to account for all of the variables. Like any good analysis, it makes a narrow conclusion based on a wide sample of data. In this case, it sampled a large segment of viewers with a choice between two promotional images. Yes, more people clicked on one image than another – probably by a statistically significant margin – but was it really Jane Fonda that made the difference? Was it something else about the photo? What kind of regression analysis determined that it was Fonda, and not some other factor, that drove the choice?
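To make concrete what a click test of this kind can and cannot say, here is a minimal sketch of a two-proportion z-test using only Python's standard library. It is not Netflix's method, which is not public. The function name and the click counts in the usage lines are hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference between two click-through rates.

    Returns (z, p_value). Assumes samples large enough for the normal
    approximation to hold.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that both images perform equally.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: image A (one star) versus image B (both stars).
z, p_value = two_proportion_z_test(1200, 20000, 1000, 20000)
print(f"z = {z:.2f}, p = {p_value:.6f}")
```

A small p-value says only that the two images very likely perform differently. It says nothing about why, which is exactly the gap Peng points out.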

Critically, even if the data are clear, the designers of the experiment are human. That means humans decide what counts in the analysis, and to what degree, even if they are not conscious of it. (Don’t get me started on so-called “learning” algorithms. They often do as much to amplify biases as to dispel them.) What’s more, the data scientists are unlikely (Peng’s argument, and I agree) to have included a factor for the Net Present Value (NPV) of the ongoing relationship with Jane Fonda, because, as we’ve already seen in Data Lie #2, data cannot predict the future.
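For readers unfamiliar with the term, NPV discounts each expected future cash flow back to today's value and adds them up. A minimal sketch follows, with a hypothetical function name and the assumed convention that the first cash flow arrives one period from now.

```python
def net_present_value(rate, cash_flows):
    """Sum of future cash flows, each discounted back to today.

    Assumes cash_flows[0] arrives one period from now, cash_flows[1]
    two periods from now, and so on. `rate` is the per-period discount rate.
    """
    return sum(
        cash_flow / (1 + rate) ** period
        for period, cash_flow in enumerate(cash_flows, start=1)
    )


# Hypothetical inputs: two yearly cash flows discounted at 10 percent.
print(round(net_present_value(0.10, [100, 110]), 2))
```

The arithmetic is trivial. The hard part is the inputs: someone has to estimate what a relationship with a star will be worth in each future year, and no click test supplies those numbers.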

In the end, Netflix executives themselves say the decision is 70/30 (70 percent experience and instinct, and 30 percent data). It seems that they understand the data better than the WSJ does. There are always limits to data, even with billions of hours of viewership data.

No, data doesn’t make Netflix creative decisions. People do. Data helps.

Data Lie #4: Data are objective.

By now, we should be seeing a pattern. Data might be neutral, but human use of it (and interpretation of it) are not. The more we believe data is objective, the more blind we are to its biases.

Case in point: Netflix tends to recommend shows with black characters to black people. Forbes contributor Adam Candeub switches from gushy to judgy quickly in his article about the apparent racism embedded in the Netflix recommendation algorithm. Netflix defends itself by saying that it does not collect data on race, and that the algorithm responds to user inputs.

Specifically, Netflix responded:

“We don’t ask members for their race, gender or ethnicity, so we cannot use this information to personalize their individual Netflix experience. The only information we use is a member’s viewing history.”

Candeub doesn’t buy it. He argues that Netflix collects physical address data and “can predict” race based on “their data,” that the advertising is “discriminatory,” and that Netflix is “hypocritical,” “evasive,” and “disingenuous.” The implication is that with access to so much data, Netflix has a responsibility to hold itself to a higher standard, share its data with others, and advertise in a race- and gender-neutral manner.

You can agree with Candeub or you can agree with Netflix. It doesn’t matter to the central point: Data are never neutral. The more data you have, the less neutral it is. Collecting more data is like collecting more of any other asset. What responsibilities do huge farms have to the food supply beyond mere profit? What responsibilities do huge news outlets have to the public discourse beyond mere profit? What responsibilities do huge hedge funds have to the financial system beyond mere profit?

To paraphrase: With great data comes great responsibility.

No, Netflix data are not objective because people are not objective.

Data Lie #5: Data are free and easy.

According to some estimates, up to a third of all internet traffic is attributable to one source: Netflix. It’s not hard to understand why, as Phil Nickinson explains in his article. He takes a simple and effective approach to helping the average person understand just how much bandwidth Netflix requires to send streaming high-definition video content to your home. His point is to help people not overextend their data plans and incur overage charges. But there is a bigger issue at play than simple bandwidth.

To this point, we’ve talked a lot about the data Netflix gets back from you as the consumer, but this last lie relates instead to data management and delivery. The insights Netflix gets back from you might be valuable, but from a bandwidth perspective, that metadata is tiny.

Why is this a big deal? Data doesn’t magically float through the air, arrange itself logically, back itself up in the ether, and make itself presentable to you in a useful form at a whim. No, data management is hugely complex. Transmitting data requires costly investments in unsexy telecommunications infrastructure (the real core of the Net Neutrality argument). Storing data requires vast data centers (you could make a compelling argument that data centers are Amazon’s core skill). Retrieving and presenting complex data in a useful way requires immensely powerful software (Oracle and SAP are masters of this).

Netflix makes significant investments in data science not because it wants to, but because it has to. Every major organization has to.

No, all data costs. Good data costs a lot.

Netflix clearly understands its data, how to use it, and its limitations. It’s the pundits who don’t. Before you follow their advice down the rabbit hole, teach yourself the statistics and the data science. You’ll realize quickly how challenging, and how limiting, even “big” data truly is.

In the end, you’ll appreciate data. You’ll use data. But you won’t rely on data.

Editor’s Note: This article is republished with the author’s permission. It was first published on LinkedIn.
