Editor's Note: This week, I am publishing a guest blog post by ... (drum-roll)... our newest Scilogs blogger Martin Angler! Martin will join the Scilogs blogging community in the next few days, on a blog titled 'Algoworld': algorithms meet science. Martin is a science journalist who studied computer science and project management, and who writes about science projects and technologies that can potentially improve our lives, including cyborgs, brain implants and crime prediction. He even has a book on business intelligence! He has been writing code and developing software since the mid-1990s, and can be found on Twitter @martinangler. A warm welcome to Martin!
By Martin Angler
"You've got to give a little, take a little
Pay the price, make the sacrifice"
- Jimmy Cliff, Give a little take a little, 1967
A few days ago, I came across this post by Paige Brown, which contains some interesting ideas on how web personalization algorithms influence and even restrict the information that is presented to us. I'd like to expand a little on the idea of how much personalization costs us - in terms of quality, privacy and money.
100,000 DVDs of Facebook, Every Day
First and foremost, web personalization algorithms analyze large amounts of data. Buzzword warning! BIG data. This data may reside anywhere, from very large databases to social media platforms like Facebook or Twitter. It may be hosted across multiple data centers, or it may sit in just one location. Doesn't matter. Especially when it comes to Facebook, this data is not just big, but huge: according to TechCrunch, Facebook's datacenters grow by at least 500 terabytes per DAY. In 2012, they estimated a total datacenter size of 100 petabytes. One petabyte is one million gigabytes, which alone is the equivalent of 200,000+ DVDs.
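To make the DVD comparison concrete, here is a rough back-of-the-envelope sketch. It assumes a single-layer DVD with about 4.7 GB of capacity and decimal units (1 TB = 1,000 GB); both are my assumptions, not figures from the TechCrunch report.

```python
# Assumption: a single-layer DVD holds roughly 4.7 GB (decimal gigabytes).
DVD_CAPACITY_GB = 4.7
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000


def dvds_needed(gigabytes: float) -> int:
    """Approximate number of DVDs whose combined capacity matches the given size."""
    return round(gigabytes / DVD_CAPACITY_GB)


# Figures quoted in the text: 500 TB of growth per day, 100 PB in total (2012).
daily_growth_dvds = dvds_needed(500 * GB_PER_TB)
one_petabyte_dvds = dvds_needed(1 * GB_PER_PB)
total_2012_dvds = dvds_needed(100 * GB_PER_PB)

print(f"Daily growth: about {daily_growth_dvds:,} DVDs")
print(f"One petabyte: about {one_petabyte_dvds:,} DVDs")
print(f"100 petabytes: about {total_2012_dvds:,} DVDs")
```

Under these assumptions, the daily growth alone comes to a bit more than 100,000 DVDs, and the 2012 total to more than 21 million.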
So big data is an enabler of recommender systems or web personalization. But what enables big data? Storage, lots of it. Consumer storage (USB drives, hard drives) has become cheap lately. But datacenters still cost a lot of money. Why? Because we are talking about secure, high-performance storage devices and about the costs that backups and redundancy create. Storing data twice is rather fail-safe but doubles the storage cost. Not to mention air-conditioning, security and disaster recovery systems, datacenter buildings and their maintenance costs.
There is an environmental cost, too. Datacenters implicitly produce CO2 emissions, that's nothing new. In 2011, Forbes reported that by 2020 data centers will "indirectly produce more carbon emissions than the entire airline industry". Not to mention the immense energy consumption.
But there is more than just the economic and environmental impact. Conclusions drawn from big data, personal recommendations and predictions might actually be wrong.
Kate Crawford is a principal researcher at Microsoft and works mainly on big data science. She is a frequent speaker at big data conferences and often warns about the bias and error-proneness of big data-based conclusions.
Crawford names three factors that determine the quality of the conclusions we can draw from big data: bias, signal quality and scale. No matter what and how much data we collect - as soon as it comes to interpreting it, bias is introduced. "Signal quality" refers to the gaps that are present in the data. Analyzing large sets of data is not easy, as the data contains errors and irrelevant data that will result in wrong conclusions. "Scale" simply means that sometimes the general overview that big data gives us hides important details inside it.
All this applies to recommender systems, too.
Last week, at MIT's EmTech conference, Crawford spoke about how the assessment of our personal data discriminates against us. Moreover, she points out that big data and recommender systems mean the end of our anonymity. Even if we don't consciously share our data, we leave digital traces behind that clearly identify us, just like a fingerprint. Google, for example, is currently working on a technology that avoids cookies but still uniquely identifies us while browsing.
"In data land, we are tracking you 24/7. We know what you like to eat. We know when you sleep. We know about the health of your body and your mind. We'd like to guide your path through the city to make sure that you are avoiding any risky areas that our security algorithms have decided are not quite right for you." - Kate Crawford, Big Data Gets Personal, MIT Technology Review's EmTech Conference 2013
Boon Or Bane?
In her piece, Paige correctly points out Eli Pariser's view on the filter bubble problem. The very nature of web personalization is both its boon and bane. Everybody sees a different reality, based on her or his personal preferences. The recommender algorithms choose autonomously what we see. On the other hand, we would never be able to browse through all the information that we encounter, so we do actually need algorithmic help there. The key to success is probably to keep at least partial control over what we get. Again, as Pariser suggests in his TED talk, we need to take control of our preferences, and relevance as the only criterion for rating search results is certainly not enough.
Earlier this year, I came across a hybrid strategy which could at least partially solve the problem of losing control of these recommendations. Eric Colson, Chief Analytics Officer of the company Stitch Fix, made a bold statement at O'Reilly's Strata conference:
"We choose on behalf of the customers and send the merchandise right to their homes." - Eric Colson, Strata Conference 2013
He then explains how it works. Stitch Fix collects lots of data about the clothes it sells. It also collects lots of data about its customers. Stitch Fix's algorithms then analyze the data and pre-filter it for a team of human designers. This is the QA department of Stitch Fix, if you will. The only difference is: the designers do not improve the products' quality, but the algorithmic quality, by refining the results. The algorithms pick the most suitable clothes and present them to the designers - but they do not yet send them to the customers. That's up to the QA-designer team.
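The two-stage idea described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not Stitch Fix's actual system: the `Item` type, the tag-overlap score and the `reviewer` callback are all hypothetical stand-ins for whatever the company really uses.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Item:
    """Hypothetical catalog entry: a name plus descriptive tags."""
    name: str
    tags: frozenset[str]


def score(item: Item, preferences: set[str]) -> int:
    """Toy relevance score: how many of the customer's preferences the item matches."""
    return len(item.tags & preferences)


def algorithmic_shortlist(catalog: list[Item], preferences: set[str], k: int = 3) -> list[Item]:
    """Stage 1: the algorithm ranks the catalog and keeps the k best matches."""
    ranked = sorted(catalog, key=lambda item: score(item, preferences), reverse=True)
    return ranked[:k]


def hybrid_recommend(
    catalog: list[Item],
    preferences: set[str],
    reviewer: Callable[[list[Item]], list[Item]],
    k: int = 3,
) -> list[Item]:
    """Stage 2: a human reviewer refines the shortlist before anything is shipped."""
    return reviewer(algorithmic_shortlist(catalog, preferences, k))


catalog = [
    Item("denim jacket", frozenset({"casual", "blue"})),
    Item("wool scarf", frozenset({"casual", "winter"})),
    Item("silk blouse", frozenset({"formal"})),
    Item("blue sneakers", frozenset({"casual", "blue", "sport"})),
]
preferences = {"casual", "blue"}


def stylist(shortlist: list[Item]) -> list[Item]:
    """Stand-in for human judgment: drop winter items for a customer in a warm climate."""
    return [item for item in shortlist if "winter" not in item.tags]


shortlist = algorithmic_shortlist(catalog, preferences)
final_picks = hybrid_recommend(catalog, preferences, stylist)
print([item.name for item in final_picks])
```

The point of the design is that the algorithm only narrows the field; the final decision stays with a person who can apply context the data does not capture.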
I think this hybrid approach is a serious advance for recommendation algorithms, because it adds human common sense to algorithmic rules. However, one problem remains: If you want to buy these clothes, you have to give away a lot of your personal data. You are not giving this data to your family or to your best friend. You are giving it to strangers who might do with it whatever they see fit. Recommendation has a cost, and that cost is called personal data.
Perfection Is Far Away
Are the current recommender systems flawless? Nowhere near. Let me give you an example. Last year, I bought a couple of suitcases on Amazon. DAYS after they were delivered, I received an email from Amazon, recommending the same type of suitcase to me. I mean, what is the probability that I am going to buy the same suitcases again just days after I have bought them? For me, zero.
Are you looking for a book? How about reading your own?
An isolated case, you say? Nope! In 2011, I wrote a book about business intelligence. Since then, Amazon has continually been sending me recommendation emails containing my own book. The solution to this problem looks pretty trivial, right? If the algorithm had compared the recipient's identity to that of the producer/author, I would never have received these mindless recommendations.
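Both anecdotes come down to a missing post-filter. Here is a minimal sketch of what such a check might look like, assuming the system knows each product's creators and each customer's purchase history. The `Product` type, its fields and all IDs are hypothetical; I have no insight into how Amazon's pipeline is actually built.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """Hypothetical product record with the IDs of its authors or producers."""
    product_id: str
    title: str
    creator_ids: frozenset[str]


def filter_recommendations(
    candidates: list[Product],
    recipient_id: str,
    purchased_ids: set[str],
) -> list[Product]:
    """Drop items the recipient already bought or created themselves."""
    kept = []
    for product in candidates:
        if product.product_id in purchased_ids:
            continue  # just bought it, unlikely to buy it again days later
        if recipient_id in product.creator_ids:
            continue  # the recipient wrote or produced this item
        kept.append(product)
    return kept


candidates = [
    Product("p1", "Suitcase set", frozenset({"luggage-maker"})),
    Product("p2", "A business intelligence book", frozenset({"author-42"})),
    Product("p3", "Travel guide", frozenset({"publisher-7"})),
]

# The recipient is the book's author and recently bought the suitcases.
recommendations = filter_recommendations(
    candidates, recipient_id="author-42", purchased_ids={"p1"}
)
print([product.title for product in recommendations])
```

A real system would need more nuance: consumables such as coffee or printer ink are exactly the things people do buy again, so a blanket "already purchased" rule would be too blunt there.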
Everything Comes At A Cost - Even "Free" Services
Free web services are never really free. We pay for the pre-selection of products and services not with cash, but with our personal data. Sometimes we do this willingly and consciously, and sometimes we are not aware of the digital footprints we leave behind. We also pay for these services by allowing recommender systems to narrow our views and to show us a biased version of reality. We trade freedom and objectivity for personalized recommendations.
However, I acknowledge there is more data surrounding us than we could ever handle. We need to pre-filter it. Or someone else pre-filters it for us. In the latter case, we need to gain insight into that process and be able to control parts of it, just as Eli Pariser suggests.
Until that happens, let's always bear in mind N. Gregory Mankiw's first principle of economics: people face trade-offs. This has never been more true than it is for web personalization.
Did you receive weird or funny recommendations on Facebook, Twitter, Amazon, Google, etc.? I would love to hear about them on Twitter @martinangler #algoworld!