Web scraping is the Internet’s Wayback machine. There are dozens of web scraping techniques, all of which retrieve web pages automatically over the Hypertext Transfer Protocol (HTTP). In this article, we’ll take a look at the pros and cons of web scraping.
Web Scraping Pros
As a data collection tool, web scraping has a lot going for it.
Web scraping is an essential service for many websites. A website like Kayak uses web scraping to extract the latest prices of flights and hotels for easy comparison on their site. Without web scraping, sites and services like Kayak would not exist.
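To make the price-comparison idea concrete, here is a minimal Python sketch. It is not how Kayak or any real service works; the site names, the sample HTML, and the `price` class name are all assumptions made up for illustration, and a real scraper would first fetch each page over HTTP.

```python
from html.parser import HTMLParser

# Hypothetical markup: real travel sites use their own (and often changing) HTML.
SAMPLE_PAGES = {
    "airline-a.example": '<div><span class="price">$129.00</span></div>',
    "airline-b.example": '<div><span class="price">$98.50</span></div>',
}


class PriceParser(HTMLParser):
    """Collects the numeric value of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_price and text:
            # Drop the currency symbol and thousands separators before converting
            self.prices.append(float(text.lstrip("$").replace(",", "")))


def extract_prices(html):
    """Return all prices found in one page of HTML."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


def cheapest(pages):
    """Return (site, price) for the lowest price found across all pages."""
    offers = [(site, p) for site, html in pages.items() for p in extract_prices(html)]
    return min(offers, key=lambda offer: offer[1])
```

Calling `cheapest(SAMPLE_PAGES)` picks the lowest fare across the sample pages, which is the core of what a comparison site shows its visitors.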
Web scraping can gather company data across multiple websites. For instance, scraping the website of an engineering firm could yield the names of executives, managers, and engineers. And scraping LinkedIn, social media, and news sources can add further meaningful information to the data collected by scraping the firm’s website.
Web scraping is a great tool for monitoring the health of a company’s brand. Companies don’t usually spend a lot of time manually tracking down the ratings their customers give them. There are so many rating platforms and systems that companies use web scraping as an automated tool to extract reviews in aggregate. Then they run sentiment analysis on the data, which helps them figure out which customers love them and which hate them, and respond quickly to the feedback they receive.
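As a rough illustration of sentiment analysis on scraped reviews, the sketch below scores each review by counting words from two tiny hand-made word lists. The word lists and function names are assumptions for this example only; production systems typically use trained models rather than a fixed lexicon.

```python
import re

# Hypothetical mini-lexicons; a real system would use a far larger list or a model.
POSITIVE = {"great", "love", "excellent", "fast", "friendly"}
NEGATIVE = {"terrible", "hate", "slow", "rude", "broken"}


def sentiment_score(review):
    """Positive word count minus negative word count for one review."""
    words = re.findall(r"[a-z']+", review.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def triage(reviews):
    """Group reviews into positive, negative, and neutral buckets."""
    buckets = {"positive": [], "negative": [], "neutral": []}
    for review in reviews:
        score = sentiment_score(review)
        if score > 0:
            buckets["positive"].append(review)
        elif score < 0:
            buckets["negative"].append(review)
        else:
            buckets["neutral"].append(review)
    return buckets
```

A company could feed the negative bucket to its support team first, which is the "quickly respond" step described above.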
Web scraping gives businesses smart data rather than big data. Everyone in data analysis and data science is familiar with big data and the complex mathematical tools needed to extract and analyze it. But web scraping gives businesses smart data that it collects from “just” the websites relevant customers use. A company that sells spare parts needs to know what potential customers are paying for the specific parts it has in inventory, so it can price them competitively.
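The spare-parts example can be sketched as a small pricing rule. This is only one possible rule, and every number in it is an assumption: undercut the cheapest scraped competitor price by a cent, but never drop below cost plus a minimum margin.

```python
def competitive_price(competitor_prices, unit_cost, undercut=0.01, min_margin=0.10):
    """Suggest a price just below the cheapest competitor.

    competitor_prices: prices scraped from competitor listings (assumed input)
    unit_cost: what the part costs us
    undercut: how far below the cheapest competitor to price
    min_margin: minimum markup over cost, as a fraction (0.10 = 10 percent)
    """
    floor = round(unit_cost * (1 + min_margin), 2)
    if not competitor_prices:
        # Nothing scraped: fall back to the minimum acceptable price
        return floor
    target = round(min(competitor_prices) - undercut, 2)
    # Never price below the margin floor, even if a competitor does
    return max(target, floor)
```

For a part that costs 15.00 with competitors at 24.99, 22.50, and 27.00, the rule suggests a price one cent under 22.50; if a competitor were selling at 16.00, it would hold at the margin floor instead of following them down.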
Web scraping provides the data needed for machine learning. Do you want to predict the stock market? You will get your data from web scraping. Do you want to predict your competitor’s prices? Your data will come from web scraping.
SEO tools like Ubersuggest and SEMRush get their data from web scraping. And web scraping is one of the best tools for testing a website’s performance.
Web scraping is easy to implement, cost-effective, accurate, and low-maintenance. Not every application of web scraping, however, yields undiluted good for the world. Some uses of web scraping are downright evil, like the infamous case of Clearview AI.
When Web Scraping Goes Bad
The Clearview AI incident is one of the most widely cited examples of web scraping put to ethically dubious uses.
Clearview AI is a small tech startup that helps law enforcement agencies match photos of unknown people to their images on social media. The algorithm operated by Clearview AI was designed by an Australian techie named Hoan Ton-That, whose best-known accomplishment prior to Clearview AI had been developing an app that enabled users to put President Donald Trump’s hair on their own photos.
Hoan’s subsequent ground-breaking facial recognition app, Clearview AI, allowed users to take a photo of a person and use web scraping to find public photos of that person, along with links to the web pages where those photos appeared. Clearview AI claims to have a database of over 3 billion photos scraped from Facebook, Venmo, YouTube, and millions of other websites.
Police report that they have used Clearview AI to solve shoplifting, identity theft, credit card fraud, child sexual abuse, and murder cases.
Larger companies with similar technological capabilities, such as Google, have declined to release a similar product because of the potential for abuse. But Clearview AI claims to have leased the software to at least 600 police agencies and a number of private firms. The New York Times analyzed the source code for the app and found that it can be adapted to augmented-reality glasses. Wearers of these glasses would be able to identify the name and home address of anyone they happened to see.
Facial recognition technology makes people anxious about Big Brother. The anxiety about Clearview AI’s app is amplified by the fact that its ability to protect confidential data is untested. Investors in the company, such as David Scalzo, are blasé about that risk.
“I’ve come to the conclusion that because information constantly increases, there’s never going to be privacy,” Mr. Scalzo told the New York Times. “Laws have to determine what’s legal, but you can’t ban technology. Sure, that might lead to a dystopian future or something, but you can’t ban it.”
Clearview AI’s web scraping tool has not been banned, even though it is only accurate about 75 percent of the time.
Web Scraping Gone Mild
If you ever fly from city to city in Europe, chances are that you have at least considered taking a low-cost flight with Ryanair. In a scathing article entitled “Snarling All the Way to the Bank,” a respected British business magazine described how Ryanair’s “cavalier treatment of its low-fare seeking passengers” had given Ryanair “an entirely deserved reputation for abject nastiness” and said that the airline “has become a byword for appalling customer service.”
However fair or unfair these and thousands of similar remarks may be to the airline, one fact about Ryanair is not in doubt: no airline flying point-to-point in Europe offers lower fares. Ryanair calculates those fares with the help of web scraping. Ryanair’s principal offering to its customers is the lowest price, made possible with web scraping.
Like any other technological tool, web scraping can be used for morally good, bad, or indifferent ends. The key to web scraping is using it effectively so that its results, unlike those of Clearview AI, are never accidental.