As we continue to live in a world increasingly driven by the internet and the massive amounts of data gushing out of it, we are well aware by now that this has led to significant discussion and action around data privacy and protection. This could also be a never-ending process, as sophistication levels and data volumes only grow.
As the largest generators or sources of this data, social media platforms have naturally been held largely responsible for certain ‘lapses’ on their part. Another avenue that makes data easily available at the click of a button hasn’t quite come up for discussion, however, and this pertains to web ‘scraping’ and web ‘crawling’.
Defining the terms
Before we proceed any further, and whilst many of us may have actually used the open source tools out there, let’s take a look at these terms for the benefit of all.
Web scraping is an automated process used to efficiently collect and download large volumes of very targeted data from various websites, and it invariably involves accessing a site that is hosted by another company. This data, when put to good use and with good intent, can be used to increase the accuracy of predictions. For example, a web scraper may be used to extract weather forecast data for further analysis. For those who would like to experience the concept and get a kick out of extracting nuggets of data from massive websites, some examples of web scraping tools are import.io, ScrapingExpert, webhose.io, and several more.
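To make the idea concrete, here is a minimal Python sketch of the weather example, using only the standard library. The HTML snippet, the `day` and `temp` class names, and the function name `scrape_forecast` are all assumptions made up for illustration; a real scraper would download the page from a site it is permitted to access, and the markup would differ.

```python
from html.parser import HTMLParser

# Hypothetical forecast page, held in a string so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><td class="day">Mon</td><td class="temp">31</td></tr>
  <tr><td class="day">Tue</td><td class="temp">29</td></tr>
  <tr><td class="day">Wed</td><td class="temp">33</td></tr>
</table>
"""


class ForecastParser(HTMLParser):
    """Collects (day, temperature) pairs from cells with the assumed class names."""

    def __init__(self):
        super().__init__()
        self.current = None  # class of the cell being read, if it is one we want
        self.day = None      # most recent day label seen
        self.rows = []       # extracted (day, temperature) pairs

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "td" and cls in ("day", "temp"):
            self.current = cls

    def handle_data(self, data):
        text = data.strip()
        if not text or self.current is None:
            return  # skip whitespace and cells we are not targeting
        if self.current == "day":
            self.day = text
        elif self.current == "temp":
            self.rows.append((self.day, int(text)))

    def handle_endtag(self, tag):
        if tag == "td":
            self.current = None


def scrape_forecast(html):
    """Return the targeted values from the page as a list of tuples."""
    parser = ForecastParser()
    parser.feed(html)
    return parser.rows


forecast = scrape_forecast(SAMPLE_HTML)
# The "further analysis" step: here, simply the average forecast temperature.
average = sum(temp for _, temp in forecast) / len(forecast)
```

The point of the sketch is how targeted scraping is: everything on the page is ignored except the few cells the scraper was written to look for.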
Web crawling involves using automation and tools to download a webpage’s data, extract any hyperlinks it contains and follow them, with the data being stored in an index or database to make it easily searchable. Web crawling could thus be used to crawl data from various websites and build a search engine (e.g. Googlebot, which is Google’s own web crawler).
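The crawling loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any production crawler works: the three-page site in `PAGES`, the `example.test` addresses and the `crawl` function are hypothetical, and the pages are held in memory rather than fetched over the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site; a real crawler would fetch each URL over HTTP.
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a> home page',
    "http://example.test/a": '<a href="/b">B</a> weather data',
    "http://example.test/b": '<a href="/">Home</a> privacy data',
}


class LinkAndTextParser(HTMLParser):
    """Gathers the hyperlinks and the lower-cased words found on one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start_url, fetch):
    """Breadth-first crawl from start_url; returns a word -> set of URLs index."""
    index = {}
    seen = {start_url}          # URLs already queued, so no page is visited twice
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)       # download step (here a dictionary lookup)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:
            index.setdefault(word, set()).add(url)  # store for later searching
        for href in parser.links:
            target = urljoin(url, href)             # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


index = crawl("http://example.test/", PAGES.get)
```

Looking up a word such as `index["data"]` then returns the pages on which it appears, which is the searchable index in miniature.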
The above would suggest, prima facie, that web scraping or crawling carries no real downsides and poses no threats. Unfortunately this is no longer the case, not with the way brands operating in a fiercely competitive marketplace are clawing at each other and leaving few stones unturned. In fact, this often leaves one wishing that the same effort were directed towards product or service genius, which would bring greater brand and product differentiation in certain segments and greater end benefits for the consumer.
I recall brainstorming with a prospective client some months ago who was seeking to target high net worth individuals. He proposed using a database he had put together, which he confessed had been built by scraping certain websites. That practice is certainly not ethical, and hence neither recommended nor advisable, as it only amounts to abusing data and privacy.
Whose Data is it anyway?
When we share data on social media in the form of updates, images, likes and comments, crawling and scraping of that data is a common activity on the internet, subject of course to the privacy settings that the concerned platforms offer and that we choose to exercise for ourselves.
There has been a well-known and documented courtroom battle between LinkedIn and HiQ, a San Francisco based data mining startup that helps employers predict which of their employees are likely to quit, and which built its database from public user profiles on LinkedIn. Both parties knew that HiQ was scraping data from LinkedIn to offer its core service, yet the relationship came to an abrupt end after three years and led to the courtroom, with the social media networking giant claiming that HiQ was a hacker. Both sides have their side of the story, with LinkedIn in the first place allowing its data to be scraped and later taking an aggressive stance, claiming that it was protecting its members’ privacy and data security.
Likewise, many tech giants are collecting mountains of users’ location data, in ways many consumers don’t realise and sometimes can’t avoid. This is done through apps, beacons and other sophisticated wireless tech, with the best of intentions of course: gathering insights at an individual level and thus communicating with greater relevance. As with several smartphone apps, Facebook, Messenger, WhatsApp and Instagram also attempt to capture our location and activity across devices throughout the day, from our reading habits to Spotify playlists during the commute and social browsing at night.
Responsibility as Users of Data
In the final analysis, whilst consumers (including ourselves) manage privacy levels through settings, it boils down to our maturity as responsible users of data, or marketers, to respect data privacy over data greed. We must manage our roles with efficient and effective use of the increasing data available, without resorting to ‘unfair practice’ and scraping data, all towards better marketing and ‘acquiring insights rather than inciting the end consumer’.
Salim is the Founder and Managing Director of On-Target Marketing Solutions, Malaysia, a digital and data analytics company. Given his immense passion for data, he has recently combined over three decades of marketing experience with a Data Science Certificate course. He is contactable at firstname.lastname@example.org