Another data company is under investigation. What did the crawler do wrong?


On the afternoon of September 6, several insiders said that Hangzhou Magic Scorpion Data Technology Co., Ltd., a well-known big data service company in Hangzhou, was suspected to have been placed under the control of law enforcement officers, and that one of its core executives, surnamed Zhou, had been taken away by the police.

The above is a piece of news that spread through the technology community a few days ago. Another data company is being investigated, and many data practitioners and crawler developers sighed: “use crawlers well, enter XX early; play slick with data, eat your fill of XX”.

As a data service company, Magic Scorpion Technology was already named in the 2017 article “Ferocious crawlers: scraping Alipay, scraping WeChat, and stealing cash loan data”.

Of course, as to why Magic Scorpion Technology is being investigated, we can wait for the results from the law enforcement department. Let’s not speculate without evidence here.

Today I want to talk about the legality of crawlers through a few cases: how to be a crawler developer without crossing the red line.


As a computer technology, crawlers are technically neutral, and crawler technology has never been prohibited by law. Its history goes back some 20 years. Search engines, aggregated navigation, data analysis, artificial intelligence, and other businesses all rely on crawler technology.

But crawling is also a means of acquiring data, and some data is sensitive. If you cannot tell which data may be crawled and which data crosses the red line, the next person in the news may be you.

How to define the legality of crawlers is not explicitly stipulated at present, but after reading a large number of articles, incidents, talks, and judicial cases, I have summed up three key points: how the data is collected, the collection behavior, and the purpose of use.

Data Acquisition Methods

How the data is obtained is the most important point. In general, collecting data that is unpublished, unauthorized, and contains sensitive information is illegal, no matter what channel is used.

So before collecting this kind of sensitive data, such as users’ personal information or data from other business platforms, it is best to check the relevant laws and regulations and find a suitable, lawful way to obtain it.

Personal data

Collecting and analyzing personal data is something almost every Internet company does today, but most personal data is private. If you want to obtain it, you must go through legal channels. See Article 41 of the “Cybersecurity Law”:

In collecting and using personal information, network operators shall abide by the principles of legality, legitimacy, and necessity, make public the rules for collection and use, clearly state the purpose, method, and scope of collecting and using the information, and obtain the consent of the person whose information is collected.

That is to say, the user must be notified in advance of the method, scope, and purpose of collection, and the data may be collected and used only after the user has authorized or consented to it. This is the information collection section of the user agreements we commonly see on websites and apps.

Relevant negative cases:

On August 20, Peng Mei News learned from the Yuecheng District Public Security Bureau of Shaoxing City that the bureau had recently solved an extraordinarily large traffic hijacking case. Beijing Ruizhi Huasheng Science and Technology Co., Ltd., a company listed on the New Third Board, was suspected of illegally stealing 3 billion pieces of user personal information, involving 96 Internet companies in China, such as Baidu, Tencent, Ali and Jingdong. So far, the police have arrested six suspects from the company and its affiliated companies.
While cooperating with legitimate operators, Beijing Ruizhi Huasheng and its affiliated companies added illegal software to clean the traffic and obtain users’ cookies.

Excerpt from Peng Mei News: “New Third Board Listed Company Steals 3 Billion Pieces of Personal Information, Illegally Making More Than 10 Million Yuan”

Public data

There is no problem with data collected from legal, open channels where collection does not obviously go against the wishes of the personal information subject. But if the data is obtained through cracking, intrusion, or other “hacker” methods, there are laws waiting for you.

Paragraph 2 of Article 285 of the Criminal Law:

Whoever, in violation of State regulations, intrudes into a computer information system other than those stipulated in the preceding paragraph, or uses other technical means, to obtain data stored, processed or transmitted in that computer information system, or exercises unlawful control over that computer information system, shall, if the circumstances are serious, be sentenced to fixed-term imprisonment of not more than three years or criminal detention, and shall also, or shall only, be fined; if the circumstances are especially serious, the offender shall be sentenced to fixed-term imprisonment of not less than three years but not more than seven years, and shall also be fined.

Violation of Robots Protocol

Compliance with the Robots protocol is not legally mandatory, but it is an industry convention, and following it gives you support for the legitimacy of your crawling.

Because the Robots protocol is a set of instructions, a Disallow rule means the platform clearly wants to protect that page data, so think carefully before you crawl it.
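
As a rough illustration of checking Disallow rules before crawling, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the user agent name `my-crawler`, and the example.com URLs are all made up for the example; a real crawler would fetch the site's own robots.txt instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. A real crawler
# would load the site's own file, e.g. with set_url(...) followed by read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /user/
Disallow: /order/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules do not disallow this URL for this agent."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
for url in ("https://example.com/news/1", "https://example.com/user/42"):
    verdict = "allowed" if is_allowed(parser, "my-crawler", url) else "disallowed"
    print(url, "->", verdict)
```

A check like this only tells you what the site has declared. Whether crawling an allowed page is lawful still depends on the data itself and on how it is used.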

Data Acquisition Behavior

We should exercise restraint in how we use technical means. For behavior that can easily interfere with or even damage servers and businesses, we should carefully gauge how much load the other side can bear. After all, not every company operates at the BAT (Baidu, Alibaba, Tencent) level.

High-concurrency pressure

Engineers tend to focus on optimization, and crawler development is no exception: we try every means to raise concurrency and request throughput. But very high concurrency produces request volumes close to a DDoS attack. If it puts pressure on the other party’s servers and affects their normal business, you should be on guard.

If serious consequences occur, see Article 286 of the Criminal Law:

Whoever, in violation of state regulations, deletes, modifies, adds to, or interferes with the functions of a computer information system, so that the system cannot operate normally and the consequences are serious, commits a crime.

So when crawling, even if the site has no anti-crawling restrictions, do not recklessly turn up concurrency. Weigh what the other party’s servers can handle.
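
One simple way to hold back request volume is to enforce a minimum interval between consecutive requests. The sketch below is only an illustration: the `Throttle` class, its parameters, and the example.com URLs are hypothetical names, not part of any library, and a suitable interval depends entirely on the target site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Sleep until min_interval has passed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


# Usage sketch: the URLs are placeholders and no request is actually sent.
throttle = Throttle(min_interval=0.01)
for url in ("https://example.com/a", "https://example.com/b"):
    throttle.wait()
    print("would fetch", url)  # a real crawler would issue the request here
```

Calling `wait()` before every request keeps the crawler to at most one request per interval from a single worker, which is usually a safer starting point than raising concurrency first.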

Affecting normal business

Besides high-concurrency requests, some other behaviors also affect the business, such as automated order snatching, which degrades the experience of normal users.

Purpose of data use

The purpose of data use is also a key point. Even if you collect data through legal means, using it improperly can still be illegal.

Beyond the agreed use

One situation is that data collected openly is not used for the purpose announced beforehand. For example, the user agreement says user behavior is analyzed only to help improve the product experience, but the company ends up selling user profile data.

Another situation involves works protected by intellectual property rights and copyright. They may be available to download or quote, but with a clearly stated scope of use, such as no reprinting, no commercial use, and no misappropriation. These restrictions are protected by law, so pay attention to how you use such works.

Other situations are not listed here one by one.

Selling Personal Information

As for the sale of personal information, do not do it. It is prohibited by law. See also:

According to Article 5 of the Interpretation of the Supreme People’s Court and the Supreme People’s Procuratorate on Several Issues Concerning the Application of Law in Handling Criminal Cases of Infringing on Citizens’ Personal Information, the following constitute the “serious circumstances” required by the “crime of infringing upon citizens’ personal information”:
(1) Illegally obtaining, selling or providing more than 50 pieces of trajectory information, communication content, credit information or property information;
(2) Illegally obtaining, selling or providing more than 500 pieces of accommodation information, communication records, health and physiological information, transaction information or other personal information of citizens that may affect personal and property security;
(3) Illegally obtaining, selling or providing more than 5,000 pieces of personal information of citizens other than those covered by the two preceding items.
In addition, providing lawfully collected personal information of citizens to others without the consent of the person it was collected from also counts as the “provision of personal information of citizens” stipulated in Article 253-1 of the Criminal Law, and may constitute a crime.

Unfair Business Conduct

If you crawl a competitor’s data and use it for your own company’s business, it may constitute unfair competition or an infringement of intellectual property rights.

This situation is quite common in commercial lawsuits involving crawlers. In a well-known case two years ago, the app “Car Comes” scraped bus data from its competitor “Coomick” and displayed it in its own product:

Although the bus is a means of public transport, and its real-time route and running time are merely objective facts, once such information has been collected, analyzed, edited, integrated, precisely matched with GPS positioning, and used as the background data of public transport query software, it has practical value and can bring the rights holder real or potential, current or future economic benefits. It therefore has the attribute of intangible property. Yuanguang Company’s use of web crawler technology to obtain and use, free of charge, the real-time bus information data of the “Coomick” software is an act of “reaping without sowing” and “fattening oneself at others’ expense”, which constitutes unfair competition.

Excerpt from “Civil Judgment (2017) Yue 03 Min Chu No. 822 of the Shenzhen Intermediate People’s Court”

Crawler regulations coming soon

The good news is that the relevant measures are on the way.

At 0:00 on May 28, the state Internet Information Office released the “Data Security Management Measures (Draft for Comments)”.

I have read this draft. It contains provisions on data collection, storage, transmission, and use, including some on crawler behavior (it is still in the comment stage, so there may be changes in the future).

For example, Chapter II, Article 16:

Network operators that use automated means to access and collect website data must not interfere with the normal operation of the website. Where such behavior seriously affects the website’s operation, for example when the traffic from automated access and collection exceeds one third of the website’s average daily traffic, the operator shall stop the automated access and collection when the website requests it.
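
The one-third figure in this article can be turned into a simple self-check. The sketch below assumes traffic is counted in requests per day, which is my own simplification since the draft does not say how traffic should be measured. The function name and the sample numbers are hypothetical.

```python
def exceeds_one_third(crawler_daily_requests: int,
                      site_avg_daily_requests: int) -> bool:
    """Return True if crawler traffic is more than one third of the site's
    average daily traffic (both counted in requests, an assumption)."""
    if site_avg_daily_requests <= 0:
        raise ValueError("site_avg_daily_requests must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return crawler_daily_requests * 3 > site_avg_daily_requests


# Made-up numbers: a site averaging 90,000 requests per day.
print(exceeds_one_third(30_000, 90_000))  # exactly one third, not "more than"
print(exceeds_one_third(30_001, 90_000))  # just over one third
```

In practice a crawler rarely knows the target site's real traffic, so a check like this can only work from an estimate, and staying far below the threshold is the safer choice.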

Chapter III Article 27:

Before providing personal information to others, network operators shall assess the possible security risks and obtain the consent of the personal information subject, except in the following cases:
(1) The information is collected from lawful and open channels and does not obviously go against the wishes of the personal information subject;
(2) The personal information subject has voluntarily made the information public;
(3) The information has been anonymized;
(4) It is necessary for law enforcement organs to perform their duties according to law;
(5) It is necessary to safeguard national security, social and public interests, or the life safety of the personal information subject.

Excerpt from “Data Security Management Measures (Draft for Comments)”


This is the end of my look at the legality of crawlers. There are many cases and perspectives that the article does not cover, and some of its viewpoints and conclusions may be wrong.

But I hope it gives some inspiration to crawler developers and other developers alike: technology is neutral, but its use can be good or bad, so we must use it reasonably, rigorously, and cautiously.

This article is original content, first published on the WeChat official account “Life oriented programming”. If you need to reprint it, please leave a message in the official account.


