There are many learning resources for Python data analysis on the Internet, and they fall mainly into two categories:
One recommends resources, for example lists of books and courses and the order in which to study them;
The other provides specific learning content, such as knowledge points or practical cases.
But much of this content is tedious and messy. For beginners it mostly adds noise, and very few resources give clear direction.
As a result, many people jump in without a clear direction: they study for a long time without knowing what they are learning or what they can do with it.
Before you learn a technology, you should know what you want to achieve with it.
That is, what problems do you want to solve with this technology? Once you know that, you can see how its knowledge system is organized to reach that goal,
and which problems each part is meant to solve. Only with a clear goal can you learn the most useful knowledge first and avoid irrelevant information that lowers your learning efficiency.
There are many application scenarios for data analysis:
- You need to do research to understand the overall market, learn about competitors, and carry out feasibility analysis
- Your department produces a lot of data, which you need to use to optimize products, marketing, and technical solutions
- You need to analyze products, business, and users, draw out important conclusions, and give sound decision suggestions to your superiors
From these common scenarios, you can derive the basic process of a data analysis project.
Generally, a data analysis project can be carried out in the following steps: “data acquisition → data storage and extraction → data preprocessing → data modeling and analysis → data reporting”.
Following this process, each part requires mastering the following knowledge points:
What is an efficient learning path? It is to learn in this order. You will know what you need to accomplish in each part, which knowledge points you need to learn, and which knowledge is not needed for now.
Then every time you finish a part, you will have some actual output, positive feedback, and a sense of achievement, and you will be willing to spend more time on it. With problem solving as the goal, efficiency will naturally not be low.
Next, we will go through each part, explain what to learn and how to learn it, show the main knowledge points of each part in a structured way, and recommend specific learning resources.
How to get data
The data we analyze generally includes internal data and external data.
Internal data is generated by our own business operations, such as user data, product data, sales data, and content data.
Internal data is usually more complete and better structured. The data behind routine analyses, such as work reports and product optimization, generally comes from here.
You can ask the company’s technical staff for it, or extract it from the database yourself.
Of course, we often need external data as well.
For example, for market research, competitor analysis, or published reports, external data is essential, and it also helps us draw more conclusions.
1. Open data sources
UCI: the classic open data sets of the University of California, Irvine, used by many machine learning laboratories.
National data: data from the National Bureau of Statistics of China, covering China’s economy and people’s livelihood.
CEIC: economic data for more than 128 countries, where you can find in-depth data on GDP, CPI, imports and exports, international interest rates, and so on.
China Statistical Information Network: the official website of the National Bureau of Statistics, which collects a large amount of statistical information on national economic and social development from all levels of government.
Government data websites: many provinces and cities, such as Beijing, Shanghai, Guangdong, and Guizhou, now publish government data on dedicated open data websites, for example “Beijing government data open”.
2. Web crawler
With data crawled from the Internet, you can analyze a particular industry or population. For example:
Job data: Lagou, Liepin, 51job, Zhilian
Financial data: IT Juzi, Xueqiu
Real estate data: Lianjia, Anjuke, 58.com
Retail data: Taobao, Jingdong, Amazon
Social data: Weibo, Zhihu, Twitter
Film data: Douban, Mtime, Maoyan
Before crawling, you need some basic Python knowledge: data types (lists, dictionaries, tuples, etc.), variables, loops, and functions (the Runoob tutorial is a good place to start),
and how to use mature Python libraries (urllib, BeautifulSoup, requests, Scrapy) to implement a web crawler.
After mastering basic crawlers, you need some more advanced skills,
such as regular expressions, simulating user login, using proxies, limiting the crawl frequency, and using cookies, to deal with the anti-crawler restrictions of different websites.
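As a rough illustration of these skills, the sketch below uses only the standard library: `urllib.request` with a custom User-Agent header and a fixed pause between requests, plus a regular expression that pulls fields out of the HTML. The `job-title` markup, the pattern, and the helper names `fetch` and `extract_jobs` are hypothetical. A real site needs its own pattern, and BeautifulSoup or Scrapy is usually more robust than regular expressions for parsing.

```python
import re
import time
import urllib.request

# Hypothetical markup: <a class="job-title" href="...">Title</a>
JOB_PATTERN = re.compile(r'<a class="job-title" href="([^"]+)">([^<]+)</a>')


def fetch(url, delay_seconds=1.0):
    """Download one page: send a User-Agent header, then pause."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    time.sleep(delay_seconds)  # limit the crawl frequency
    return html


def extract_jobs(html):
    """Return a list of (link, title) pairs found in the HTML text."""
    return [(link, title.strip()) for link, title in JOB_PATTERN.findall(html)]


# Offline demonstration on an inline HTML snippet (no network access needed).
sample_html = """
<ul>
  <li><a class="job-title" href="/jobs/1">Data Analyst</a></li>
  <li><a class="job-title" href="/jobs/2"> Python Engineer </a></li>
</ul>
"""
jobs = extract_jobs(sample_html)
print(jobs)
```

In practice you would call `fetch` on each page URL and pass the result to `extract_jobs`; the pause keeps the request rate low enough to avoid overloading the site.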
Crawling is the most flexible and effective way to get data, but the learning cost is relatively high.
At first, it is better to use public data for analysis, and to start crawling only when you need more data.
By then you will have mastered the Python basics, and getting started with crawlers will be easier.
3. Other data acquisition methods
If you can’t write a crawler yet but need to collect data, you can try various data collection tools, which can crawl information without programming knowledge, such as LocoySpider and Octoparse.
Many data competition websites also publish good data sets, such as Kaggle abroad and DataCastle and Tianchi in China.
These are real business data sets, usually of considerable size, and are worth collecting and organizing regularly.
△ common data acquisition methods
Data storage and extraction
Database skills are included here because they are essential for data analysts.
Most companies will require you to have basic skills in operating and managing databases so that you can extract and analyze data.
SQL, the classic language of relational databases, makes it possible to store and manage massive amounts of data.
MongoDB is a newer, non-relational database. Mastering either one is enough.
SQL is recommended for beginners. You need to master the following skills:
1. Querying and extracting data under specific conditions: the data in an enterprise database is bound to be large and complex, and you need to extract the part you want.
For example, depending on your needs, you can extract all the sales data for 2017, the data for this year’s 50 best-selling products, or the consumption data of users in Shanghai and Guangdong.
2. Adding, deleting, and modifying data: these are the most basic database operations, and they can be done with simple commands.
3. Grouping and aggregating data, and relating multiple tables: this part covers the more advanced database operations, including associations between multiple tables.
It is very useful when you are dealing with multiple dimensions and multiple data sets, and it lets you handle more complex data.
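These three skills can be practiced without installing a database server. The sketch below uses Python’s built-in `sqlite3` module with an in-memory database. The `users` and `sales` tables, their columns, and the sample rows are made up for illustration, and MySQL syntax differs in small ways.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create two related tables (hypothetical schema).
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, city TEXT)")
cur.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, user_id INTEGER, "
    "product TEXT, amount REAL, sale_date TEXT)"
)

# Add rows.
cur.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "Shanghai"), (2, "Guangdong"), (3, "Beijing")],
)
cur.executemany(
    "INSERT INTO sales (user_id, product, amount, sale_date) VALUES (?, ?, ?, ?)",
    [
        (1, "A", 100.0, "2017-03-01"),
        (2, "B", 250.0, "2017-07-15"),
        (3, "A", 80.0, "2018-01-20"),
        (1, "B", 40.0, "2017-11-11"),
    ],
)

# 1. Query: all sales made in 2017.
sales_2017 = cur.execute(
    "SELECT product, amount FROM sales "
    "WHERE sale_date BETWEEN '2017-01-01' AND '2017-12-31' ORDER BY sale_id"
).fetchall()

# 2. Modify and delete rows.
cur.execute("UPDATE sales SET amount = 90.0 WHERE sale_id = 3")
cur.execute("DELETE FROM sales WHERE amount < 50")

# 3. Join the tables, then group and aggregate: total sales per city.
totals_by_city = cur.execute(
    "SELECT u.city, SUM(s.amount) AS total "
    "FROM sales AS s JOIN users AS u ON s.user_id = u.user_id "
    "GROUP BY u.city ORDER BY total DESC"
).fetchall()

print(sales_2017)
print(totals_by_city)
conn.close()
```

Storing dates as ISO-formatted text keeps the range filter simple, because such strings sort in chronological order.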
Databases sound intimidating, but the subset of skills needed for data analysis is actually quite simple.
Of course, it is still recommended that you find a data set and practice on it, even if only the most basic query and extraction operations.
△ MySQL knowledge framework
△ mongodb knowledge framework
Data cleaning and pre analysis
Most of the time, the data we get is not clean: it contains duplicates, missing values, outliers, and so on.
It is then necessary to clean the data and deal with the records that would distort the analysis, so as to obtain more accurate results.
Take air quality data: on many days nothing is recorded because of equipment problems, some records are duplicated, and some are invalid because the equipment failed.
Or take user behavior data: many invalid operations are meaningless for analysis and need to be deleted.
·Selection: accessing data (by label, specific value, Boolean index, etc.)
·Missing value handling: deleting or filling rows with missing data
·Duplicate value handling: detecting and deleting duplicate values
·Whitespace and outlier handling: removing unnecessary spaces and extreme or abnormal values
·Related operations: descriptive statistics, apply, plotting, etc.
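The cleaning steps listed above might look like this in pandas. This is a minimal sketch on made-up air quality readings; the column names and the threshold of 1000 for implausible values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Made-up readings with typical defects: stray spaces, a missing value,
# a repeated record, and an implausible value from a faulty sensor.
raw = pd.DataFrame({
    "station": [" A", "B ", "B", "B", "C", "C"],
    "pm25": [35.0, np.nan, 42.0, 42.0, 9999.0, 40.0],
})

df = raw.copy()
df["station"] = df["station"].str.strip()   # remove unnecessary spaces
df = df.dropna(subset=["pm25"])             # delete rows with missing readings
df = df.drop_duplicates()                   # delete repeated records
df = df[df["pm25"] < 1000]                  # Boolean index: drop abnormal values
df = df.reset_index(drop=True)

print(df)
print(df["pm25"].describe())                # quick descriptive statistics
```

Filling is the alternative to deleting: `raw["pm25"].fillna(raw["pm25"].median())` would keep the row and substitute the median instead.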
Once you start processing data, you need some programming knowledge, but you don’t have to work through a whole Python tutorial; mastering the parts needed for data analysis is enough.
·Basic data types: strings, lists, dictionaries, and tuples; how to create, add to, delete from, and modify each type, and their commonly used functions and methods;
·Python functions: how to define your own functions to build more customized programs, and how to call them;
·Control statements: mainly conditional statements and loop statements, which control the flow of a program and are the basis of automation.
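The three topics above fit into a few lines of plain Python. The values and names below are arbitrary examples.

```python
# Basic data types: create and modify
name = "pm25"                               # string
readings = [35, 42, 40]                     # list
readings.append(38)                         # add an element
readings.remove(42)                         # delete an element
station = {"id": "A", "city": "Beijing"}    # dictionary
station["city"] = "Shanghai"                # modify a value
location = (31.23, 121.47)                  # tuple (cannot be modified)


# Functions: define once, call wherever needed
def average(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# Control statements: a loop combined with a condition
high = []
for value in readings:
    if value > average(readings):
        high.append(value)

print(name.upper(), readings, station, location, high)
```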
△ Python basic knowledge framework
In addition, you need to master NumPy and pandas, two very important Python libraries. Many of our data processing and analysis methods come from them.
If Python is our house and provides the basic structure, then NumPy and pandas are the furniture and appliances in it, providing the various functions.
Of course, even for just these two libraries the official documentation is extensive. It is recommended to master the most commonly used methods first, so that you can solve most practical problems, and to consult the documentation when you run into specific problems later.
·Array creation: creating arrays from existing arrays and from numeric ranges
·Array slicing: selecting elements by slice
·Array operations: adding and deleting elements, changing array dimensions, splitting and joining arrays
·NumPy functions: string functions, mathematical functions, statistical functions
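Each item in this list corresponds to a handful of NumPy calls. The following sketch touches one or two per item, using arbitrary small arrays.

```python
import numpy as np

# Array creation: from an existing sequence and from a numeric range
a = np.array([1, 2, 3, 4, 5, 6])
b = np.arange(0, 12, 2)             # 0, 2, 4, 6, 8, 10

# Array slicing
first_three = a[:3]                 # elements at positions 0, 1, 2

# Array operations: add, delete, reshape, split, join
c = np.append(a, 7)                 # new array with 7 at the end
d = np.delete(c, 0)                 # new array without position 0
m = a.reshape(2, 3)                 # 2 rows x 3 columns
left, right = np.split(a, 2)        # two equal halves
joined = np.concatenate([left, right])

# NumPy functions: mathematical, statistical, string
roots = np.sqrt(np.array([1.0, 4.0, 9.0]))
mean_value = a.mean()
upper = np.char.upper(np.array(["numpy", "pandas"]))

print(b, first_three, d, m.shape, roots, mean_value, upper)
```

Note that `np.append` and `np.delete` return new arrays and leave the original unchanged.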
Recommended NumPy documentation:
Getting started with NumPy http://h5ip.cn/ypHr
NumPy Chinese documentation https://www.numpy.org.cn/
△ numpy knowledge framework
·Data preparation: reading data and creating data tables
·Data viewing: viewing basic information, finding null and unique values
·Data cleaning: handling missing values, duplicate values, and strings
·Data extraction: extracting by label and by position
·Data statistics: sampling, summarizing, and computing basic statistics
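A short pandas session covering the five items above could look like the sketch below. The inline CSV text stands in for a real file, and its columns and values are invented for the example.

```python
import io

import pandas as pd

# Data preparation: read a CSV (inline text stands in for a real file)
csv_text = """order_id,city,product,amount
1,Shanghai,A,100
2,Guangdong,B,250
3,Shanghai,B,
4,Beijing,A,80
4,Beijing,A,80
"""
df = pd.read_csv(io.StringIO(csv_text))

# Data viewing: basic information, null counts, unique values
df.info()
null_counts = df.isnull().sum()
cities = df["city"].unique()

# Data cleaning: fill missing values, drop duplicates, tidy strings
df["amount"] = df["amount"].fillna(0)
df = df.drop_duplicates().reset_index(drop=True)
df["city"] = df["city"].str.upper()

# Data extraction: by label and by position
shanghai = df.loc[df["city"] == "SHANGHAI", ["product", "amount"]]
first_row = df.iloc[0]

# Data statistics: sampling, summary, grouped totals
sample = df.sample(n=2, random_state=0)
summary = df["amount"].describe()
totals = df.groupby("city")["amount"].sum()
print(totals)
```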
Recommended pandas documentation:
10 minutes to pandas http://t.cn/EVTGis7
pandas Chinese documentation https://www.pypandas.cn/
△ pandas knowledge framework
Data analysis and modeling
If you have looked around, you will know that there are many Python data analysis books on the market, but each one is very thick and daunting to study.
Without an overall understanding, you often don’t know why you are learning these operations or what role they play in data analysis.
To reach a sound conclusion in a typical data analysis project, we usually need three types of analysis: descriptive analysis, exploratory analysis, and predictive analysis.
Descriptive analysis mainly describes the data, which requires statistical knowledge such as basic statistics, populations and samples, and common distributions.
From this information we get a first impression of the data, and also many conclusions that cannot be seen by simply looking at it.
So descriptive analysis needs two kinds of knowledge: the basics of statistics, and the tools for computing descriptions, which can be done with the NumPy and pandas knowledge mentioned above.
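As a small example of descriptive analysis with pandas, the sketch below summarizes a made-up series of daily sales figures. One unusually high day pulls the mean well above the median, which is the kind of conclusion that is hard to see by scanning the raw numbers.

```python
import pandas as pd

# Made-up daily sales figures; 310 is an unusually good day
sales = pd.Series([120, 135, 128, 150, 90, 310, 142], name="daily_sales")

print(sales.describe())         # count, mean, std, min, quartiles, max

mean_value = sales.mean()
median_value = sales.median()
spread = sales.std()            # sample standard deviation
skewness = sales.skew()         # positive when the right tail is longer

print(mean_value, median_value, spread, skewness)
```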
Exploratory analysis usually relies on visualization: using charts to look further at the distribution of the data, discover the knowledge in it, and reach deeper conclusions.
It is called “exploration” because many of the conclusions cannot be predicted in advance, and charts make up for the limits of inspecting raw data and simple statistics.
The Seaborn and Matplotlib libraries in Python provide powerful visualization capabilities.
Compared with Matplotlib, Seaborn is simpler and easier to understand, and a basic chart takes only a few lines of code, so it is the better choice for beginners.
If you need customized charts later, you can learn more about Matplotlib.
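For a sense of what a basic distribution plot involves, here is a minimal Matplotlib sketch on synthetic data. It selects the non-interactive `Agg` backend and writes the image to an in-memory buffer so that it runs without a display; both choices are assumptions made for the example, and you can pass a file name to `savefig` instead.

```python
import io

import matplotlib

matplotlib.use("Agg")               # render without opening a window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: 500 values drawn from a normal distribution
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=500)

fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, _ = ax.hist(values, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("count")
ax.set_title("Distribution of synthetic values")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")   # or fig.savefig("distribution.png")
plt.close(fig)
```

With Seaborn installed, a comparable chart is typically a single call such as `sns.histplot(values)`.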
Predictive analysis is mainly used to forecast future data, for example predicting sales over a future period from historical sales data, or predicting future user behavior from user data.
Predictive analysis is somewhat harder. The deeper you go, the more data mining and machine learning knowledge is involved, so it is enough to learn the basics (or to learn more when the need arises).
For example, learn basic regression and classification algorithms and how to implement them with Python’s scikit-learn library. You don’t need to go deep into algorithm selection and model tuning (unless you find them easy).
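A basic regression with scikit-learn takes only a few lines. The sketch below fits `LinearRegression` to synthetic monthly sales that lie exactly on a straight line, then forecasts the next two months. Real data would be noisy and would call for a train/test split and an error metric.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic history: sales start at 12 and grow by 2 units per month
months = np.arange(1, 13).reshape(-1, 1)    # feature matrix, shape (12, 1)
sales = 10 + 2 * months.ravel()             # target values, shape (12,)

model = LinearRegression()
model.fit(months, sales)

# Forecast months 13 and 14
next_months = np.array([[13], [14]])
forecast = model.predict(next_months)

print(model.coef_[0], model.intercept_, forecast)
```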
Recommended data analysis resources:
Books: 《On statistics》, 《Business and economic statistics》
Matplotlib Chinese documentation https://www.matplotlib.org.cn
scikit-learn Chinese documentation http://sklearn.apachecn.org
△ data analysis and modeling knowledge framework
Writing a data report
The data report is the final presentation of the whole data analysis project, and also the summary of the entire analysis process, its conclusions, and the recommended strategies.
So no matter how good the process was, the data report is the product that ultimately determines the value of your analysis.
To write an analysis report, first define the goal of the analysis task: is it to explore the knowledge in the data, to optimize a product, or to predict future data?
For that goal, break the problem down and decide what valuable information must be produced to achieve it.
Decide which data and information are useful for the final decision and deserve further exploration, and which are invalid and can be discarded.
After settling the general content of the output and drawing useful conclusions during the analysis, think about how to bring these scattered pieces together, and what logic to organize them by in order to be as persuasive as possible.
This is the process of building a framework, and it also reflects how you broke the problem down.
Once the framework is built, fill in the existing conclusions and choose suitable forms of expression.
Decide which data to show, which points need more intuitive charts, and which conclusions need detailed explanation, then do the final visual polish, and a complete data analysis report is done.
When writing the analysis report, pay attention to the following points:
1. There must be a framework. The simplest is to follow the logic of the problem breakdown, fill in the content of each branch, and explain it point by point;
2. The choice of data should not be one-sided. Use varied data and make comparisons, otherwise the conclusion may be biased.
The value of the data sets the upper limit of the analysis project, so collect as much useful data as possible for multi-dimensional analysis;
3. Conclusions must be supported by objective data or rigorous logical deduction, otherwise they are not persuasive and easily become self-congratulation;
4. Charts are more intuitive and more readable than text, so make good use of graphical presentation;
5. An analysis report should not only explain the problem but, more importantly, offer suggestions, solutions, and trend forecasts based on it;
6. Read more industry reports and practice more. Later on, business sense matters more than technical skill.
Recommended data report websites:
iResearch – data reports http://report.iresearch.cn/
Umeng+ – data reports http://t.cn/EVT6Z6z
World Economic Forum reports http://t.cn/RVncVVv
PwC industry reports http://t.cn/RseRaoE
△ framework for writing data reports
The above is the complete learning path for Python data analysis. The framework does contain a lot, and it may look more daunting than it really is.
But there is no need to worry. Each of us is born with some sensitivity to data and a talent for analyzing things; before we had analysis methods, we simply relied on experience and intuition.
You don’t have to start over from scratch, learn to code as if you were developing software, or memorize functions and methods as if for an exam. You only need some basic concepts, such as mean, extreme values, sorting, correlation, and median.
These make up most of the content of data analysis, and what you are learning is just the tools to compute them.
Given 100 rows of data, anyone of normal intelligence can reach a basic conclusion without any tools or programming; tools simply improve our efficiency, scalability, and the range of dimensions we can handle, that’s all.
We have packaged the knowledge framework above into a complete set.