[data analysis tool] basic SQL syntax

Time:2022-5-6

01. Write in front

SQL is one of the necessary skills for data analysts. Whether it is a junior analyst or a senior analyst, SQL has become a necessary option in the recruitment conditions of major companies. Why is SQL so important for data analysts? Before answering this question, let’s understand the following questions.

First question, what is SQL?

SQL is the abbreviation of structured query language, which means structured query language. It is a language for querying data in database management system (RDBMS) or changing data in database through RDBMS.

I don’t understand… Can you speak human words? OK, SQL is a language for adding, deleting, modifying and querying data tables or data in a database.

What is a database? “Database is” a warehouse that organizes, stores and manages data according to data structure “. Is a long-term storage in the computer, organized, shareable, unified management of a large number of data collection. Human talk is a warehouse that stores data according to a certain organizational structure. Our common Oracle, MySQL and SQL server are all databases, but some are commercial databases and some are open source and free.

Second question, why do you need to learn SQL for data analysis?

In fact, after understanding the relationship between database and SQL, this problem is nonsense. A skillful woman cannot make bricks without straw. The first step of data analysis must be to have data. Where does the data come from? It must be extracted from the database. SQL is such a convenient and universal data retrieval tool, because the SQL syntax of almost all databases is similar. Even now, the syntax of hive SQL used in big data analysis is 99% the same as SQL, so learning SQL basically mastered the data retrieval methods of all databases.

The third question is how to learn SQL?

Do you think SQL is just a simple data retrieval tool and you can pull the data out of the database?

1. SQL can not only get data, but also provide rich functions, which can do data processing such as data cleaning and conversion. Moreover, SQL can easily sum, count, de count, average and other operations on data in different dimensions like excel pivot table, so as to analyze the data. All this can be realized in just a few lines of SQL code.

2. If there are a lot of complex data, which are stored in different sheets like excel and need to be summarized and analyzed, how to deal with it? The strength of SQL is that it can easily connect different data according to a certain association. This association can be inner join (find the intersection of two tables), left join (intersection and left table all), right join (right join intersection and right table all), and full join outer join (find the Union of two tables), Various data connections can be realized through different association conditions, and finally the connected data can be analyzed.

3. From the above two points of view, there seems to be no difference between SQL and excel in function? Excel can also do data cleaning, PivotTable can also do aggregation operations such as summation and counting, and Excel’s power pivot can also realize the connection between multiple tables. In fact, in addition to the above functions, SQL also provides a very powerful function: window function. What’s the use of window function? If we want to calculate the ranking of each person under a specific group, the year-on-year and month on month sales, and the cumulative sales up to the end of each day, these problems often encountered in data analysis and can not be well solved by the basic SQL statements, the window function shows its power. Therefore, the window function is also an important standard to judge whether you are a basic SQL player or a high-level player. It is also one of the most favorite contents in the data analysis interview.

So for students who want to do data analysis, how can they quickly and efficiently master SQL, a sharp tool for data analysis? According to the previous introduction, you can follow this learning path to improve your SQL level:

[data analysis tool] basic SQL syntax

1. Basic syntax of SQL: first, be familiar with the basic syntax of SQL. For data analysis, focus on data query select, including: how to use where for data filtering, skillfully use arithmetic operator (+ – * /) and logical operator (and / or / not) for field calculation and condition filtering, and use aggregate functions such as sum, count and AVG in combination with group by for summary analysis under different dimensions, How to filter the aggregated results with having clause and sort the final query results with order by. The most important point in this part is to clarify the difference between the execution order of SQL statements and the writing order, which is very helpful to understand the execution process of SQL. As the first article in the SQL series, we will focus on it in this article.

2. Common functions of SQL: on the basis of mastering the basic syntax of SQL, we should master some functions commonly used in data analysis, including but not limited to date functions, string functions, numerical operation functions, etc. Skilled use of these functions can help us do data cleaning, conversion and other data processing work efficiently. We will focus on this part in the second part of the SQL series.

3. SQL subquery and table connection: on the basis of previous, we also need to master the connection between subquery and multiple tables. In data analysis, we often need to analyze multiple associated tables. For simple queries, we can use nested sub queries to solve them. However, if there are many tables involved and the association relationship between tables is complex, we need to use table connection to connect all tables together according to a certain association relationship. The common connection types include inner join, left join, right join and full join. We will focus on this part in the third article in the SQL series.

4. SQL window function: the first three parts basically cover the basic requirements of data analysis for SQL, but in practical work, we have some complex and common scenarios that cannot be solved well by using basic syntax, such as year-on-year / month on month sales of each month, cumulative sales up to each day, and sales ranking of each commodity under its category. The window function is created for these scenarios, If you can master window functions, you will basically not encounter any SQL problems in data analysis. As the final part of the SQL series, this part will be explained in the fourth article.

In addition, we have summarized some common Excel skills in data analysis. If you are interested, you can read the previous articles, including:

[data analysis tool] practical operation of data analysis cases, teach you to learn powerbi hand in hand!

[data analysis tool] can excel also play with big data analysis? It’s time to sacrifice the super pivot table power pivot!

[data analysis tools] PivotTable: rookies can also do data analysis (tutorial at the end of the article)

[data analysis tool] Excel functions that must be known and mastered in data analysis (get excel tutorial at the end of the text)

[data analysis tool] basic SQL syntax

02. Basic SQL syntax

1. SQL basic operation

The learning stages of SQL have been divided from basic to high-level. Now let’s talk about the first stage of learning SQL: basic SQL syntax.

SQL is a language for adding, deleting, modifying and querying data tables or data in a database. According to different operation objects, we divide the basic operations of SQL into the following categories:

DDL (data definition language)

It is used to create, delete or modify objects such as databases and data tables in databases. DDL contains the following instructions.

Create: create database, data table and other objects

Drop: delete database, data table and other objects

Alter: modify the structure of database, data table and other objects

DML (data manipulation language)

It is used to query, add, modify or delete records in the data table. DML contains the following instructions.

Select: query the data in the data table

Insert: inserts new data into the data table

Update: modify the data in the data table

Delete: delete the data in the data table

DCL (data control language)

Used to confirm or cancel changes to the data in the database. In addition, you can also set the permissions of database users. DCL contains the following instructions.

Commit: confirm the changes made to the data in the database

Rollback: cancels changes to data in the database

Grant: Grant user operation permission

Revoke: cancels the user’s operation permission

As a data analyst, our focus is to extract the existing data and analyze the business value behind the data. Therefore, most of the time, we only need to use the data query select, and do not need or allow the addition, deletion, modification and other operations on the database and data table. After all, the company is also afraid of you deleting the database and running away, hahaha! Therefore, our article will also focus on how to use SQL for efficient data extraction and analysis.

Now let’s use a practical case to manually create databases and data tables, manually insert some data, and then fully explain the basic syntax of SQL from beginning to end based on these data. We hope that small partners who want to learn SQL can also start to learn and operate while learning. Learning in practice is the most efficient learning method. The database we use here is mysql, and the database tool we use is MySQL workbench 8.0 CE. We can download and install the specific databases and tools on the MySQL official website, and we won’t expand them any more.

2. Create database

In order to facilitate the following demonstration, we first need to establish a database to store the data tables we need. Open MySQL workbench 8.0 CE, as shown in the figure below. The main functions are as follows:

[data analysis tool] basic SQL syntax

Enter the SQL code in the middle space:

–Note: this data is only for demonstration, and there is no need to pay attention to the authenticity of the data

–1. Create a sales database to store the data of the presentation

CREATE DATABASE Sales;

Note that – in the code is a comment, which means that the following content is only a description for a better understanding of the following code and will not be executed. Click the execute button in the figure to execute the code. After successful execution, it will be displayed in the query result area, and one more database sales can be found in the database preview area.

[data analysis tool] basic SQL syntax

3. Create table

We create a product sales detail under the sales database, which is used to record the sales of each commodity every day, including the following fields:

USE Sales; — Go to the sales database and do the following

CREATE TABLE Product

(

product_ ID – char (4) not null, — Product ID, character type char

product_ Name – varchar (100) not null, — product name, character type varchar

product_ Category – varchar (32) not null, — product category, character type varchar

sale_ Price – int, — product selling price, integer type int

cost_ Price int, — product cost price, integer type int

sale_ Date # date — sales date, date type

)

4. Insert data into table (insert)

The product table used to store the sales records of each product has been created. Next, we can insert data into the table. These data are only used for demonstration, and have no practical significance and authenticity.

Insert into product values (‘0001 ‘,’ iPhone ‘,’ mobile ‘, 8000, 6500,’ 2020-09-20 ‘);

Insert into product values (‘0002 ‘,’ MacBook Pro ‘,’ computer ‘, 9500, 8000,’ 2020-09-11 ‘);

Insert into product values (‘0003 ‘,’ Huawei mate40 Pro ‘,’ mobile phone ‘, 6000, 4800,’ 2020-09-11 ‘);

Insert into product values (‘0004 ‘,’ Sony TV ‘,’ TV ‘, 9000, 6800,’ 2020-09-20 ‘);

Insert into product values (‘0005 ‘,’ TCL TV ‘,’ TV ‘, 6800, 5000,’ 2020-01-15 ‘);

Insert into product values (‘0006 ‘,’ Skyworth TV ‘,’ TV ‘, 5000, 3000,’ 2020-09-20 ‘);

Insert into product values (‘0007 ‘,’ Xiaomi TV ‘,’ TV ‘, 3800, 2500,’ 2008-04-28 ‘);

Insert into product values (‘0008 ‘,’ Lenovo notebook ‘,’ computer ‘, 4000, 3000,’ 2020-11-11 ‘);

After successful insertion, the query result area will be marked with success.

5. Query data (select)

Use the select clause to query the required columns from the table.

1. In order to facilitate us to understand the meaning of each column, we can set aliases for columns through as;

2. We can add a new constant column constant, that is, all values of this column are constants specified by us;

3. If we want to sort the query results according to a certain field, we can use order by desc / ASC to sort in descending / ascending order;

4. In addition, in order to prevent the amount of data we query from being too large, we can use limit 5 to limit the query results to 5 rows.

For example, we want to see 5 records in ascending order of product ID.

SELECT

product_ ID as’ product ID ‘,

product_ Name as’ product name ‘,

cost_ Price as’ cost price ‘,

‘I’m a constant’ as constant

FROM Product

ORDER BY product_id ASC

limit 5;

The results are as follows:

[data analysis tool] basic SQL syntax

6. Specify query criteria (where)

In many cases, we only need to select part of the data rather than the data of the whole data table, so at this time, we need to set filter conditions during query to filter out the data we need.

For example, we want to view the product names of all “mobile phone” products in product.

SELECT DISTINCT product_name, product_category

FROM Product;

WHERE product_ Category = ‘mobile phone’;

The results are as follows:

[data analysis tool] basic SQL syntax

7. Arithmetic and comparison operators

Sometimes we need to add, subtract, multiply and divide (+ – * /) the fields in the data table to generate a new field. For example, if we want to calculate the net profit in each transaction, we can use the sales price sale_ Price minus cost_ Price gain.

In addition to the arithmetic operator (+ – * /), when we need to set filter conditions to query data, we often need to use the comparison operator shown below to make some complex filter conditions.

Equals=

Not equal to < >

Greater than >

Greater than or equal to >=

Less than < < span = “” >

Less than or equal to < = < span = “” >

Between two

We use an example to combine arithmetic operator and comparison operator.

For example, we want to view all sales records with a profit of no less than 500.

SELECT

product_id,

product_name,

sale_price,

cost_price,

sale_ price-cost_ Price as’ profit ‘– generate a new field through arithmetic operation

FROM Product

where sale_price-cost_price>=500;

The results are as follows:

[data analysis tool] basic SQL syntax

8. Logical operators (not, and, or)

In the above case, we need to set the logical and and and not operators for single query conditions. In the actual case, we need to set multiple logical operators for single query conditions.

NOT

The not operator is valid only when the current query condition is not valid, that is, it takes out the records that do not meet the current condition, which is equivalent to taking the opposite.

For example: we want to check the sales record with the selling price no more than 5000 yuan.

SELECT

product_id,

product_name,

sale_price,

cost_price

FROM Product

WHERE NOT sale_ price > 5000; — (i.e. sale_price < = 5000)

The results are as follows:

[data analysis tool] basic SQL syntax

AND

When the query conditions on both sides of the and operator are true, the whole query condition is true, which means “and”.

For example, we would like to view the sales data with the selling price of more than 6000 (inclusive) under the “TV” product category.

SELECT

product_id,

product_name,

sale_price,

cost_price

FROM Product

WHERE product_ Category = ‘TV set’

AND sale_price >= 6000;

The results are as follows:

[data analysis tool] basic SQL syntax

OR

When one of the query conditions on both sides of the or operator is true, the whole query condition is true, which means “or”.

For example, we want to check the sales records of products with the category of “mobile phone” or “computer”.

SELECT

product_id,

product_name,

sale_price,

cost_price

FROM Product

WHERE product_ Category = ‘mobile phone’

OR product_ Category = ‘computer’;

The results are as follows:

[data analysis tool] basic SQL syntax

9. Aggregate function (sum / count /…)

According to the above steps, we can set different screening conditions according to the needs, screen out the data we need, and complete the data extraction. Next, we can do some aggregation operations and simple analysis for the extracted data.

Five commonly used aggregate functions:

Count: calculate the number of records (rows) in the table

Sum: calculate the total value of the data in the value column in the table

AVG: calculate the average value of the data in the value column in the table

Max: find the maximum value of data in any column in the table

Min: find the minimum value of data in any column in the table

However, it should be noted that:

Count (*) calculates the number of rows of all data (including null)

Count (column) calculates the number of rows in a column (excluding null)

Count (distinct column) calculates the number of rows after deduplication

The sum / AVG function can only be used for columns of numeric type, while the max / min function can be applied to columns of any data type in principle.

Let’s show the use of each aggregation function through the following example.

SELECT

Count (*), — calculate the number of rows of all data (including null)

Count (product_name), — calculate the number of rows in a column (excluding null)

Count (distinct product_name), — calculate the number of rows after deleting duplicate data

Sum (sale_price) — calculate the sum of all rows in a column. Here is the total sales

AVG (sale_price), — calculate the average value of all rows in a column. Here is the customer unit price

Max (sale_date), — calculate the maximum value of all rows in a column. Here is the latest sales date

Min (sale_date) — calculate the minimum value of all rows in a column. Here is the earliest sales date

FROM Product;

10. Group tables (group by)

Group by can group data like cutting a cake. Usually, group by is used in combination with aggregation function to calculate aggregation values under various groups. For example, you can select product by product category_ Category and “sales date” sale_ Date group the data and then summarize it to calculate the daily sales of different commodity types.

When using aggregate functions and group by clauses, you should pay attention to the following four points:

1. When using the group by clause, column names other than aggregate keys cannot appear in the select clause. Only three elements can exist in the select clause: constants, aggregate functions, and column names specified in the group by Clause (that is, aggregate keys).

2. Field aliases defined in the select clause cannot be used in the group by clause.

3. The results of the group by clause are displayed unordered.

4. Execution sequence of select statement when group by and where are used together:

FROM → WHERE → GROUP BY → SELECT

For example, we want to check the daily sales of different commodity types since July 1, 2021.

You can filter the sales date sale first_ The data of date after July 1, 2021 is displayed by “product type”_ Category and “sales date” sale_ Date groups the data, and then the sale_ Price for summation.

SELECT product_category,

sale_date,

SUM(sale_price)

FROM Product

WHERE sale_date > ‘2021-07-01’

GROUP BY product_category,

sale_date;

The results are as follows:

[data analysis tool] basic SQL syntax

11. Filtering aggregation results (having)

Sometimes we need to filter the aggregated data. What do you mean? We know that the where clause can be used to filter the original data and take out the qualified records in the data table. If you need to filter the aggregated records, you need to use the having clause, so the having clause should be written after the group by clause.

When using the having clause to filter aggregation results, you should pay attention to the following points:

1. Because the having clause filters the aggregated results, it should be written after the group by clause.

2. The aggregate value after the having clause cannot use alias, but can only use aggregate formula. As for why, we will talk about it in the SQL execution order section next year.

For example, in the previous step, we have calculated the daily sales of different commodity types since July 1, 2021. Now we need to filter out the records with sales of more than 10000 on this basis.

SELECT product_category,

sale_date,

SUM(sale_price) as GMV

FROM Product

WHERE sale_date > ‘2021-07-01’

GROUP BY product_category,

sale_date

Having sum (sale_price) > 10000 — filter the aggregation results. Gmv alias cannot be used

The results are as follows: it can be seen that records with sum (sale_price) below 10000 have been filtered out.

[data analysis tool] basic SQL syntax

12. Query result sorting (order by)

In order to meet our habit of viewing data, we often need to sort the query results, so we need to use order by.

When using order by, you should pay attention to the following points:

1. Use the keyword ASC / desc after the column name in the order by clause to sort in ascending / descending order. The default is ascending order.

2. The order by clause is usually written at the end of a select statement.

3. When there are multiple sorting columns in the order by clause at the same time, the rule is to use the key on the left first, and so on.

4. Aliases for fields and aggregate values defined in the select clause can be used in the order by clause.

Why should order by be written at the end of the select statement? Why can order by use field aliases and aggregate functions? Here comes a very key problem: the execution order of SQL statements. The writing order of SQL statements is as follows:

SELECT → FROM → WHERE → GROUP BY → HAVING → ORDER BY

But the actual execution sequence:

FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY

It can be seen here that order by is the last step in the execution. The aggregation operation has been completed in the group by phase, and the alias has taken effect in the select phase. Therefore, the alias of the aggregate value can be used in this step. Of course, it is also explained here by the way: why can’t the alias be used when using having to filter the aggregation results, because the alias takes effect in the select phase, In the having phase, there is no alias at all.

For example, we show the results of the previous step in ascending order of date and descending order of sales, and pay attention to the difference in the use of aliases in having and order by.

SELECT product_category,

sale_date,

SUM(sale_price) as GMV

FROM Product

WHERE sale_date > ‘2021-07-01’

GROUP BY product_category,

sale_date

Having sum (sale_price) > 10000 — filter the aggregation results. Gmv alias cannot be used

ORDER BY sale_ Date, Gmv desc — sort by date in ascending order and sales in descending order

The results are as follows: first arrange in ascending order according to the date, and then arrange in descending order according to Gmv in the case of the same date

[data analysis tool] basic SQL syntax

The above is the content of data analysis tools – SQL basic syntax. For some data analysis tools, please read the historical articles. The articles of more data analysis tools are constantly updated. Please look forward to it. If you feel good, you are also welcome to share, praise and collect it~

Recommended Today

Custom starter for springboot

1、 Springboot starter mechanism Unified management, unified introduction, reducing cumbersome configuration and resolving version conflicts 2、 Custom starter 1. There are many configuration modules independent of business in development, which can be used when referenced repeatedly in POM. For example: Dynamic data source. Login module. Realize log facet based on AOP technology. …… 2. Customize […]