Post calculation script of SQL

Time:2021-3-16

In most cases, we can complete the database calculation with SQL (stored procedure), but if we encounter some complex operations that SQL is not good at, we can only read the data out of the database with other programming languages, and then complete the calculation outside the database. Such programming languages often appear in the form of simple scripts, which we call SQL post calculation scripts here.

The operations SQL is not good at mainly include complex set calculation, ordered calculation, association calculation, multi-step calculation and so on. SQL aggregation is not thorough enough, and there is no explicit set data type, which makes it difficult to reuse the sets generated in the calculation process. For example, after grouping, it must be forced to summarize, and the subsets based on grouping cannot be calculated any more. SQL is designed based on unordered set theory, and it is very troublesome to deal with inter-bank group and ranking and other orderly operations. It often uses join or sub query to generate serial number temporarily, which is not only difficult to write but also easy to write The operation efficiency is very low. SQL does not support the reference of records, so it can only use subqueries or join statements to describe the association relationship. Once there are many levels or self association, the code will be extremely complex. SQL itself does not advocate multi-step code, and often forces programmers to write long statements with many levels nested. Although stored procedures can solve this problem to a certain extent, sometimes the actual environment does not allow me We use stored procedures, such as DBA strictly control the permissions of stored procedures, old database and small database do not support stored procedures, and the debugging of stored procedures is also very inconvenient, which is not very suitable for writing calculation with procedures.

In addition to the above complex operations, there are also some cases where SQL post calculation scripts are used. For example, computing logic needs to be migrated between different kinds of databases, involving non relational databases; input source or output target is not only database, but also excel, text and other files; it may also carry out mixed computing among multiple databases. These will involve the calculation outside the library and the post calculation script of SQL.

For the post calculation script of SQL, the most important function of course is to implement the complex operations that SQL is not good at. In addition, it is better to have some more advanced features, such as computing files, non relational databases and other diverse data, can handle a large amount of data, computing performance can not be too slow, etc. Of course, the most basic thing is to easily support reading and writing database, so as to realize the post calculation of SQL.

Common scripts used for post SQL calculation include Java, python panda and esproc. Let’s take a closer look at these scripts and see the differences in their abilities in post SQL calculation.

JAVA

C + +, Java and other high-level languages are omnipotent in theory, and naturally they can realize the operations that SQL is not good at. Java supports generics, and it can realize complex set operations. Java array has serial number, which can realize ordered operation. Java supports object references, which can be used to represent relationships, and there is no problem with association operation. Java supports procedural syntax such as branch and loop, and can easily realize multi-step complex operation.

However, Java lacks structured class library, even the simplest structured computing must be implemented by hard coding, and the most basic structured data types must be established manually, which will lead to tedious code.

Take an example of orderly calculation: find the longest consecutive rising days of a stock. The library table AAPL stores the stock price information of a stock. The main fields are trading date and closing price. Please calculate the longest consecutive days of the stock rising.

According to the natural way of thinking to achieve this task: cycle the stock records with orderly dates. If this record is up compared with the previous record, add 1 to the consecutive days of up (initially 0). If it is down, compare the consecutive days of up with the current maximum consecutive days of up (initially 0), select the new current maximum consecutive days of up, and then add 1 to the consecutive days of up The number of days to rise is 0. Until the end of this cycle, the current maximum continuous rising days is the final maximum continuous rising days.

SQL is not good at orderly calculation. It can’t be realized with the above natural ideas. It can only use some strange and difficult skills: divide the orderly stock records by date into several groups, and the continuous rising records into the same group. That is to say, if the stock price of one day is higher than that of the previous day, it will be divided into the same group with the previous day’s records. If it falls, it will start a new group. Finally, look at the largest number of members in all groups, that is, the largest number of consecutive days.

The specific SQL is as follows:

select max(continue_inc_days)
from (
select count(*) continue_inc_days from (
select sum(inc_de_flag) over(order by transDate) continue_de_days from (
select transDate,case when price>LAG(price) over(order by transDate) then 0 else 1 end inc_de_flag from AAPL)) group by continue_de_days)

This section of SQL is not very long, but it is nested in four layers. The skills used are strange and difficult to understand. It is difficult for ordinary people to come up with such code.

When implemented in Java, you can return to the natural way of thinking

package stock;
 import java.sql.*;
 public class APP{
     public static void main(String[] args) throws SQLException,                     ClassNotFoundException   {              
    Connection con = null;             
    Class.forName("com.mysql.cj.jdbc.Driver");              
    con = DriverManager.getConnection(                                         "jdbc:mysql://127.0.0.1:3306/mysql?&useSSL=false&serverTimezone=UTC",                                         "root", "");              
    String dql = "select   * from AAPL order by transDate";                  
    PreparedStatement stmt = con.prepareStatement(dql,                               
    ResultSet.TYPE_SCROLL_INSENSITIVE,       
    ResultSet.CONCUR_READ_ONLY);                  
    ResultSet aapl = stmt.executeQuery();              
    int continue_inc_days = 0;              
    int max_continue_inc_days = 0;              
    float last_Price = 0;              
    while (aapl.next())         
        {float price = aapl.getFloat("price");                     
        if (price >= last_Price) {                           
            continue_inc_days++;                     
            } else{                           
                if (continue_inc_days >= max_continue_inc_days) {                                  
                max_continue_inc_days = continue_inc_days;                                   
                }                           
            continue_inc_days = 0;                     }                     
            last_Price = price;              }                      
                System.out.println("max_continue_inc_days=" + max_continue_inc_days);              if (con != null)                     
                con.close();       }}

The later part of the code is the idea mentioned above, as long as a layer of loop can be completed.

However, we also found that although the code written by Java is simple and not difficult, the code is obviously lengthy.

The complexity of this problem is not high, and it does not involve the common grouping, connection and other structured data calculation. Otherwise, the amount of code will be even more amazing. Limited to space, Java will not be used as an example.

In terms of high-level functions such as data diversity, performance optimization and big data processing, Java is also characterized by “can be realized, but too cumbersome”, which will not be repeated here.

Java is an excellent enterprise level common language, but the other meaning of common is often unprofessional. In other words, Java lacks a professional structured computing class library, and its code is tedious and tedious, which is not an ideal SQL post computing script.

Python pandas

Python has a simple syntax and many third-party function libraries, including pandas, which serves structured computing. Because of this, pandas is often used as post calculation script of SQL.

As a structured computing function library, pandas is much better than Java in simplifying complex SQL operations.

For example, in the same orderly operation of “seeking the longest consecutive rising days”, pandas code is as follows:

import pymysql
import pandas as pd
conn =   pymysql.connect(    
    host = "127.0.0.1",    
    port = 3306,       
    user = "root",    
    password = "",    
    database = "mysql",
    )
aapl = pd.read_sql_query("select price from AAPL order by   transDate", conn)
continue_inc_days=0 ; max_continue_inc_days=0
for i in aapl['price'].shift(0)>aapl['price'].shift(1):    
continue_inc_days   =0 if i==False else continue_inc_days +1    
max_continue_inc_days   = continue_inc_days if max_continue_inc_days   < continue_inc_days else max_continue_inc_days
print(max_continue_inc_days)
conn.close() |

In the above code, pandas provides a data structure dataframe for structured computing. This data structure naturally has serial number, which can simplify the code in the orderly operation, and is easier to cross line fetching than Java. In addition, the encapsulation of pandas for SQL fetching is also very compact, which is shorter than Java code.

Another example is set calculation: one line is split into multiple lines. The library table TB has two fields, in which anomalies stores the string with space as the separator. It is necessary to split anomalies by space, so that each ID field corresponds to a member.

Data before processing

ID ANOMALIES
1 A1 B1 C1 D1
2 A2
3 A3 B3 C3
4 A3 B4 D4

Processed data:

ID ANOMALIES
1 A1
1 B1
1 C1
1 D1
2 A2

The core code of pandas is as follows (omitting database input and output, the same below)

…
 split_dict = pd.read_sql_query("select * from tb", conn)
 split_list = []
 for key,value in split_dict.items():    
     anomalies = value[0].split(*' '*)    
    key_array = np.tile(key,len(anomalies))    
    split_df =   pd.DataFrame(np.array([key_array,anomalies]).T,columns=[*'ID'*,*'ANOMALIES'*])    
    split_list.append(split_df)
    df = pd.concat(split_list,ignore_index=True)

In the above code, pandas uses the set function to split the string into dataframes directly, and then uses the set function to merge multiple dataframes directly. The code is very concise. Although Java can achieve similar functions, it needs to be implemented manually, and the code is much more cumbersome.

As a structured computing function library, pandas code is indeed more concise than Java, but this is only limited to the case of limited complexity. If the complexity is further improved, pandas code will become lengthy and difficult to understand.

For example, this example involves set calculation + ordered calculation: continuous duty. The duty table records the daily duty situation. A person is usually on duty for several working days, and then changes. Now please calculate the continuous duty situation of each person according to duty. The data structure is as follows:

Before treatment (duty)

Date Name
2018-03-01 Emily
2018-03-02 Emily
2018-03-04 Emily
2018-03-04 Johnson
2018-04-05 Ashley
2018-03-06 Emily
2018-03-07 Emily

After treatment

Name Begin End
Emily 2018-03-01 2018-03-03
Johnson 2018-03-04 2018-03-04
Ashley 2018-03-05 2018-03-05
Emily 2018-03-06 2018-03-07

| | |

The core panda code is as follows:

……
 duty = pd.read_sql_query("select date,name from duty order by   date", conn)
 name_rec = ''
 start = 0
 duty_list = []
 for i in range(len(duty)):    
    if name_rec == '':        
        name_rec = duty['name'][i]    
    if name_rec != duty['name'][i]:        
        begin =   duty['date'].loc[start:i-1].values[0]        
        end =   duty['date'].loc[start:i-1].values[-1]              
        duty_list.append([name_rec,begin,end])        
        start = i        
        name_rec = duty['name'][i]
begin = duty['date'].loc[start:i].values[0]
end = duty['date'].loc[start:i].values[-1]
duty_list.append([name_rec,begin,end])
duty_b_e = pd.DataFrame(duty_list,columns=['name','begin','end'])

The above has omitted the database output process, you can see that the code is still a bit cumbersome.

Another example is set calculation + multi-step operation: Calculation of installment loan details. The library table loan records loan information, including loan ID, total loan amount, period number and annual interest rate

LoanID LoanAmt Term Rate
L01 100000 5 4.8
L02 20000 2 5.0
L03 500000 12 4.5

Need to calculate the details of each period, including: current repayment, current interest, current principal, residual principal. The results are as follows

LoanID LoanAmt Payment Term Rate interest princepal princeplebalance
L01 100000 20238.13 5 4.75 395.83 19842.29 80159.71
L01 100000 20238.13 5 4.75 317.29 19920.83 60236.87
L01 100000 20238.13 5 4.75 238.44 19999.69 40237.18
L01 100000 20238.13 5 4.75 159.27 20078.85 20158.33

The core code of pandas is as follows

loan_data = pd.read_sql_query("select loanID,LoanAmt,Term,Rate from   loan", conn)
 loan_data['mrate'] = loan_data['Rate']/(100*12)
 loan_data['mpayment'] =   loan_data['LoanAmt']*loan_data['mrate']*np.power(1+loan_data['mrate'],loan_data['Term'])   \                          /(np.power(1+loan_data['mrate'],loan_data['Term'])-1)
 loan_term_list = []
 for i in range(len(loan_data)):    
     loanid =   np.tile(loan_data.loc[i]['LoanID'],loan_data.loc[i]['Term'])    
    loanamt = np.tile(loan_data.loc[i]['LoanAmt'],loan_data.loc[i]['Term'])    
    term =   np.tile(loan_data.loc[i]['Term'],loan_data.loc[i]['Term'])    
    rate =   np.tile(loan_data.loc[i]['Rate'],loan_data.loc[i]['Term'])    
    payment =   np.tile(np.array(loan_data.loc[i]['mpayment']),loan_data.loc[i]['Term'])    
    interest = np.zeros(len(loanamt))    
    principal = np.zeros(len(loanamt))    
    principalbalance  = np.zeros(len(loanamt))    
    loan_amt = loanamt[0]    
    for j in range(len(loanamt)):        
        interest[j] =   loan_amt*loan_data.loc[i]['mrate']        
        principal[j] = payment[j]   - interest[j]        
        principalbalance[j] =   loan_amt - principal[j]        
        loan_amt =   principalbalance[j]    
        loan_data_df =   pd.DataFrame(np.transpose(np.array([loanid,loanamt,term,rate,payment,interest,principal,principalbalance])),                  
            columns =   ['loanid','loanamt','term','rate','payment','interest','principal','principalbalance'])    
        loan_term_list.append(loan_data_df)
loan_term_pay = pd.concat(loan_term_list,ignore_index=True)
print(loan_term_pay)

As you can see, Python is much better than Java in simplifying complex SQL operations, but it is only limited to simple cases. If the requirements are more complex, the code will become lengthy and difficult to understand. The reason for this phenomenon may be that pandas is only a third-party function library, which can not get the underlying support of Python from the syntax level, and the design is not professional enough.

The lack of professionalism of pandas is also reflected in the diversity of data. Pandas does not develop a unified interface for all kinds of data sources. It only supports common local files, but does not support complex data sources, such as Hadoop and mongodb. Users have to find third-party (actually fourth party) function libraries and write complex access codes. Pandas doesn’t even have a unified database interface. For example, MySQL has several third-party function libraries, including pymysql, Sqlalchemy, and MySQL dB. However, this problem is not serious for most desktop application scenarios, and common database Python can simply support it.

For the problem of multi-source hybrid Association, as long as the data of various data sources can be read out, it can be basically realized. Pandas’s performance in this aspect is basically satisfactory. However, as mentioned above, pandas is easy to implement for simple mixed association relations, but once there are more complex association operations, the implementation process will become difficult.

In terms of large amount of data, pandas’s performance is not satisfactory. There is no cursor data type in pandas, which leads to that when solving the calculation of large amount of data, it must be hard coded to achieve circular data retrieval, instead of automatic internal and external memory exchange, so the code is extremely cumbersome. Please refer to《How Python Handles Big Files

The performance of pandas is also average, but it is basically enough. It is often criticized that multithreading is parallel, which is difficult to implement in Python. For example, database IO is generally slow, but parallel access can be used when the database is not busy to improve the access performance. Python has to use other third-party function libraries to achieve parallel, the code is extremely cumbersome, and there is no guarantee in the expression efficiency, execution efficiency, stability and other convenience.

Although pandas is a structured computing function library, it is not easy to use.

esProc

Similar to pandas, esproc also has rich structured computing functions. Different from pandas, esproc is a product supported by commercial companies, a professional structured computing language, not a third-party library function of open source community, and there is no loose superior organization. Esproc can design a consistent structured computing syntax from a global perspective, and design a unified structured data type from the bottom to the top, so that the functions can be combined with maximum flexibility, so as to quickly and easily solve the problems encountered in the post SQL calculation.

As a professional structured computing language, esproc is good at simplifying complex SQL operations. For example, to find the longest consecutive days of rising, esproc only needs two lines

A
1 [email protected](“select price from AAPL order by transDate”)
2 =a=0,A1.max(a=if(price>price[-1],a+1,0))

The above code uses ordinal table and loop function. Ordinal table is a data structure dedicated to structured calculation, which can be easier to cross row fetching than pandas, and it can be more convenient to realize orderly calculation. Loop function can avoid most for statements (or for should be used in complex cases), which can greatly simplify the code. In addition, esproc encapsulates SQL data more compactly and is shorter than pandas code.

For another example, if a line is split into multiple lines, the esproc code is still short

A
1 [email protected](“select * from tb”)
2 =A1.news(ANOMALIES.split(“ “);ID,~: ANOMALIES)

Even if the requirement complexity is further improved, esproc can still be easily implemented.

For example, in the case of continuous duty, esproc code is much shorter than pandas code

A
1 =orcl.query(“select date,name from duty order by date”)
2 [email protected](name)
3 =A2.new(name,.m(1).date:begin,.m(-1).date:end)

For example, esproc is also shorter than pandas in calculating the installment loan details

A
1 [email protected](“select loanID,LoanAmt,Term,Rate from loan”)
2 =A1.derive(Rate/100/12:mRate,LoanAmtmRatepower((1+mRate),Term)/(power((1+mRate),Term)-1):mPayment)
3 =A2.news((t=LoanAmt,Term);LoanID, LoanAmt, mPayment:payment, Term, Rate, t* mRate:interest, payment-interest:principal, t=t-principal:principlebalance)

Esproc is usually easy to implement complex operations that pandas is difficult to implement, and the code is not difficult. For example, the task involving multi-step algorithm + set operation + dynamic table structure: insert the sub table horizontally into the sub table.

Source table relationship

Order (main table) Orderdetail (subtable)
ID(pk) ß- OrderID(PK)
Customer Number(pk)
Date Product
Amount

Target results

ID Customer Date Product1 Amount1 Product2 Amount2 Product3 Amount3
1 3 2019-01-01 Apple 5 Milk 3 Salt 1
2 5 2019-01-02 Beef 2 Pork 4
3 2 2019-01-02 Pizza 3

Esproc can greatly simplify this Code:

A B
1 [email protected](“select * from OrderDetail left join Order on Order.ID=OrderDetail.OrderID”)
2 =A1.group(ID) =A2.max(.count()).(“Product”+string()+”,”+”Amount”+string(~))[email protected]()
3 =create(ID,Customer,Date,${B2}) >A2.run(A3.record([ID,Customer,Date]

As a professional structured computing language, esproc can not only greatly simplify the complex operations SQL is not good at, but also have more advanced ability to solve some special cases.

In terms of diverse data, esproc supports multiple file formats and complex data sources, such as Hadoop, mongodb, etc. Furthermore, with the same code, data analysts can calculate data from different sources, including databases and non databases.

It is similar to proesc in terms of data hiding and data processing.

For example, the library table orders records a large number of E-Commerce orders, all of which will be read out of memory. Now we need to calculate the three orders with the largest sales volume of each salesperson outside the library. Esproc code is as follows:

A
1 [email protected](“select sellerid,amount from orders”)
2 =A1.groups(sellerid;top(3; -amount):top3)
3 =A2.conj(top3)

Esproc also provides many easy-to-use methods for performance optimization. For example, the data of orders table is approximately equal every month. Please perform multi thread parallel query by month, so as to greatly improve the query performance. Esproc only needs the following code:

A B
1 fork to(1,12) /12 thread parallel
2 =connect(“my”)
3 [email protected](“select * from orders where month(orderdate)=?”,A1)
4 =A1.conj() /Merge query results

After the previous comparison, we can find that esproc has rich structured functions, is a professional structured computing language, can greatly simplify the complex operation logic that SQL is not good at, and is a more ideal SQL post calculation script.

This work adoptsCC agreementReprint must indicate the author and the link of this article