Overview of related technologies of automatic generation of logic code

Time:2021-1-15


Author Miao Jing


The author, Zhou Tingting (alias: Miaojing; WeChat: weekendingting), is a senior front-end technology expert in the Taobao technology department of Alibaba. She is the head of imgcook, the platform that intelligently generates code from design drafts, a core member of the front-end intelligence direction of the Alibaba economy, and a member of F(x) Team, the first AI front-end team in the industry. Last year we focused on intelligently generating front-end view code from design drafts, while continuing to experiment with the intelligent generation of logic code. This article gives an overview of the technologies related to the automatic generation of logic code.

Introduction to program generation

How to improve the efficiency and quality of software development has long been a central concern of software engineering. Automatic program generation is regarded as an important way to raise both the degree of automation and the final quality of software development, and it has attracted wide attention in academia and industry. It refers to techniques that automatically generate source code from user requirements, so that programming itself is automated. This reduces the development burden on programmers and lets them concentrate on delivering business value.

In academia, program generation and code completion form an important branch of program synthesis, whose purpose is to assist or even replace programmers in writing programs. Returning to (front-)end code generation, the code can be divided into view code and logic code. Schemes that generate view code automatically from design drafts already exist; this article focuses on the generation of logic code. Common approaches to generating logic code include generation from visual orchestration, generation from input and output examples, generation and completion from a code corpus, and generation from function descriptions.

Comparison of technical schemes for automatic generation of logic code

[Figure: comparison of technical schemes for automatic generation of logic code]

Code generation based on visual layout

Visual programming has been developing for decades. With the help of a component-based visual platform, even “Xiaobai” (novices) without professional coding skills or development experience can organize or take part in application development, so that development is no longer the exclusive job of programmers. Such platforms mainly let users assemble the interfaces of an application, like building blocks, from controls provided by the tool itself, so visual arrangement is best suited to generating interface view code. There are at least a dozen platforms abroad focusing on visual and low-code programming, of which OutSystems, Mendix and Salesforce are among the most representative. There are also many visualization schemes for logic code, for example Blockly-style block programming:
[Figure: Blockly-style visual programming of logic code]
Blockly-style tools suit relatively simple logic and are widely used in children’s programming, but once the logic becomes complex the pile of blocks becomes huge. In large companies, if such a system is used directly by the business side, it is too complex: it requires programming thinking, which is quite different from PRD thinking, so the threshold is high. If it is used by programmers, graphics merely replace the programming language; because programmers are already familiar with the language and with its debugging and supporting environment, generating code by visual construction is less efficient than writing it.

Other mainstream approaches to visual logic orchestration (shown in the figure below) emphasize the efficiency gained from a “visual flow” and from reusing logic nodes. The main body is one large flow, while details such as inputs, outputs and logic processing are captured in the form-based rules of each node. This approach suits scenarios with a lot of reusable logic that can be abstracted as “nodes” in a flow chart. The granularity of node reuse is the key. When the granularity is large enough that each node is a functional service, the result is business process orchestration; scenarios with a high degree of business process reuse are well suited to letting business parties orchestrate directly in a what-you-see-is-what-you-get manner. When the granularity is smaller, it can be regarded as business logic orchestration. If the granularity is too small, users need programming thinking, and for people who have it, writing code directly is more efficient than visual orchestration.
[Figure: flow-based visual orchestration of logic nodes]
Summary: generating code through visual logic orchestration is still relatively costly, and in complex cases even more costly than writing the code by hand. It fits scenarios with highly reusable logic, where the logic can be abstracted into reusable nodes for orchestration and the reuse of nodes brings part of the efficiency gain.
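The idea of a flow made of reusable logic nodes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Node class, the run_flow function and the three coupon-related nodes are hypothetical names invented here, not part of any platform mentioned above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


@dataclass
class Node:
    """A reusable logic node: a name plus a function that transforms the context."""
    name: str
    run: Callable[[Context], Context]


def run_flow(nodes: List[Node], context: Context) -> Context:
    """Execute the nodes in order, passing each node's output to the next one."""
    for node in nodes:
        context = node.run(context)
    return context


# Hypothetical reusable nodes for a "claim a coupon" flow.
check_login = Node(
    "check_login",
    lambda ctx: {**ctx, "allowed": ctx.get("user") is not None},
)
claim_coupon = Node(
    "claim_coupon",
    lambda ctx: {**ctx, "coupon": "10-off" if ctx["allowed"] else None},
)
show_toast = Node(
    "show_toast",
    lambda ctx: {**ctx, "toast": "Coupon claimed" if ctx["coupon"] else "Please log in"},
)

# The "flow chart" is just an ordered list of nodes; other flows can reuse them.
flow = [check_login, claim_coupon, show_toast]
print(run_flow(flow, {"user": "alice"})["toast"])
```

In this sketch the nodes are fairly coarse, which is the case where orchestration pays off; if each node shrank to a single expression, the list would simply mirror hand-written code.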

Code generation based on input and output samples

Automatically deriving a logic program from input and output examples is known as PBE (Programming by Examples), a program synthesis technique described in a Microsoft paper in 2016. A successful real-world application that many readers will know is Flash Fill in Excel, which quickly infers the formula for a column from a few examples. For example, in the figure below, the first column is filled with even numbers based on 2 and 4, and the second column is filled as an arithmetic sequence inferred from 50 and 40.
[Figure: Flash Fill completing number sequences in Excel]
Take a more complex case:
[Figure: a more complex Flash Fill example]
The formula above is more complex. In practice, the accuracy of Flash Fill is less than 50%. Later, NPBE (Neural Programming by Example) was proposed on the basis of PBE and covers 45 string operations in Excel. The idea of NPBE is to let the model learn features from the input and output strings and use the learned features to induce the correct program. The general implementation process of PBE is as follows:
[Figure: general PBE implementation process]
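To make the PBE idea concrete, here is a minimal sketch of enumerative synthesis: it searches for the shortest pipeline of string operations that reproduces every input and output example. The six-operation DSL and the function names are assumptions made for this illustration; it is not the Flash Fill or NPBE algorithm.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# A tiny, hypothetical DSL of string operations (not the Flash Fill DSL).
OPERATIONS: Dict[str, Callable[[str], str]] = {
    "lower": str.lower,
    "upper": str.upper,
    "strip": str.strip,
    "first_word": lambda s: s.split()[0] if s.split() else "",
    "last_word": lambda s: s.split()[-1] if s.split() else "",
    "initial": lambda s: s[:1],
}


def run_program(program: Sequence[str], text: str) -> str:
    """Apply the named operations to the text from left to right."""
    for name in program:
        text = OPERATIONS[name](text)
    return text


def synthesize(examples: List[Tuple[str, str]], max_depth: int = 3) -> Optional[List[str]]:
    """Return the shortest pipeline consistent with all examples, or None."""
    for depth in range(1, max_depth + 1):
        for program in product(OPERATIONS, repeat=depth):
            if all(run_program(program, src) == dst for src, dst in examples):
                return list(program)
    return None


examples = [("Zhou Tingting", "TINGTING"), ("Ada Lovelace", "LOVELACE")]
program = synthesize(examples)
print(program)
if program is not None:
    # The induced program is then applied to an input it has never seen.
    print(run_program(program, "Grace Hopper"))
```

Exhaustive search only works because the DSL is tiny; the search space grows exponentially with pipeline length, which is one reason learned models such as NPBE are used to guide the search.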

Generating complete code based on code corpus

With the accumulation of open source code on GitHub and the rise of deep learning, many applications based on source code understanding (program understanding) have appeared, such as code completion recommendation and inferring the meaning of a code fragment from the code itself.

Code completion

Integrated development environments commonly include code completion tools, but these are generally limited to keyword or syntax-based prompts; for example, typing “this.” in a TSX file prompts the class attributes and methods that are currently available. Existing intelligent code completion tools usually rank their suggestions with a statistical model, most often one based on static word frequency statistics.

For example, our co-construction team developed a code completion plug-in based on a JavaScript code corpus and GPT-2, using an n-gram probability model. The main flow is as follows:
[Figure: main flow of the code completion plug-in]
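The n-gram part of such a completion model can be illustrated with a bigram counter over code tokens. This is a minimal sketch, not the plug-in described above: the tokenizer regex, the three-line corpus and the function names are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict
from typing import DefaultDict, List, Tuple

# Simplified tokenizer: identifiers, integers, or single punctuation characters.
TOKEN = re.compile(r"[A-Za-z_$][\w$]*|\d+|[^\s\w]")


def tokenize(code: str) -> List[str]:
    return TOKEN.findall(code)


def train_bigrams(corpus: List[str]) -> DefaultDict[str, Counter]:
    """Count how often each token follows each other token in the corpus."""
    counts: DefaultDict[str, Counter] = defaultdict(Counter)
    for snippet in corpus:
        tokens = tokenize(snippet)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def suggest(counts, prefix: str, top_k: int = 3) -> List[Tuple[str, float]]:
    """Suggest the most likely next tokens, with their estimated probabilities."""
    tokens = tokenize(prefix)
    if not tokens:
        return []
    followers = counts.get(tokens[-1], Counter())
    total = sum(followers.values())
    return [(tok, n / total) for tok, n in followers.most_common(top_k)]


# A toy JavaScript corpus (hypothetical).
corpus = [
    "const list = items.map(item => item.id);",
    "const names = users.map(user => user.name);",
    "const total = prices.reduce((sum, p) => sum + p, 0);",
]
model = train_bigrams(corpus)
print(suggest(model, "const ids = rows."))
```

A bigram model only looks at the previous token, so it cannot tell which object precedes the dot; longer n-grams or a neural language model such as GPT-2 condition on more context.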

Code intent generation

Code intent generation means inferring what a piece of code does from its content. Well-known open source models and services in the industry include code2vec and code2seq; code2vec performs code summarization and code2seq performs code captioning.

A code2vec example: for the code fragment on the left, the model predicts with 90.93% probability that the function is “contains”. On the right is a visualization of the analysis, where the thickness of the connections between nodes represents the weight of the information used in the decision.
[Figure: code2vec example, code fragment and attention visualization]
A code2seq example: for the code fragment on the left, the generated function description is “save bitmap to file”. On the right is the corresponding visualization, where connection thickness again represents the weight of the decision information.
[Figure: code2seq example, code fragment and attention visualization]
Its internal model is shown below. The dataset behind the currently open source model is at the billion scale, and a TypeScript-based model has also been open sourced.
[Figure: internal model]


Code generation based on function description

People usually describe the function of a program in natural language, and going from a natural language description to an automatically generated program is a very challenging problem. The diversity of natural language text and of programs, the ambiguity of text and the complex structure of code all make it difficult to establish the connection between text and code. The industry has made some explorations so far, including NL2SQL, NL2IFTTT and code generation based on function description.

NL2SQL

Natural language to SQL (NL2SQL) is a new research hotspot in CUI (Conversational User Interface). Its purpose is to convert the natural language entered by users into executable SQL statements, improving the efficiency of data queries. The biggest obstacle to realizing the value of big data today is that the threshold for accessing data is too high: it depends on database administrators to write complex SQL, and Chinese expressions are even more diverse and complex. There is a lot of research in this field at home and abroad. At present, the best-known NL2SQL datasets are in English, including WikiSQL, Spider and CoSQL.

The typical three-tier architecture of NL2SQL is as follows:
[Figure: typical three-tier architecture of NL2SQL]
NL2SQL is divided into three parts: the user query interface, the processing unit and the database. The processing unit is the center of the whole architecture and the core of semantic parsing. It opens up the interaction channel between the user and the database, and involves technical points such as intelligent word segmentation, entity recognition and knowledge retrieval. In the research mentioned above, the internal algorithms of the processing unit are also gradually moving towards deep learning. The M-SQL model, proposed by the champion team of the Chinese NL2SQL competition, achieved 92% accuracy on the Chinese dataset.

In industry, Google’s Analyza is built on semantic parsing and rules; it is highly controllable, but some rules have to be maintained by hand. The end-to-end alternative implements NL2SQL with deep learning and an encoder-decoder architecture. The algorithm is divided into the recognition of several SQL clauses, including the SELECT clause, the WHERE clause, and sometimes GROUP BY, LIMIT and other operators. Each part involves many details, such as table recognition and attribute recognition. Different algorithms share this framework and vary and optimize the details to achieve better results. Although end-to-end NL2SQL reduces the cost of manual maintenance, it is effective only in simple scenarios such as WikiSQL. For the more complex Spider or CoSQL scenarios, its accuracy is very low and cannot meet the requirements of commercial applications. The solution actually deployed by the company’s Xiaomi team is very similar to Google’s Analyza; the team has also studied and implemented end-to-end NL2SQL, and the next step is to combine NL2SQL with semantic rule parsing to handle complex scenarios. [The conclusions of this part come from the internal Xiaomi team.]
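The three parts (user query, processing unit, database) and the clause-by-clause view of SQL can be sketched with a small rule-based example. Everything here is an assumption made for illustration: the shops table, the COLUMN_SYNONYMS vocabulary and the keyword rules are hypothetical, and the sketch is neither Analyza nor M-SQL.

```python
import re
import sqlite3
from typing import List, Tuple

# Hypothetical schema vocabulary: words in the question mapped to column names.
COLUMN_SYNONYMS = {"name": "name", "price": "price", "cost": "price", "city": "city"}
TABLE = "shops"


def to_sql(question: str) -> Tuple[str, List[str]]:
    """Processing unit: fill the SELECT and WHERE slots with simple keyword rules."""
    text = question.lower().strip(" ?.")
    head, _, tail = text.partition(" where ")

    # SELECT clause: known columns mentioned before "where", in order, de-duplicated.
    columns = [COLUMN_SYNONYMS[w] for w in re.findall(r"[a-z]+", head) if w in COLUMN_SYNONYMS]
    select = ", ".join(dict.fromkeys(columns)) or "*"
    sql = f"SELECT {select} FROM {TABLE}"

    # WHERE clause: "<column> is <value>" pairs; values are passed as parameters.
    conditions, params = [], []
    for column, value in re.findall(r"([a-z]+) is ([\w.]+)", tail):
        if column in COLUMN_SYNONYMS:
            conditions.append(f"{COLUMN_SYNONYMS[column]} = ?")
            params.append(value)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


# Database: an in-memory SQLite table with toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shops (name TEXT, city TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO shops VALUES (?, ?, ?)",
    [("West Lake Tea", "hangzhou", 30), ("Bund Coffee", "shanghai", 45)],
)

# User query interface: a natural language question.
sql, params = to_sql("What is the name and price of shops where city is hangzhou?")
rows = conn.execute(sql, params).fetchall()
print(sql, params, rows)
```

Hand-written rules like these are controllable but brittle, which matches the trade-off described above: an end-to-end model would replace the keyword rules with learned clause predictors.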

NL2IFTTT

What is IFTTT? IFTTT, short for “if this then that”, is a network service platform that decides whether to execute the next command according to conditions on other platforms. For example: “If it rains tomorrow, please let me know today.” Or: “When someone tags you in a photo on Facebook, automatically back up the photo to the iPhone photo album.” This meets the user’s need to connect the content of service A to service B without doing it manually; IFTTT completes these actions automatically.

Function example

Such as the following application:

  • Please let me know if it rains tomorrow
  • Please give me a weather forecast at 7 o’clock every day

The settings are as follows:
[Figure: IFTTT settings for the requests above]
NL2IFTTT generates “if this then that” (IFTTT) code from natural language. Compared with common programming languages, IFTTT programs have a simpler structure whose rules are easier to learn. IFTTT is a task-based conditional trigger, similar to a minimalist programming language: “if XXX performs behavior YYY, execute ZZZ”. Each website that can trigger or carry out a task is called a channel, the trigger condition is called a trigger, and the subsequent task is called an action. The whole process is called a recipe.
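The channel, trigger, action and recipe terminology can be written down as a small data structure, together with a toy keyword mapper from a sentence to a recipe. The channel names, event names and keyword rules below are hypothetical and chosen only for illustration; real NL2IFTTT systems learn this mapping from data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Trigger:
    channel: str  # the service that fires the condition
    event: str    # the trigger condition on that channel


@dataclass(frozen=True)
class Action:
    channel: str  # the service that carries out the task
    task: str     # the task to perform


@dataclass(frozen=True)
class Recipe:
    trigger: Trigger
    action: Action


# Hypothetical keyword rules; the first matching keyword wins.
TRIGGER_KEYWORDS: List[Tuple[str, Trigger]] = [
    ("rain", Trigger("Weather", "tomorrow_forecast_calls_for_rain")),
    ("every day", Trigger("Date & Time", "every_day_at")),
]
ACTION_KEYWORDS: List[Tuple[str, Action]] = [
    ("let me know", Action("Notifications", "send_notification")),
    ("give me", Action("Notifications", "send_notification")),
]


def first_match(text: str, rules: List[Tuple[str, T]]) -> Optional[T]:
    for keyword, value in rules:
        if keyword in text:
            return value
    return None


def parse_recipe(description: str) -> Optional[Recipe]:
    """Map a natural language request to a recipe, or None if no rule applies."""
    text = description.lower()
    trigger = first_match(text, TRIGGER_KEYWORDS)
    action = first_match(text, ACTION_KEYWORDS)
    if trigger is None or action is None:
        return None
    return Recipe(trigger, action)


print(parse_recipe("Please let me know if it rains tomorrow"))
print(parse_recipe("Please give me a weather forecast at 7 o'clock every day"))
```

Because a recipe has only four slots, the task reduces to predicting a trigger and an action, which is why IFTTT programs are easier to generate than general-purpose code.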

In 2016, Liu et al. proposed a latent attention mechanism, which effectively learns which words in the natural language description matter more for predicting the trigger and which matter more for predicting the action. In the same year, Beltagy et al. treated IFTTT program generation as a semantic parsing problem.

Sample dataset

Microsoft has an open source IFTTT sample set. Its sample data looks like the following, which helps to understand the problem definition of NL2IFTTT:
[Figure: sample data from the IFTTT dataset]

NL2Code-TranX

Compared with generating IFTTT code or SQL from a natural language function description, it is much more difficult to generate logic code directly in programming languages such as Python, Java and JavaScript from a natural language requirement description. Up to now (August 2020), no related online service is available. We have found that Carnegie Mellon University has published related research, TranX, which can generate expression-level code from a single function description, similar to generating the corresponding code from a one-line code comment. See its website demo below.

TranX function example

[Figures: TranX website demo examples]

Sample TranX dataset


[
  {
    "intent": "Sending http headers with python",
    "rewritten_intent": "sending http headers to `client`",
    "snippet": "client.send('HTTP/1.0 200 OK\r\n')",
    "question_id": 8315209
  },
  {
    "intent": "Python -Remove Time from Datetime String",
    "rewritten_intent": "Format a datetime string `when` to extract date only",
    "snippet": "then = datetime.datetime.strptime(when, '%Y-%m-%d').date()",
    "question_id": 26153795
  },
  {
    "intent": "How do I split a multi-line string into multiple lines?",
    "rewritten_intent": "split a multi-line string `inputString` into separate strings",
    "snippet": "inputString.split('\n')",
    "question_id": 172439
  },
  {
    "intent": "How do I split a multi-line string into multiple lines?",
    "rewritten_intent": "Split a multi-line string ` a \n b \r\n c ` by new line character `\n`",
    "snippet": "' a \n b \r\n c '.split('\n')",
    "question_id": 172439
  }
]

TranX model example

Programming languages have strict syntax and tolerate no spelling or grammatical errors, so the model is built on a tree-based representation, the AST, which can express the full syntax of a programming language. At the same time, annotating the corresponding corpus is expensive and time-consuming, and the limited availability of labeled samples is the bottleneck of supervised models. The model therefore introduces StructVAE, a semi-supervised auto-encoding model that can learn from both the limited labeled samples and unlabeled natural language. The model from the paper is shown below:
[Figure: model from the paper]
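To see what an AST target looks like, the sketch below parses one snippet from the sample dataset with Python's standard ast module and lists its node types in depth-first order. It only illustrates the tree that an AST-based generator has to produce; it is not TranX's actual action sequence or grammar, and node_sequence is a name invented for this example.

```python
import ast
from typing import List


def node_sequence(snippet: str) -> List[str]:
    """Return the AST node type names of the snippet in depth-first, pre-order."""
    names: List[str] = []

    def visit(node: ast.AST) -> None:
        names.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(snippet))
    return names


def is_valid_python(snippet: str) -> bool:
    """Strict syntax: a single missing parenthesis makes the snippet unparseable."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False


snippet = "inputString.split('\\n')"
print(node_sequence(snippet))
print(is_valid_python(snippet), is_valid_python("inputString.split('\\n'"))
```

Generating the tree node by node, rather than the text token by token, means every intermediate choice can be constrained by the grammar, so the output is syntactically well formed by construction.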

NL2Code-debuild

At present, the demo website is closed. From the promotional screenshots, we can see a new possibility for generating code from function descriptions: it can generate layout code from a natural language description of the layout, and it can also generate code from a simple, high-level function description.

Example of debuild function
[Figure: debuild demo screenshot]
Example of debuild model

The model used by this platform is based on GPT-3 from OpenAI. The author says that natural language code generation based on GPT-3 can bring code generation that was previously thought to be 50 years away to within 5 years. The current version is still experimental: relatively simple function descriptions can generate high-level function code (built on packaged component libraries, module libraries and so on). If you are interested, you can try GPT-3.

Summary

In general, generating program code automatically with deep learning is a trend, but program generation and code completion based on deep learning are still in their infancy. Although deep learning has made some progress over traditional methods in program generation and code completion, program generation has not yet been industrialized. It faces the following challenges:

1) The quality of training corpora varies. In existing work, the corpora used to train deep learning models fall roughly into two categories: programs written in a DSL, and projects crawled from open source communities such as GitHub. DSL-based programs tend to have simple syntax and short length and are easy to train and test on, but a model designed for a DSL is hard to generalize to other languages. Projects crawled from open source communities are closer to real software development, but the quality of the code is hard to guarantee: low-quality, non-standard code adds noise to the neural network, and differing coding conventions confuse the model during training and prediction. Obtaining a uniform, high-quality program corpus is therefore also a challenge.

2) The generalization ability for user-defined logic is weak. Much program code, especially logic code, is closed-domain business logic. Generating it requires a large amount of closed-domain business material and a high-level understanding of the services that existing businesses depend on. A new business needs to train its own model on its own material library, but the services that a new business depends on also have to be developed from scratch, so new material is hard to provide unless all the dependent services are indivisible atomic services. A universal logic model is therefore difficult to achieve, although the methodology can be shared.

3) Information asymmetry between function description and program code. Ideally, the function description would be provided directly by the PRD and the corresponding program code generated from it. However, a requirement description is more abstract than a general function description, and along the chain of requirement description, function description, code description and program code, each step loses a lot of information; at present programmers make up for these losses with business experience and programming experience. There are three ways to generate code from a function description. The first is end-to-end, from requirement description to AST; if the PD’s description of the software function is accurate enough, this is equivalent to creating a higher-level description language. The second is to encapsulate and abstract the later part of the chain (function description, code description, program code) into larger-grained function blocks; most current work takes this route because it is highly feasible (for example NL2IFTTT). The third is to go deep into each business domain, let the model understand one layer at a time, and generate code step by step. The quality of the corpus and the accuracy of the final model are both relatively big challenges.

To reduce the development burden on programmers, raise the degree of automation of software development, and improve its efficiency and quality, academia and industry have long been studying automatic program generation. With the rapid development of deep learning, we believe that more and more repetitive development work will be taken over by machines, and programmers will pay more attention to the underlying architecture and to delivering business value at the upper layer.

The front-end intelligence team of Alibaba is currently experimenting with the intelligent generation of logic code along several dimensions. You are welcome to get in touch and exchange ideas.

reference material