Overview of automatic generation of logical code

Time:2020-11-4


Author: Miao Jing


The author, Zhou Tingting (alias: Miaojing, WeChat: weekendingting), is a senior front-end technical expert in Alibaba's Taobao Technology Department, the lead of imgcook, a platform that intelligently generates code from design drafts, and a key member of the front-end intelligence effort across the Alibaba economy. She is from the F(x) team, the first AI front-end team in the industry. Last year the team focused on intelligently generating front-end view code from design drafts, but it has also been trying to generate logic code intelligently. This article introduces the technologies related to automatic generation of logical code.

Introduction to program generation

Improving the efficiency and quality of software development has always been a concern in software engineering. Automatic program generation is considered an important way to raise both the degree of automation and the final quality of software development, and it has attracted wide attention from academia and industry. It refers to automatically generating a program's source code so that programming follows directly from the user's requirements. This greatly reduces the development burden on programmers and lets them pay more attention to delivering business value.

In academia, program generation and code completion form an important branch of program synthesis, whose goal is to assist or even replace programmers in writing programs. Automatically generated front-end (or back-end) code is divided into view code and logic code. For view code, there are schemes that generate code automatically from design drafts. This article focuses on techniques for generating logical code. Common approaches include generation from visual layout, generation from input and output examples, generation and completion from a code corpus, and generation from function descriptions.

Comparison of technical schemes for automatic generation of logical code

[Figure: comparison of technical schemes for automatic generation of logical code]

Generating code based on visual layout

Visual programming has been developing for decades. With the help of component-based visual platforms, novices without professional coding skills or development experience can organize or take part in application development on their own, which extends development from a programmer-only activity to a wider population. These platforms mainly let users assemble the interfaces of an application from the controls the tool provides, like building blocks, so visual layout is better suited to generating interface view code. There are currently at least a dozen visual programming and low-code platforms outside China, the most representative being OutSystems, Mendix and Salesforce. There are also many visualization schemes for logical code, as follows:
[Figure: visual programming schemes for logical code, such as Blockly]
Schemes such as Blockly in the figure above suit simple logic programming and work well in children's programming today. But when the logic is complex, the stack of blocks becomes huge. If the business side of a large company uses such a system directly, it is too complex: it demands programming thinking, which is entirely different from PRD thinking, so the threshold is high. For programmers, it merely replaces the programming language with graphics. Because programmers already know the language and its debugging environment, building code visually is less efficient for them.

Other mainstream visual layouts of logic code (as shown in the figure below) emphasize the efficiency gained from a "visual flow" and from reusing logical nodes. The main body is one large process, and the details of input, output and logic processing are held in the form rules of each node. This scheme suits scenarios with a lot of reusable logic, where the reused logic can be abstracted into "nodes" of a flow chart. The granularity of node reuse is critical. If a node is as large as a function service, the result is business process choreography, and scenarios with highly reusable business processes are well suited to letting the business side view the choreography directly. If a node is as small as an expression, the result can be regarded as business logic choreography. If the granularity is too small, the user needs programming thinking, and for people who already have programming thinking it may be more efficient to write the code directly.
[Figure: flow-based visual layout of logic code]
Summary: generating code through visual logic layout is relatively costly, in complex cases even costlier than writing the code by hand. It is better suited to scenarios with a high degree of logic reuse, where the logic can be abstracted into reusable nodes for choreography and node reuse brings part of the efficiency gain.

Generating code based on input and output examples

Automatically deriving a logic program from input and output examples is known as PBE (Programming by Examples), a program synthesis approach described by Microsoft in 2016. A successful real-world application is Excel's Flash Fill, which you are probably familiar with: it quickly generates a formula for table cells from a few examples. In the figure below, the first column is filled with even numbers based on 2 and 4, and the second column is derived as an arithmetic sequence from 50 and 40.
[Figure: Flash Fill completing two columns from examples]
Take a more complex situation as an example:
[Figure: a more complex Flash Fill example]
The formula above is more complex. In practice, the accuracy of Flash Fill is below 50%. Later, NPBE (Neural Programming by Example) was proposed on top of PBE to handle 45 string operations in Excel. The idea of NPBE is to let the model learn features from the input and output strings and use them to induce the correct program. The technical implementation process of general PBE is as follows:
[Figure: general PBE implementation process]
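
To make the idea of PBE concrete, here is a minimal sketch of enumerative synthesis: it tries sequences of string primitives until one reproduces every example. The tiny DSL (`PRIMITIVES`) and the helper names are hypothetical and chosen only for illustration; real systems such as Flash Fill use a far richer DSL and much smarter search.

```python
from itertools import product

# Hypothetical mini-DSL: every primitive maps a string to a string.
PRIMITIVES = {
    "lower": str.lower,
    "upper": str.upper,
    "strip": str.strip,
    "first_word": lambda s: s.split()[0] if s.split() else "",
    "last_word": lambda s: s.split()[-1] if s.split() else "",
    "initial": lambda s: s[:1],
}


def run(program, text):
    """Apply the primitives named in `program` from left to right."""
    for name in program:
        text = PRIMITIVES[name](text)
    return text


def synthesize(examples, max_len=3):
    """Return the shortest primitive sequence consistent with all examples."""
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in examples):
                return program
    return None  # nothing in the search space fits the examples


examples = [("Ada Lovelace", "LOVELACE"), ("alan turing", "TURING")]
program = synthesize(examples)
generalized = run(program, "grace hopper")
```

Two examples are enough here to pin down "upper-case, then take the last word"; with ambiguous examples the search simply returns the first consistent program, which is one reason accuracy drops on harder cases.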

Generating complete code based on code corpus

With the accumulation of open source code on GitHub and the rise of deep learning, more applications are being built on source code understanding (program understanding), such as code completion recommendations and inferring the intent of a code fragment from the code itself.

Code completion

Common integrated development environments usually integrate code completion tools, which are generally limited to keyword-based or syntax-based prompts. For example, in a TSX file, typing `this.` prompts the class attributes and methods currently available. For performance reasons, existing intelligent code completion tools are often as simple as static word-frequency statistics.
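
The static word-frequency approach mentioned above can be sketched in a few lines: count which token follows which in a corpus, then suggest the most frequent followers. The regular expression, function names and three-line corpus are assumptions made for this illustration, not part of any real IDE.

```python
import re
from collections import Counter, defaultdict

# Rough tokenizer: identifiers, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_$][\w$]*|\S")


def train(corpus):
    """Count, for every token, which tokens follow it in the corpus."""
    following = defaultdict(Counter)
    for source in corpus:
        tokens = TOKEN.findall(source)
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following


def complete(following, prev_token, prefix="", k=3):
    """Suggest up to k frequent followers of prev_token that match prefix."""
    ranked = following[prev_token].most_common()
    return [tok for tok, _ in ranked if tok.startswith(prefix)][:k]


corpus = [
    "this.setState({ loading: true });",
    "this.setState({ data });",
    "this.props.onChange(value);",
]
model = train(corpus)
after_dot = complete(model, ".")
after_dot_p = complete(model, ".", prefix="p")
```

Such a model is cheap and fast, but it only sees the previous token, which is exactly the limitation that corpus-trained deep learning models try to overcome.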

Later, drawing on our colleagues' code completion model, we built a corpus-based code completion model for JavaScript. The main flow chart is as follows:
[Figure: main flow of the JavaScript code completion model]

Code intent generation

Code intent generation means inferring what a piece of code does from its content. code2vec and code2seq are well-known open source models and services in the industry. code2vec does code summarization (a summary of what the code does), and code2seq does code captioning (a description of what the code does).

An example of code2vec follows. For the code fragment on the left, the model predicts with 90.93% probability that the function is `contains`. The right side visualizes the analysis of the code; pay attention to the thickness of the connections between nodes, which represents how much weight each piece of information carries in the decision.
[Figure: code2vec example]
An example of code2seq follows. For the code fragment on the left, the predicted function is "save bitmap to file". The right side again visualizes the analysis, with the thickness of the connections between nodes representing the weight of the information in the decision.
[Figure: code2seq example]
Its internal model is shown below. The dataset behind the current open source model is on the order of billions, and there is also an open source model based on TypeScript.
[Figure: code2seq internal model]
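
The "connection thickness" in these visualizations corresponds to attention weights. As a rough sketch, assuming the attention formulation described in the code2vec paper (a softmax over the scores of path-context vectors against a learned attention vector), the weighting can be written as below. All vectors here are made-up toy numbers; in the real model they are learned.

```python
import math


def attention_weights(scores):
    """Softmax: turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(context_vectors, attention_vector):
    """Weight each path-context vector and sum them into one code vector."""
    scores = [
        sum(c * a for c, a in zip(vec, attention_vector))
        for vec in context_vectors
    ]
    weights = attention_weights(scores)
    dim = len(attention_vector)
    code_vector = [
        sum(w * vec[i] for w, vec in zip(weights, context_vectors))
        for i in range(dim)
    ]
    return weights, code_vector


# Toy values: three path contexts in a 2-dimensional space.
contexts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attention = [2.0, 0.0]
weights, code_vector = aggregate(contexts, attention)
```

The context with the largest weight is the one that would be drawn with the thickest line.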

Generating code based on function description

People usually describe program functions in natural language, and generating a program automatically from a natural language description is a very challenging problem. The diversity of both natural language text and programs, the ambiguity of text, and the complex structure of code make it hard to establish a mapping between text and code. The industry has made some explorations so far, including NL2SQL, NL2IFTTT, and code generation based on function descriptions.

NL2SQL

Natural language to SQL (NL2SQL) is a new research hotspot in CUI (conversational user interface). Its goal is to transform the natural language a user enters into a usable SQL statement and so make querying data more efficient. Today the biggest obstacle to realizing the value of big data is that the threshold for accessing data is too high: users depend on database administrators to write complex SQL, and Chinese expressions are especially diverse and complex. There is a lot of research in this field in China and abroad. The best-known NL2SQL datasets are currently in English, including WikiSQL, Spider and CoSQL.

The typical three-tier architecture of NL2SQL is as follows:
[Figure: typical three-tier architecture of NL2SQL]
NL2SQL is divided into three parts: the user query interface, the processing unit and the database. The processing unit is the center of the whole architecture and the core of semantic parsing. It opens the interaction channel between the user and the database, and involves techniques such as intelligent word segmentation, entity recognition and knowledge retrieval. In the research mentioned above, the internal algorithm of the processing unit is also gradually moving toward deep learning. The M-SQL model, proposed by the champion team of the Chinese NL2SQL competition, achieved 92% accuracy on the Chinese dataset.

In industry, Google's Analyza is built on semantic parsing and rules. This approach is controllable, but some rules have to be maintained by hand. The end-to-end approach implements NL2SQL with deep learning and an encoder-decoder. The algorithm is split into recognizing several SQL clauses, including the SELECT clause, the WHERE clause, and sometimes the GROUP BY and LIMIT operators. Each part also involves many details, such as table recognition and attribute recognition. Different algorithms share this framework and vary and optimize the details to achieve better results. Although end-to-end NL2SQL reduces the cost of manual maintenance, it is effective only in the simple WikiSQL scenario. On the relatively complex Spider or CoSQL scenarios its accuracy is very low and cannot meet the requirements of commercial use. The solution that the company's internal Xiaomi team has shipped in real business is very similar to Google's Analyza. The team has also researched and implemented an end-to-end NL2SQL solution, and as a next step it expects to combine NL2SQL with semantic rule parsing to meet the needs of complex scenarios. [The conclusions in this part come from the internal Xiaomi team.]
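
The clause-by-clause framing can be illustrated with a small sketch: assume a model has already predicted the slots (table, selected column, aggregation, conditions), and assemble them into a parameterized query. `QuerySlots`, `Condition`, `to_sql` and the `sales` table are hypothetical names for this example; the prediction step itself, which is the hard part, is not shown.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Condition:
    column: str
    op: str
    value: object


@dataclass
class QuerySlots:
    """Slots a model would predict; here they are filled in by hand."""
    table: str
    select: str
    agg: str = ""
    where: list = field(default_factory=list)


ALLOWED_AGGS = {"", "COUNT", "MAX", "MIN", "SUM", "AVG"}
ALLOWED_OPS = {"=", ">", "<"}


def to_sql(slots):
    """Assemble a parameterized SELECT statement from predicted slots."""
    if slots.agg not in ALLOWED_AGGS:
        raise ValueError(f"unsupported aggregation: {slots.agg}")
    # Identifiers are double-quoted; this sketch assumes they contain no quotes.
    target = f'"{slots.select}"'
    if slots.agg:
        target = f"{slots.agg}({target})"
    sql = f'SELECT {target} FROM "{slots.table}"'
    clauses, params = [], []
    for cond in slots.where:
        if cond.op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {cond.op}")
        clauses.append(f'"{cond.column}" {cond.op} ?')
        params.append(cond.value)  # values are bound, never interpolated
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# "What is the total amount in Hangzhou?" with hand-filled slots.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "sales" ("city" TEXT, "amount" INTEGER)')
conn.executemany(
    'INSERT INTO "sales" VALUES (?, ?)',
    [("Hangzhou", 10), ("Hangzhou", 20), ("Beijing", 5)],
)
slots = QuerySlots("sales", "amount", "SUM", [Condition("city", "=", "Hangzhou")])
sql, params = to_sql(slots)
total = conn.execute(sql, params).fetchone()[0]
```

Restricting the output to a fixed sketch of clauses is what keeps WikiSQL-style queries tractable, and also why the same framing struggles with nested or multi-table queries like those in Spider.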

NL2IFTTT

What is IFTTT? It stands for "if this then that" and is a web service platform that decides whether to execute a command depending on conditions on other platforms. Examples are "if it rains tomorrow, please let me know today" and "when someone tags you in a photo on Facebook, back the photo up to the iPhone photo album automatically". This meets the user's need to connect the content of service A to service B, and IFTTT completes such actions automatically.

Function examples

The application is as follows:

  • If it rains tomorrow, please let me know
  • Please give me a weather forecast at 7 o’clock every day

The settings are as follows:
[Figure: IFTTT settings for the examples above]
NL2IFTTT generates "if this then that" (IFTTT) code from natural language. Compared with common programming languages, IFTTT programs have a simpler structure, and their structural rules are easier to learn. IFTTT is triggered by task conditions and resembles a minimalist programming language: "if XXX does YYY, then execute ZZZ". Every website that can trigger a task or carry one out is called a channel, the trigger condition is called a trigger, and the task to execute is called an action. The whole process is called a recipe.
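
The channel, trigger, action and recipe vocabulary above maps naturally onto a small data structure. The sketch below pairs it with a naive keyword lookup standing in for the learned model; the keyword tables and the channel, trigger and action identifiers are invented for this example and are not real IFTTT names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    trigger_channel: str
    trigger: str
    action_channel: str
    action: str


# Hypothetical keyword tables; a real system would learn this mapping.
TRIGGERS = {
    "rain": ("Weather", "tomorrow_forecast_calls_for_rain"),
    "photo": ("Facebook", "you_are_tagged_in_a_photo"),
}
ACTIONS = {
    "let me know": ("Notifications", "send_notification"),
    "back": ("iOS Photos", "add_photo_to_album"),
}


def parse(description):
    """Map a description to a Recipe, or None if either half is unknown."""
    text = description.lower()
    trigger = next((v for k, v in TRIGGERS.items() if k in text), None)
    action = next((v for k, v in ACTIONS.items() if k in text), None)
    if trigger is None or action is None:
        return None
    return Recipe(*trigger, *action)


recipe = parse("If it rains tomorrow, please let me know")
unknown = parse("Turn the lights off at midnight")
```

Because a recipe has only four slots, NL2IFTTT reduces to a handful of classification decisions, which is what makes it more tractable than generating general-purpose code.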

In 2016, Liu et al. proposed a latent attention mechanism, which effectively learns which words in the natural language description matter more for predicting triggers and which matter more for predicting actions. In the same year, Beltagy et al. treated IFTTT program generation as a semantic parsing problem.

Sample dataset

Microsoft has an open source IFTTT sample set. The sample data looks like the following, which helps clarify the problem definition of NL2IFTTT.
[Figure: sample records from the IFTTT dataset]

NL2Code-TranX

Compared with generating IFTTT code or SQL from a natural language function description, directly generating logic code in a programming language such as Python, Java or JavaScript from a natural language requirement description is much harder. Up to now (August 2020), we have not found any related online service. We did find TranX, a research result from Carnegie Mellon University, which can generate expression-level code from a single function description. This is similar to generating the corresponding code from a single-line code comment. See its website demo below.

Examples of TranX features

[Figure: TranX demo examples]

TranX dataset example


[
  {
    "intent": "Sending http headers with python",
    "rewritten_intent": "sending http headers to `client`",
    "snippet": "client.send('HTTP/1.0 200 OK\\r\\n')",
    "question_id": 8315209
  },
  {
    "intent": "Python -Remove Time from Datetime String",
    "rewritten_intent": "Format a datetime string `when` to extract date only",
    "snippet": "then = datetime.datetime.strptime(when, '%Y-%m-%d').date()",
    "question_id": 26153795
  },
  {
    "intent": "How do I split a multi-line string into multiple lines?",
    "rewritten_intent": "split a multi-line string `inputString` into separate strings",
    "snippet": "inputString.split('\\n')",
    "question_id": 172439
  },
  {
    "intent": "How do I split a multi-line string into multiple lines?",
    "rewritten_intent": "Split a multi-line string ` a \\n b \\r\\n c ` by new line character `\\n`",
    "snippet": "' a \\n b \\r\\n c '.split('\\n')",
    "question_id": 172439
  }
]
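
As a rough illustration of how such records could be prepared for training, the sketch below keeps only records whose snippet parses as Python and pairs each snippet with its rewritten intent. The helper names are hypothetical, and the two records are copied from the example above as Python literals rather than loaded from a file.

```python
import ast

records = [
    {
        "intent": "Python -Remove Time from Datetime String",
        "rewritten_intent": "Format a datetime string `when` to extract date only",
        "snippet": "then = datetime.datetime.strptime(when, '%Y-%m-%d').date()",
    },
    {
        "intent": "How do I split a multi-line string into multiple lines?",
        "rewritten_intent": "split a multi-line string `inputString` into separate strings",
        "snippet": "inputString.split('\\n')",
    },
]


def is_valid_python(snippet):
    """True if the snippet parses; it is never executed."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False


def to_training_pair(record):
    """Prefer the rewritten intent, falling back to the original title."""
    intent = record.get("rewritten_intent") or record["intent"]
    return intent, record["snippet"]


pairs = [to_training_pair(r) for r in records if is_valid_python(r["snippet"])]
```

The rewritten intent names the variables used in the snippet, which gives a model something concrete to copy into the generated code.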

TranX model example

A programming language has strict syntax and tolerates neither spelling errors nor grammatical errors. TranX therefore builds a tree-based model over the AST, which can express the full syntax of the programming language. At the same time, annotating the corresponding corpus is expensive and time-consuming, and the limited availability of labeled samples is the bottleneck for supervised models. StructVAE, a semi-supervised autoencoding model, can learn from both limited labeled samples and unlabeled natural language. The model from the paper is as follows:
[Figure: StructVAE model]
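
To show what "a tree-based model over the AST" means in practice, the sketch below linearizes a Python AST into a sequence of tree-construction actions. The action names `ApplyConstr` and `GenToken` are borrowed from the TranX paper, but the procedure is a deliberate simplification (it has no `Reduce` action and no grammar constraints) and should not be read as the actual TranX algorithm.

```python
import ast


def to_actions(node):
    """Pre-order walk: one action per AST node or primitive field value."""
    actions = [("ApplyConstr", type(node).__name__)]
    for _, value in ast.iter_fields(node):
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                actions.extend(to_actions(child))
            elif child is not None:
                actions.append(("GenToken", repr(child)))
    return actions


# Parse one of the dataset snippets as an expression and linearize it.
tree = ast.parse("inputString.split('\\n')", mode="eval")
actions = to_actions(tree.body)
```

A decoder that predicts such actions one at a time can only ever build well-formed trees, which is how a tree-based model avoids the spelling and syntax errors that a token-by-token generator may produce.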

NL2Code-debuild

Another application that generates code from function descriptions is the GPT-3-based debuild platform. Its demo website is currently closed, but its promotional material shows a new possibility for generating code from function descriptions. It can generate layout code from a description of the layout, and it can also generate code from a simple, high-level function description.

Example of debuild features
[Figure: debuild examples]
Example of debuild model

The model used in this platform is based on OpenAI's GPT-3. According to the author, generating code from natural language with GPT-3 brings code generation that previously seemed 50 years away to within five years. The current version is still experimental and mostly uses simple function descriptions to generate high-level function code (on top of packaged component libraries, module libraries and so on). Those who are interested can try GPT-3.

Summary

Generally speaking, generating program code automatically with deep learning is a trend, but program generation and code completion based on deep learning are still in their infancy. They have improved greatly on traditional methods, yet program generation has not reached industrial use. It faces the following challenges:

1) The quality of the training corpus varies. The corpora commonly used to train deep learning models fall roughly into two categories: programs constructed by hand in a DSL, and projects crawled from open source communities such as GitHub. DSL programs tend to have simple syntax and short length and are easy to train and test on, but a model designed for a DSL is hard to extend to other languages. Projects crawled from the open source community are closer to real software development, but their code quality is hard to guarantee: low-quality, non-standard code adds noise to the neural network, and code written to different programming conventions confuses the model during training and prediction. Obtaining a unified, standardized, high-quality program corpus is itself a challenge.

2) Generation of user-defined logic generalizes poorly. Much program code, logic code in particular, belongs to a closed business domain and requires a large amount of logic material from that domain. The model needs a high-level understanding of the various services the existing business depends on, and a new business needs to train its own model on its own material repository. However, the services a new business depends on often have to be developed from scratch as well, so it is hard to provide new material unless all the dependent services are indivisible atomic services. A universal logic model is therefore hard to achieve, although the methodology can be shared and is general.

3) Function descriptions and program code carry asymmetric information. Ideally, a function described in a PRD could be turned directly into the corresponding program code. However, a requirement description is higher-level and more abstract than a general function description, and the chain runs from requirement description to function description to code description to program code. A lot of information is lost at each link, and at present programmers make up for these losses with their business and programming experience. There are three approaches to generating code directly from a function description. The first is end to end, from the requirement description to the AST; if the PD's description of the software function is accurate enough, this is equivalent to creating a more advanced description language. The second encapsulates part of the chain (function description, code description, program code) into coarse-grained function block descriptions; most current work takes this route because it is highly feasible (NL2IFTTT, for example). The third goes deep into each business domain, lets the model understand each of the links above layer by layer, and generates code step by step. Here the quality of the corpus and the final accuracy of the model are relatively big challenges.

To reduce the burden on programmers and to improve the automation, efficiency and quality of software development, academia and industry have long been studying automatic program generation. With the rapid development of deep learning, we believe that more and more repetitive development work will be taken over by machines, and programmers will pay more attention to the underlying architecture and to delivering business value.

Alibaba's front-end intelligence team is running experiments on the intelligent generation of logical code along several dimensions. You are welcome to exchange views with us.

References

  • Design-draft code generation platform
  • Program understanding: present situation and future
  • Research progress of program generation and completion based on deep learning
  • Overview of automation concepts for no-code tools
  • code2vec
  • code2seq
  • Generating function-level code from function descriptions
  • https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2018/12/code2vec-popl19.pdf
  • https://openreview.net/pdf?id=H1gKYo09tX
  • https://dl.acm.org/doi/10.1145/2950290.2950334
  • debuild demo: generating code from natural language
  • debuild press release (Chinese version)
  • https://debuild.co/