Write a NoSQL database in Python

Time:2020-10-21

This paper is translated from what is a NoSQL database? Learn by writing one in Python

  • The complete sample code has been put on GitHub, please click here, this is only a minimalist demo, designed to understand the concept

  • If you have any comments or suggestions on the translation, please refer to issue for discussion

Follow up if there are updates, see the blog

Write a NoSQL database in Python


NoSQL has become a ubiquitous word in recent years. But what does “NoSQL” mean? How and why is it so useful? In this article, we will answer these questions by writing a NoSQL database in pure python (I prefer to call it “light structured pseudo code”)

OldSQL

In many cases, SQL has become a synonym for “database”,SQLyesStrctured Query LanguageRather than the database technology itself, it refers to theRDBMS(relational database management system,Relational Database Management System)Mysql, MS, SQL server and Oracle are all part of RDBMS

R in RDBMS is the most abundant part of RDBMSTableEach table is organized byTypeAssociatedColumnComposition. The types of all tables, columns and their classes are called databaseschemaSchema describes the structure of a database by describing the information of each tableCarThe table for may have the following columns:

  • Make: a string

  • Model: a string

  • Year: a four-digit number; alternatively, a date

  • Color: a string

  • VIN(Vehicle Identification Number): a string

In a table, each single entry is called a oneRowOr oneRecordIn order to distinguish each record, aPrimary key. in the tablePrimary keyIs one of the columns that uniquely identifies each rowCarTwo different rows may have the same value in the make, model, year and color columns, but for different vehicles, there must be different vins. On the contrary, as long as two rows have the same VIN, they will have the same Vin, We don’t think the other two trains need to go to the same row

Querying

SQL enables us to useQueryTo get useful informationqueryTo put it simply, a query is to ask a question to an RDBMS in a structured language and interpret the returned line as the answer to the questionbe-allRecord, we can do the following through the databaseSQL query :


SELECT Make, Model FROM Car;

Roughly translate SQL into Chinese:

  • “Select”: show me

  • “Make, model”: value of make and model

  • “From car”: for each row in the table car

If we only care about the color of cars registered in 1994, we can:


SELECT Color FROM Car WHERE Year = 1994;

At this point, we will get a list similar to the following:

Black
Red
Red
White
Blue
Black
White
Yellow

Finally, we can use the(primary key) primary keyHere is the VIN to specify and query a vehicle:


SELECT * FROM Car WHERE VIN = '2134AFGER245267'

The above query will return the property information of the specified vehicle

The primary key is defined as unique and non repeatable. That is, vehicles with a specified VIN can only appear once in the table at most. This is very important. Why? Let’s take an example

Relations

Suppose we are running a car repair business. In addition to other necessary things, we also need to track the service history of a car, that is, all the repair records on that car. Then we may create a service history that contains the following columnsServiceHistoryTable:

VIN | Make | Model | Year | Color | Service Performed | Mechanic | Price | Date

In this way, every time the vehicle is repaired, we add a new line to the table and write in the service what we did, which repairman it was, how much and how long it took

But wait a minute, we all know that for the same car, all the columns related to the vehicle’s own information are unchanged. In other words, if my black 2014 Lexus RX 350 is trimmed 10 times, even if the information such as make, model, year and color does not change, the information is still recorded repeatedly every time. Compared with invalid duplicate records, it is more reasonable to store this information only once and query it when necessary.

So what to do? We can create a second table:VehicleIt has the following columns:

VIN | Make | Model | Year | Color

In this way, for theServiceHistoryWe can simplify the table into the following columns:

VIN | Service Performed | Mechanic | Price | Date

You may ask, why do vins appear in both tables? Because we need to have a way to confirm thatServiceHistoryTabularthisA car is a carVehicleIn the tablethatVehicles, that is to confirm that the two records in the two tables represent the same vehicle. In this way, we only need to store the information of each vehicle once. Every time the vehicle comes for maintenance, we willServiceHistoryCreate a new row in the table without having toVehicleAdd a new record to the table. After all, they mean the same car.

We can expand through SQL query statementsVehicleAndServiceHistoryImplicit relationships contained in two tables:


SELECT Vehicle.Model, Vehicle.Year FROM Vehicle, ServiceHistory WHERE Vehicle.VIN = ServiceHistory.VIN AND ServiceHistory.Price > 75.00;

The query aims to find the model and year of all vehicles with maintenance costs greater than $75.00VehicleAndServiceHistoryThe VIN values in the table are used to filter the records that meet the conditions“ Vehicle.Model “And” Vehicle.Year “, which means we just want toVehicleThe two columns in the table

If we don’t have a databaseIndexesThe above query needs to be executedTable scanTo locate the row that matches the query requirements. Table scan checks each row in the table in order, which is usually very slow. In fact, table scan is actually the slowest of all queries.

You can avoid scanning the table by referring to the Le CaSO. We can think of index as a data structure, which can quickly find a specified value (or some values in a specified range) on the indexed column by pre sorting. That is to say, if we have an index on the price column, we do not need to scan the whole table row by row to determine whether the price is greater than or not75.00Instead, you just need to use the information contained in the index to “jump” to the first price higher than75.00And return each subsequent row (since the index is ordered, the prices of these rows are at least75.00)。

When dealing with a large number of data, index is an indispensable tool to improve the query speed. Of course, as with all things, there are gains and losses. Using an index leads to some extra consumption: the data structure of the index consumes memory, which can be used to store data in the database. This requires us to weigh the pros and cons and find a compromise, but indexing columns that are frequently queried isveryCommon practice.

The Clear Box

Thanks to the database’s ability to check a table’sschema(it describes what type of data each column contains), advanced features like index can be implemented, and a reasonable decision can be made based on the data. In other words, for a database, a table is actually the antonym of “black box” (or transparent box)?

Keep this in mind when we talk about NoSQL databases. When it comes toqueryThis is also a very important part of the capabilities of different types of database engines.

Schemas

We already know, a watchschemaDescribes the name of the column and the type of data it contains. It also includes other information, such as which columns can be empty, which columns do not allow duplicate values, and all other restrictions on columns in the table. A table can only have one schema at any time, andAll rows in the table must comply with the schema

This is a very important constraint. Suppose you have a database table with millions of consumer information. Your sales team wants to add some additional information (e.g., age of users) to improve the accuracy of their email marketing algorithms. This needs to comeAlter (change)Existing table – add a new column. We also need to decide whether each row in the table requires a value for that column. In general, it makes sense to have a value in a column, but doing so may require information that we cannot easily access (such as the age of each user in the database). Therefore, at this level, there are also some trade-offs.

In addition, making changes to a large database is usually not a trivial matter. In case of errors, it is important to have a rollback scheme. But even so, once schema changes are made, we are not always able to undo them. Schema maintenance is probably one of the most difficult parts of a DBA’s work.

Key/Value Stores

Before the word NoSQL existed, likememcachedIn this wayKey / storesThe function of data storage can be provided without table schema. In fact, in K / V storage, there is no concept of “table”. onlyKeysAndValuesIf key value storage sounds familiar, it’s probably because the building principles of this concept are similar to those of PythondictAndsetConsistent: use hash table to provide fast key based data queries. A primitive NoSQL database based on Python is simply a large dictionary

To understand how it works, write one yourself! Let’s start with some simple design ideas:

  • A pythondictAs the main data storage

  • Only string type is supported as key

  • Supports storing integer, string and list

  • A simple TCP / IP server using ascll string is used to deliver messages

  • Some likeINCREMENT, DELETE , APPENDandSTATSSuch an advanced command

One advantage of having an ASCII based TCP / IP interface for data storage is that we use simpletelnetThe program can interact with the server without the need for a special client (although this is a great exercise and only requires 15 lines of code).

We need a “wired format” for the information we send back to the server and the rest. Here is a simple explanation:

Commands Supported

  • PUT

    • Parameters: key, value

    • Objective: to insert a new entry into the database

  • GET

    • Parameter: key

    • Purpose: to retrieve a stored value from the database

  • PUTLIST

    • Parameters: key, value

    • Purpose: to insert a new list entry into the database

  • APPEND

    • Parameters: key, value

    • Purpose: to add a new element to an existing list in the database

  • INCREMENT

    • Parameter: key

    • Objective: to increase an integer value in the database

  • DELETE

    • Parameter: key

    • Purpose: to delete an entry from the database

  • STATS

    • Parameter: none (n / a)

    • Purpose: to request statistics on success / failure of each execution command

Now let’s define the structure of the message itself.

Message Structure

Request Messages

OneRequest messageIt contains a command, a key, a value and a value type. The last three depend on the message type and are optional and not required.;Is used as a separator. Even if the options are not included, there must be three in the message;Character.


COMMAND; [KEY]; [VALUE]; [VALUE TYPE]
  • COMMANDIs one of the commands in the list above

  • KEYIs a string that can be used as a database key (optional)

  • VALUEIs an integer, list or string in the database (optional)

    • A list can be represented as a string separated by commas, for example, “red, green, blue”

  • VALUE TYPEDescribedVALUEWhat type should be explained

    • Possible type values are: int, string, list

Examples
  • "PUT; foo; 1; INT"

  • "GET; foo;;"

  • "PUTLIST; bar; a,b,c ; LIST"

  • "APPEND; bar; d; STRING"

  • "GETLIST; bar; ;"

  • STATS; ;;

  • INCREMENT; foo;;

  • DELETE; foo;;

Reponse Messages

OneResponse messageIt consists of two parts;Separate them. The first part is alwaysTrue|FalseIt depends on whether the command you are executing is successful. When an error message appears, the second message will be displayed. For successful commands, if we don’t want the default return value (for examplePUT)The message of success will appear. If we return the value of the successful command (for exampleGET)Then the second part will be its own value.

Examples
  • True; Key [foo] set to [1]

  • True; 1

  • True; Key [bar] set to [['a', 'b', 'c']]

  • True; Key [bar] had value [d] appended

  • True; ['a', 'b', 'c', 'd']

  • True; {'PUTLIST': {'success': 1, 'error': 0}, 'STATS': {'success': 0, 'error': 0}, 'INCREMENT': {'success': 0, 'error': 0}, 'GET': {'success': 0, 'error': 0}, 'PUT': {'success': 0, 'error': 0}, 'GETLIST': {'success': 1, 'error': 0}, 'APPEND': {'success': 1, 'error': 0}, 'DELETE': {'success': 0, 'error': 0}}

Show Me The Code!

I’ll show you all the code in a block summary. The entire code is only 180 lines, and it won’t take long to read.

Set Up

Here are some sample code for our server:


"""NoSQL database written in Python"""

# Standard library imports
import socket

HOST = 'localhost'
PORT = 50505
SOCKET = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
STATS = {
    'PUT': {'success': 0, 'error': 0},
    'GET': {'success': 0, 'error': 0},
    'GETLIST': {'success': 0, 'error': 0},
    'PUTLIST': {'success': 0, 'error': 0},
    'INCREMENT': {'success': 0, 'error': 0},
    'APPEND': {'success': 0, 'error': 0},
    'DELETE': {'success': 0, 'error': 0},
    'STATS': {'success': 0, 'error': 0},
    }

It is easy to see that the above is just a package import and some data initialization.

Set up(Cont’d)

Next, I’ll skip some code so that I can continue to show the rest of the code in the preparation section above. Note that it involves some functions that do not yet exist, but that’s OK, we’ll cover them later. In the full version (which will be presented at the end), everything will be arranged in order. Here is the rest of the installation code:


COMMAND_HANDERS = {
    'PUT': handle_put,
    'GET': handle_get,
    'GETLIST': handle_getlist,
    'PUTLIST': handle_putlist,
    'INCREMENT': handle_increment,
    'APPEND': handle_append,
    'DELETE': handle_delete,
    'STATS': handle_stats,
}

DATA = {}


def main():
    """Main entry point for script"""
    SOCKET.bind(HOST, PORT)
    SOCKET.listen(1)
    while 1:
        connection, address = SOCKET.accept()
        print('New connection from [{}]'.format(address))
        data = connection.recv(4096).decode()
        command, key, value = parse_message(data)
        if command == 'STATS':
            response = handle_stats()
        elif command in ('GET', 'GETLIST', 'INCREMENT', 'DELETE'):
            response = COMMAND_HANDERS[command](key)
        elif command in (
                'PUT',
                'PUTLIST',
                'APPEND', ):
            response = COMMAND_HANDERS[command](key, value)
        else:
            response = (False, 'Unknown command type {}'.format(command))
        update_stats(command, response[0])
        connection.sandall('{};{}'.format(response[0], response[1]))
        connection.close()


if __name__ == '__main__':
    main()

We createdCOMMAND_HANDLERSIt is often referred to as aLook up table . COMMAND_HANDLERSThe job of is to associate the command with the function used to process it. For example, if we receive oneGETOrders,COMMAND_HANDLERS[command](key)It’s equivalent to sayinghandle_get(key)Remember, in Python, a function can be considered a value and can be stored in a value like any other valuedictMedium.

In the above code, although some commands request the same parameters, I decided to process each command separately. Although it’s easy to force allhandle_Function takes akeyAnd avalueBut I hope that these handler functions will be more organized, easier to test, and less likely to go wrong.

Note that socket related code is extremely simple. Although the whole server is based on TCP / IP communication, there is not much underlying network interaction code.

Finally, we should pay attention to the following points:DATADictionary, because this point is not very important, so you are likely to miss it.DATAIt is the key value pairs that are actually used to store. It is precisely these key value pairs that actually constitute our database.

Command Parser

Let’s take a look at someCommand parser, which interprets the received message:


def parse_message(data):
    """Return a tuple containing the command, the key, and (optionally) the
    value cast to the appropriate type."""
    command, key, value, value_type = data.strip().split(';')
    if value_type:
        if value_type == 'LIST':
            value = value.split(',')
        elif value_type == 'INT':
            value = int(value)
        else:
            value = str(value)
    else:
        value = None
    return command, key, value

Here we can see that type conversion has taken place. If we want the value to be a list, we can call it by calling stringstr.split(',')To get the value we want. aboutintWe can simply use theint()That’s fine. For strings andstr()The same is true.

Command Handlers

Here is the code for the command handler. They are very intuitive and easy to understand. Note that although there are a lot of error checking, but it is not comprehensive, very complex. In the course of your reading, if you find any errors, please move here to discuss


def update_stats(command, success):
    """Update the STATS dict with info about if executing *command* was a
    *success*"""
    if success:
        STATS[command]['success'] += 1
    else:
        STATS[command]['error'] += 1


def handle_put(key, value):
    """Return a tuple containing True and the message to send back to the
    client."""
    DATA[key] = value
    return (True, 'key [{}] set to [{}]'.format(key, value))


def handle_get(key):
    """Return a tuple containing True if the key exists and the message to send
    back to the client"""
    if key not in DATA:
        return (False, 'Error: Key [{}] not found'.format(key))
    else:
        return (True, DATA[key])


def handle_putlist(key, value):
    """Return a tuple containing True if the command succeeded and the message
    to send back to the client."""
    return handle_put(key, value)


def handle_putlist(key, value):
    """Return a tuple containing True if the command succeeded and the message
    to send back to the client"""
    return handle_put(key, value)


def handle_getlist(key):
    """Return a tuple containing True if the key contained a list and the
    message to send back to the client."""
    return_value = exists, value = handle_get(key)
    if not exists:
        return return_value
    elif not isinstance(value, list):
        return (False, 'ERROR: Key [{}] contains non-list value ([{}])'.format(
            key, value))
    else:
        return return_value


def handle_increment(key):
    """Return a tuple containing True if the key's value could be incremented
    and the message to send back to the client."""
    return_value = exists, value = handle_get(key)
    if not exists:
        return return_value
    elif not isinstance(list_value, list):
        return (False, 'ERROR: Key [{}] contains non-list value ([{}])'.format(
            key, value))
    else:
        DATA[key].append(value)
        return (True, 'Key [{}] had value [{}] appended'.format(key, value))


def handle_delete(key):
    """Return a tuple containing True if the key could be deleted and the
    message to send back to the client."""
    if key not in DATA:
        return (
            False,
            'ERROR: Key [{}] not found and could not be deleted.'.format(key))
    else:
        del DATA[key]


def handle_stats():
    """Return a tuple containing True and the contents of the STATS dict."""
    return (True, str(STATS))

There are two points to note:Multiple assignmentSome functions simply wrap existing functions to be more logical, such ashandle_getandhandle_getlist. because sometimes we only need the return value of an existing function, but at other times we need to check what the function actually returnsMultiple assignment

Take a lookhandle_append. if we try to callhandle_getBut if the key does not exist, we simply return ithandle_getThe content returned. In addition, we also hope to be able tohandle_getThe tuple returned is referenced as a separate return value. Then, when the key does not exist, we can simply use itreturn return_value .

If itIt does existThen we need to check the return value. And we also hope to be able tohandle_getThe return value of is referenced as a separate variable. In order to be able to deal with the above two cases and consider the case where the results need to be processed separately, we use multiple assignments. In this way, you don’t have to write multiple lines of code, while keeping the code clear.return_value = exists, list_value = handle_get(key)Be able to explicitly indicate that we are going to refer to it in at least two different wayshandle_getThe return value of.

How Is This a Database?

The above program is obviously not an RDBMS, but it is definitely a NoSQL database. The reason it’s so easy to create is that we don’t have anything to do withData (data)The actual interaction of. We just do a minimalist type check to store whatever the user sends. If we need to store more structured data, we may need to create a schema for the database to store and retrieve data.

Since NoSQL databases are easier to write, maintain, and implement, why don’t we just use mongodb? Of course, there is a reason. We need to weigh the searchability of database on the basis of data flexibility provided by NoSQL database

Querying Data

Let’s use the NoSQL database above to store earlier car data. Then we may use the VIN as the key and a list as the value of each column, that is to say,2134AFGER245267 = ['Lexus', 'RX350', 2013, Black]Of course, we’ve lost theMeaningWe only need to know that index 1 stores the car model and index 2 stores year

The bad thing is, what happens when we want to execute the previous query statement? Finding all the colors of 1994 will be a nightmare. We have to traverseDATAMediumEach valueFor example, check index 2 to see whether the value of index 2 is equal to 1994, and then continue to get the value of index 3. This is worse than table scan, because it not only needs to scan every row of data, but also needs to apply some complex rules to answer the query.

Of course, the authors of NoSQL databases are aware of these problems, and they’ve come up with ways to make queries less “out of reach.”. One way is to structure the data used, such as JSON, to allow references to other rows to represent relationships. At the same time, most NoSQL databases have the concept of namespace. A single type of data can be stored in the unique “section” of the database, which enables the query engine to use the “shape” information of the data to be queried.

Of course, although there are (and have been implemented) some more complex methods to enhance queryability, it is always an unavoidable problem to compromise between storing less schema and enhancing queryability. In this case, our database only supports query through key. If we need to support richer queries, things will be much more complicated.

Summary

At this point, I hope the concept of “NoSQL” is very clear. We learned a little bit about SQL and how RDBMS works. We saw how to retrieve data from an RDBMS (using SQLQuery)By building a toy level NoSQL database, we understand some problems between query and conciseness, and discuss some methods adopted by some database authors to deal with these problems.

Even with simple key value storage, knowledge about databases is vast. Although we’ve only discussed a few of them, I hope you have understood what NoSQL really means, how it works, and when it’s better to use it. If you want to share some good ideas, please feel free to discuss them