Real time is the future? Stream computing in the eyes of a small and micro enterprise

Time: 2020-08-01

Abstract: This article is shared by Tang Duo of the Mozhi technical team. It describes the whole process of introducing stream computing within the team, from the initial decision, through the choices made along the way, to the final landing, together with the thinking, insights and experience gathered en route.

  1. Meet Flink
  2. Why we had to go with Flink
  3. A small example
  4. Summary

Tips: “Real time is the future” may be just a slogan in the eyes of many people, but at Mozhi, this is a story they wrote themselves.

Hello everyone, we are Zhejiang Mozhi Information Technology Co., Ltd., a startup team just over three years old. Our main business is e-commerce agency operation, and we are currently a Taobao four-star service provider.

Our core team has served well-known domestic brands across women’s wear, household appliances, mother-and-baby, men’s wear, children’s wear, jewelry, cosmetics and other categories. We have rich experience in brand operation and management, and the brands we have served are at the forefront of their industries.

Our main business focuses on Internet platform brand operation and whole-network brand promotion in the pan-fashion field (clothing, mother-and-infant, beauty, home living and jewelry), covering end-to-end services such as brand positioning and promotion, e-commerce operation, commodity planning and operation, visual design, marketing promotion, customer service, and warehousing and logistics.

This article will share the story of Mozhi and stream computing.

01 Meet Flink

My first contact with Flink stream computing was at the Yunqi Conference held in September 2018, where Mr. Dasha shared Flink with the developers present and online. The venue was packed, with the audience standing three to five rows deep outside. Although the talk was not long and I only picked up a little of it, it left me with one strong impression: “real time is the future.”

After coming back from Yunqi Town, we discussed it with the team and decided to adopt Flink, but the difficulty exceeded our expectations. At that time there were very few learning materials, and we read one “Flink basic course” over and over. The threshold for hands-on practice was high and progress was far from ideal.

Figure 1: The stream computing sub-venue at the Yunqi Conference

In March 2019, I had the honor of attending a Flink user exchange meeting held in Hangzhou. When I signed up, I went purely with the mentality of learning, but I was shocked on arrival: not only deep users of Flink, but also people from large companies valued at more than 10 billion yuan were present. Both the content of the discussion and the backgrounds of the attendees made us feel small.

The day after we came back, all five of us went to the company to work overtime. Even without spelling it out, the impact of that meeting on everyone was huge, and it made us determined to apply Flink no matter how difficult it was.

A month later, our first Flink job, written in Java, went online. Even though its function was very simple, it was a small but solid step for us.

Figure 2: A picture widely circulated in the community

At the beginning of 2020, the epidemic was raging and team members changed; objective conditions forced us to give up everything written in Java and switch to Python. It was a very difficult decision, and we knew it meant going back to the beginning.

But our relationship with Flink was not over. Just then, we saw that the community had launched the PyFlink support plan, so we inquired by email and were lucky enough to be selected. Over the next month, with the help of teachers Jinzhu, Fu Dian and Duanchen, we migrated the original Flink job to PyFlink, learning the characteristics of PyFlink as our needs demanded. This is also our opportunity to share the results of that learning with you.

02 Why we had to go with Flink

Speaking of this, some colleagues will surely ask: why does a small and micro enterprise need stream computing at all?

We were faced with a number of grim facts:

  1. Headcount growth doubled our expenses. It took the company three years to grow the team to 150 people, which is no small feat in a small city like Jiaxing. Moreover, our main business, e-commerce agency operation, resembles project outsourcing in the software industry, and anyone familiar with outsourcing will think of staffing: with projects, the people pay for themselves; without projects, idle labor cost is a loss-making business.
  2. Human efficiency is hard to improve. No matter how strict the KPIs, there are bottlenecks. The first thing colleagues do at work every day is report the previous day’s sales performance. Just this small daily report takes half an hour, and the data timeliness is “T+1”, which lags slightly behind.
  3. During “through train” (paid traffic) promotions, colleagues’ oversights meant that commodities which no longer needed paid promotion, or whose bids could be lowered, kept burning money according to the original plan. Such problems are hard to catch in time with manual monitoring.

As the lead of IT planning, I have always hoped to build on the team’s rich experience in e-commerce operation and its trading ability, so the goal was very clear: to build our own real-time data decision-making platform.

Let’s take “decision-making” apart for a moment: decisions, and strategy. The team has its own experience and judgment logic, which we put on the strategy side. What we lacked was the ability to make decisions, where both accuracy and timeliness matter; of course, it is even better if the strategy can be gradually optimized while making decisions. So we sketched the architecture in Figure 3. From bottom to top are our data source, Swarm, DW, NB and Radical layers: data is collected, stored, computed, presented and applied layer by layer, and Flink plays the key real-time computing role in this data life cycle.

Do you remember the news stories about merchants in e-commerce being bled dry by bargain hunters exploiting pricing mistakes?

At present, no e-commerce ERP has features designed for this. But if we could write a plug-in for real-time monitoring of abnormal sales based on Flink stream computing, taking the actual payment amount in an order, comparing it with the known commodity price, combining that with the latest inventory, and popping up an alarm in time, could such tragedies be avoided?
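To make the idea concrete, the core check might look like the sketch below. This is only an illustration: the field names, thresholds and alert hook are our own assumptions, not an existing ERP API.

# A sketch of the core check behind an abnormal-sale alarm plug-in.
# All field names and thresholds here are hypothetical.
def check_order(order, listed_price, stock_left, alert):
    # order: {'paid': actual payment amount, 'qty': quantity bought}
    unit_paid = order['paid'] / order['qty']
    # actual payment far below the listed price -> possible mispricing
    if unit_paid < listed_price * 0.5:
        alert('unit payment %.2f is far below the listed price %.2f' % (unit_paid, listed_price))
    # inventory draining towards zero -> possible bargain-hunting rush
    if stock_left <= order['qty']:
        alert('inventory nearly exhausted: only %d left' % stock_left)

check_order({'paid': 9.9, 'qty': 1}, listed_price=200.0, stock_left=1, alert=print)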

Of course, there is no limit to the ideas one can dream up for applying real-time computing in e-commerce scenarios. Moreover, if systems like the one above are continuously iterated and optimized, might they eventually replace that labor cost? If so, it would be a brand-new beginning.

In just three years, our small and micro enterprise has not accumulated a huge amount of data, mostly just store operation and order data. Our data collection platform monitors 15 stores in operation at second-level granularity, with more than 60 data monitoring points per store. Only by relying on Flink’s stream computing can we see the results we want as quickly as possible and make the right decisions. The example we share today comes from exactly this context.

Figure 3: Architecture diagram and technology stack (direction of data flow)

03 A small example

Based on our own requirements and the characteristics of Flink, we built a real-time monitoring system on Flink stream computing to watch for abnormal conditions. The following is a small example of real-time monitoring of online commodity prices, which we completed while participating in the PyFlink support plan. We hope it conveys how convenient development with PyFlink is.

Project background

The company has a beauty-products distribution project: alongside the flagship store there are thousands of dealer stores. Business colleagues hope that, through technical means, product prices in dealers’ stores never fall below those of the flagship store, to avoid hurting the flagship store’s sales. We therefore came up with the following approach:

Figure 4: Problem-solving approach

Practice process

Following this approach, we first collected data samples like the following:

{"shop_ Name ":" dealer 1 ",
             "item_ Name: "when it comes to coagulation,"
             "item_url": "https://*****",
             "item_img": "https://*****",
             "item_price": 200.00,
             "discount_ Info ":" ['reduce 20 for every 200, do not cap'] ",
             "item_size": ""},
            {"shop_ Name ":" dealer 2 ",
             "item_ Name: "essential oil 1".
             "item_url": "https://*****",
             "item_img": "https://",
             "item_price": 200.00,
             "discount_ Info ":" ['subtract 15 for every 200, do not cap '] ",
             "item_size": "125ml"}

Then, based on the data sample, we can write a method to register the Kafka source.

# register kafka_source
from pyflink.table import DataTypes
from pyflink.table.descriptors import Kafka, Json, Schema

def register_rides_source_from_kafka(st_env):
    st_env \
        .connect(  # declare the external system to connect to
        Kafka()
            .version("universal")
            .topic("cbj4")
            # .topic("user")
            .start_from_earliest()
            .property("zookeeper.connect", "localhost:2181")
            .property("bootstrap.servers", "localhost:9092")
    ) \
        .with_format(  # declare a format for this system
        Json()
            .fail_on_missing_field(True)
            .schema(DataTypes.ROW([
            DataTypes.FIELD('shop_name', DataTypes.STRING()),
            DataTypes.FIELD('item_name', DataTypes.STRING()),
            DataTypes.FIELD('item_url', DataTypes.STRING()),
            DataTypes.FIELD('item_img', DataTypes.STRING()),
            DataTypes.FIELD('item_price', DataTypes.STRING()),
            DataTypes.FIELD('discount_info', DataTypes.STRING()),
            DataTypes.FIELD('item_size', DataTypes.STRING()),
        ]))) \
        .with_schema(  # declare the schema of the table
        Schema()
            .field("shop_name", DataTypes.STRING())
            .field("item_name", DataTypes.STRING())
            .field("item_url", DataTypes.STRING())
            .field("item_img", DataTypes.STRING())
            .field("item_price", DataTypes.STRING())
            .field('discount_info', DataTypes.STRING())
            .field("item_size", DataTypes.STRING())
    ) \
        .in_append_mode() \
        .register_table_source("KafkaSource")

The CSV file referenced for commodity price control contains sample data like the following:

1, essential oil 1, 125ml, **, *****, body essential oil suitable for sensitive skin, ***, 200, 180
2, essential oil 1, 200ml, **, *****, brightens and whitens; fades scars, ***, 300, 280
3, massage oil 1, 125ml, **, *****, effectively increases skin elasticity, ***, 200, 180
4, massage oil 1, 200ml, **, *****, continuously softens and nourishes the skin, ***, 300, 280
5, massage oil 2, 125ml, **, *****, moisturizes and firms; deeply moisturizes the skin, ***, 300, 280
6, shower gel, 500ml, **, *****, soothes and calms; prevents dry skin, ***, 100, 80
7, time essence, 4x6ml, **, *****, ***, ***, ***, ***
8, essential oil 2, 30ml, **, *****, improves fragile and sensitive dry skin; small-molecule essence penetrates for intensive hydration, ***, 200, 180
9, cleansing gel, 200ml, **, *****, preferred cleanser for acne-prone skin, ***, 100, 80

So we can write a method to register the CSV source.

# register csv_source
from pyflink.table.descriptors import FileSystem, OldCsv, Schema

def register_rides_source_from_csv(st_env):
    # data source file
    source_file = '/demo_job1/price control table.csv'

    # create the data source table
    st_env.connect(FileSystem().path(source_file)) \
        .with_format(OldCsv()
                     .field_delimiter(',')
                     .field('xh', DataTypes.STRING())       # serial number
                     .field('spmc', DataTypes.STRING())     # product name
                     .field('rl', DataTypes.STRING())       # capacity
                     .field('xg', DataTypes.STRING())       # box spec
                     .field('txm', DataTypes.STRING())      # barcode
                     .field('gx', DataTypes.STRING())       # efficacy
                     .field('myfs', DataTypes.STRING())     # trade mode
                     .field('ztxsjg', DataTypes.STRING())   # main-image display price
                     .field('dpzddsj', DataTypes.STRING())  # lowest price for a single bottle
                     ) \
        .with_schema(Schema()
                     .field('xh', DataTypes.STRING())       # serial number
                     .field('spmc', DataTypes.STRING())     # product name
                     .field('rl', DataTypes.STRING())       # capacity
                     .field('xg', DataTypes.STRING())       # box spec
                     .field('txm', DataTypes.STRING())      # barcode
                     .field('gx', DataTypes.STRING())       # efficacy
                     .field('myfs', DataTypes.STRING())     # trade mode
                     .field('ztxsjg', DataTypes.STRING())   # main-image display price
                     .field('dpzddsj', DataTypes.STRING())  # lowest price for a single bottle
                     ) \
        .register_table_source('CsvSource')

And this is the style we want for the exported CSV:

dealer 1, massage oil 1, https://**********, https://**********, 200, [], 125ml, 200.0, 3, massage oil 1, 125ml, **, *****, effectively increases skin elasticity, ***, 200, 180
dealer 2, essential oil 2, https://**********, https://**********, 190, [], 30ml, 190.0, 8, essential oil 2, 30ml, **, *****, improves fragile and sensitive dry skin; small-molecule essence penetrates for intensive hydration, ***, 200, 180
dealer 3, essential oil 2, https://**********, https://**********, 200, [], 30ml, 200.0, 8, essential oil 2, 30ml, **, *****, improves fragile and sensitive dry skin; small-molecule essence penetrates for intensive hydration, ***, 200, 180
dealer 1, essential oil 2, https://**********, https://**********, 200, ['every full 200 minus 20; no cap'], 30ml, 180.0, 8, essential oil 2, 30ml, **, *****, improves fragile and sensitive dry skin; small-molecule essence penetrates for intensive hydration, ***, 200, 180
dealer 1, massage oil 1, https://**, https://**, 200.00, ['every full 200 minus 20; no cap'], 125ml, 180.0, 3, massage oil 1, 125ml, **, *****, effectively increases skin elasticity, ***, 200, 180
dealer 3, massage oil 1, https://**, https://**, 200.00, ['every full 200 minus 20; no cap'], 125ml, 180.0, 3, massage oil 1, 125ml, **, *****, effectively increases skin elasticity, ***, 200, 180
dealer 2, essential oil 1, https://**********, https://**********, 200, ['every full 200 minus 20; no cap'], 125ml, 180.0, 1, essential oil 1, 125ml, **, *****, body essential oil suitable for sensitive skin, ***, 200, 180
dealer 3, essential oil 1, https://**********, https://**********, 200, ['every full 200 minus 20; no cap'], 125ml, 180.0, 1, essential oil 1, 125ml, **, *****, body essential oil suitable for sensitive skin, ***, 200, 180
dealer 1, essential oil 1, https://**********, https://**********, 300, ['every full 200 minus 20; no cap'], 200ml, 280.0, 2, essential oil 1, 200ml, **, *****, brightens and whitens; fades scars, ***, 300, 280
dealer 1, essential oil 1, https://**********, https://**********, 190, ['every full 200 minus 20; no cap'], 125ml, 190.0, 1, essential oil 1, 125ml, **, *****, body essential oil suitable for sensitive skin, ***, 200, 180

Based on this output style, we write a method to register the CSV sink.

# register csv sink
from pyflink.table import CsvTableSink

def register_sink(st_env):
    result_file = "./result.csv"
    sink_field = {
        "shop_name": DataTypes.STRING(),
        "item_name": DataTypes.STRING(),
        "item_url": DataTypes.STRING(),
        "item_img": DataTypes.STRING(),
        "item_price": DataTypes.STRING(),
        "discount_info": DataTypes.STRING(),
        # "discount_info": DataTypes.ARRAY(DataTypes.STRING()),
        "item_size": DataTypes.STRING(),
        "min_price": DataTypes.FLOAT(),
        "xh": DataTypes.STRING(),
        "spmc": DataTypes.STRING(),
        "rl": DataTypes.STRING(),
        "xg": DataTypes.STRING(),
        "txm": DataTypes.STRING(),
        "gx": DataTypes.STRING(),
        "myfs": DataTypes.STRING(),
        "ztxsjg": DataTypes.STRING(),
        "dpzddsj": DataTypes.STRING(),
    }

    st_env.register_table_sink("result_tab",
                               CsvTableSink(list(sink_field.keys()),
                                            list(sink_field.values()),
                                            result_file))

With both input and output in place, we write the computation and judgment logic our business colleagues require: a dealer store’s commodity price must not be lower than the flagship store’s.

When writing the business logic, the operators in the Table API cannot meet all the requirements, so we need custom UDFs to process several of the fields: “match the actual commodity from the commodity name”, “identify the commodity capacity from the commodity name and page price”, “calculate the discounted price on demand”, and “format the coupon information”.

# -*- coding: utf-8 -*-

import re
import logging

from pyflink.table import DataTypes
from pyflink.table.udf import udf, ScalarFunction


# Extend ScalarFunction
class IdentifyGoods(ScalarFunction):
    """Identify the product name and map it to a standard product name"""

    def eval(self, item_name):
        logging.info('enter UDF')
        logging.info(item_name)

        # standard commodity names
        items = ('essential oil 1', 'essential oil 2', 'massage oil 1', 'massage oil 2',
                 'cleansing gel', 'shower gel', 'time essence')
        # character class of every character that appears in a standard name
        # (the original Chinese version listed the name characters literally)
        regexes = re.compile('[%s]' % re.escape(''.join(set(''.join(items)))))
        items_set = []
        for index, value in enumerate(items):
            items_set.append(set(value))  # each title becomes a character set, to ease intersection

        # removing the name characters leaves the extraneous characters ...
        sub_str = re.sub(regexes, '', item_name)
        # ... which are then stripped from the title itself
        spbt = re.sub(repr([sub_str]), '', item_name)  # repr builds a character class without escaping

        # find the best-matching product title, otherwise treat it as an unknown product
        intersection_len = 0
        items_index = None

        for index, value in enumerate(items_set):
            j = value & set(spbt)  # character intersection
            j_len = len(j)
            if j_len > intersection_len:
                intersection_len = j_len
                items_index = index

        item_name = 'unknown item' if items_index is None else items[items_index]

        logging.info(item_name)
        return item_name


identify_goods = udf(IdentifyGoods(), DataTypes.STRING(), DataTypes.STRING())


class IdentifyCapacity(ScalarFunction):
    """Identify the product capacity from the product name and the page price"""

    def eval(self, item_name, price):

        # initialize commodity price and specification
        price = 0 if len(price) == 0 else float(price)
        item_size = ''

        # NOTE: this judgment logic still needs refactoring; known bugs remain
        if float(price) <= float(5):
            logging.info('This is a coupon!!!')
        elif item_name == 'essential oil 1' and 200 < price < 300:
            item_size = '200ml'
        elif item_name == 'essential oil 1' and price <= 200:
            item_size = '125ml'
        elif item_name == 'essential oil 1' and price >= 300:
            item_size = 'essential oil 1 combination'
        elif item_name == 'massage oil 1' and 200 < price <= 300:
            item_size = '200ml'
        elif item_name == 'massage oil 1' and price <= 200:
            item_size = '125ml'
        elif item_name == 'massage oil 1' and price >= 300:
            item_size = 'massage oil 1 combination'
        elif item_name == 'massage oil 2':
            item_size = '125ml'
        elif item_name == 'essential oil 2':
            item_size = '30ml'
        elif item_name == 'cleansing gel':
            item_size = '200ml'
        elif item_name == 'shower gel':
            item_size = '500ml'
        elif item_name == 'time essence':
            item_size = '4x6ml'
        return item_size


identify_capacity = udf(IdentifyCapacity(), [DataTypes.STRING(), DataTypes.STRING()], DataTypes.STRING())


# Named Function
@udf(input_types=[DataTypes.STRING(), DataTypes.STRING()], result_type=DataTypes.FLOAT())
def get_min_price(price, discount_info):
    """Calculate the discounted price on demand"""
    price = 0 if len(price) == 0 else float(price)

    # match all coupons
    coupons = []
    for i in eval(discount_info):
        regular_v1 = re.findall(r"full \d+ minus \d+", i)
        if len(regular_v1) != 0:
            coupons.append(regular_v1[0])

        regular_v2 = re.findall(r"every full \d+ minus \d+", i)
        if len(regular_v2) != 0:
            coupons.append(regular_v2[0])

    # if there is coupon information, calculate the lowest price
    min_price = price
    maybe_price = []
    if len(coupons) > 0:
        regexes_v2 = re.compile(r'\d+')
        for i in coupons:
            a = re.findall(regexes_v2, i)
            cut_price = min(float(a[0]), float(a[1]))   # the discount amount
            flag_price = max(float(a[0]), float(a[1]))  # the spend threshold
            if flag_price <= price:
                maybe_price.append(min_price - cut_price)

    if len(maybe_price) > 0:
        min_price = min(maybe_price)

    return min_price


# Callable Function
class FormatDiscountInfo(object):
    """Format coupon information so it can be written into the output .csv file"""

    def __call__(self, discount_str):
        # commas would break the CSV layout, so replace them with semicolons
        discount_str = str(discount_str).replace(',', ";")
        return discount_str


format_discount_info = udf(f=FormatDiscountInfo(), input_types=DataTypes.STRING(), result_type=DataTypes.STRING())
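Before wiring the UDFs into the job, we found it helpful to sanity-check the coupon logic locally. The sketch below mirrors the body of get_min_price as a plain Python function so it can run outside Flink; the sample strings follow the format of our collected data.

# A plain-Python mirror of the get_min_price logic, for local testing only.
import re

def min_price_plain(price, discount_info):
    price = 0 if len(price) == 0 else float(price)
    coupons = []
    for i in eval(discount_info):
        coupons.extend(re.findall(r"every full \d+ minus \d+", i))
    min_price = price
    maybe_price = []
    for c in coupons:
        a = re.findall(r'\d+', c)
        cut_price = min(float(a[0]), float(a[1]))   # the discount amount
        flag_price = max(float(a[0]), float(a[1]))  # the spend threshold
        if flag_price <= price:
            maybe_price.append(min_price - cut_price)
    if maybe_price:
        min_price = min(maybe_price)
    return min_price

print(min_price_plain('200.00', "['every full 200 minus 20, no cap']"))  # 180.0
print(min_price_plain('190.00', "['every full 200 minus 20, no cap']"))  # 190.0 (threshold not reached)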

Finally, we write the main method of the computing job.

# query
def calculate_func(st_env):
    # the left table is the collected data table
    left = st_env.from_path("KafkaSource") \
        .select("shop_name, "
                "identify_goods(item_name) as item_name, "  # map to the standard name in the basic information table
                "item_url, "
                "item_img, "
                "item_price, "
                "discount_info, "
                "item_size"
                ) \
        .select("shop_name, "
                "item_name, "
                "item_url, "
                "item_img, "
                "item_price, "
                "discount_info, "
                "identify_capacity(item_name, item_price) as item_size, "  # capacity from name and page price
                "get_min_price(item_price, discount_info) as min_price"    # lowest price from page price and discount info
                ) \
        .select("shop_name, "
                "item_name, "
                "item_url, "
                "item_img, "
                "item_price, "
                "format_discount_info(discount_info) as discount_info, "  # format the discount info for the .csv file
                "item_size, "
                "min_price"
                )

    # the right table is the basic information table
    right = st_env.from_path("CsvSource") \
        .select("xh, spmc, rl, xg, txm, gx, myfs, ztxsjg, dpzddsj")

    # join the two tables on product name and product capacity
    result = left.join(right).where("item_name = spmc && item_size = rl")

    # write the join result to the CSV sink
    result.insert_into("result_tab")

# main function
from pyflink.datastream import StreamExecutionEnvironment, TimeCharacteristic
from pyflink.table import StreamTableEnvironment, EnvironmentSettings

def get_price_demo():
    # init env
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)  # set the parallelism
    env.set_stream_time_characteristic(TimeCharacteristic.EventTime)
    st_env = StreamTableEnvironment.create(
        env, environment_settings=EnvironmentSettings.new_instance().use_blink_planner().build())

    # register source
    register_rides_source_from_csv(st_env)
    register_rides_source_from_kafka(st_env)

    # register sink
    register_sink(st_env)

    # register function
    st_env.register_function("identify_goods", identify_goods)
    st_env.register_function("identify_capacity", identify_capacity)
    st_env.register_function("get_min_price", get_min_price)
    st_env.register_function("format_discount_info", format_discount_info)

    # query
    calculate_func(st_env)

    # execute
    Print ("submit job")
    st_env.execute("item_got_price")

04 Summary

After the project went live, with data flowing in continuously from the collection side and the help of stream computing, we identified 200 product links suspected of violating our pricing rules. While safeguarding brand power and pricing power, we avoided more than 400,000 yuan of lost sales for the flagship store; and this is just one of our many monitoring-and-control Flink jobs.

In line with today’s idea of “keep the enterprise small and the market big”, using IT to replace manual work is the quickest path. Even if a small and micro enterprise like ours is not technology-led the way big companies are, as long as the output can be used by business colleagues, improves work efficiency, and is valued by the company’s senior management, then our work is meaningful and sustainable.

If you, like us, work mainly in Python, your development is mostly data analysis and real-time decision-making, and you are eager to enjoy the accuracy, efficiency and convenience brought by stream computing, then welcome to the PyFlink ecosystem, and let’s contribute to its tomorrow together. Meanwhile, Flink 1.11 is expected to be released in mid-to-late June, when PyFlink will bring Pandas support.

Finally, thanks again to everyone who helped us in the support plan! All in all: PyFlink, you deserve it.


If you are also interested in the PyFlink community support plan, you can fill out the questionnaire below and build the PyFlink ecosystem with us.

PyFlink community support plan:

https://survey.aliyun.com/app…