I Fundamentals of Machine Learning and Feature Engineering

Time: 2022-01-03

1. Fundamentals of Machine Learning

1.1 Fundamentals of Mathematics

Required mathematical background:
Advanced mathematics (calculus), linear algebra, and probability and statistics.

Of course, you don't have to go deep at the beginning; you can build up the math gradually as you learn.

1.2 Programming Language

The most popular language in artificial intelligence is naturally Python: it has a low barrier to entry and is the preferred language for machine learning.
If you have the energy, learning some C/C++ as well is no bad thing.

1.3 Development Process of Machine Learning

1) Obtain raw data from the company's database or from a crawler.
The data we get can be roughly divided into two types: discrete data and continuous data.
2) Process the raw data, e.g. with Python's pandas library.
pandas reference documentation: https://pandas.pydata.org/pandas-docs/stable/
3) Feature processing: a very important step that directly affects the model's predictions.
4) Select an appropriate algorithm and build the model.
5) Evaluate whether the model meets expectations; if the results are not ideal, re-select the algorithm or re-process the data's features.
6) Once the evaluation passes, put the model into use.

A minimal end-to-end sketch of this workflow follows.
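The sketch below walks through steps 2)–6) on a hypothetical data.csv with a label column (both names are placeholders, not from the original article); it uses pandas for loading and scikit-learn for feature processing, training and evaluation.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# step 2): load the raw data with pandas ('data.csv' / 'label' are hypothetical)
df = pd.read_csv('data.csv')
X, y = df.drop(columns=['label']), df['label']

# hold out a test set so step 5) evaluates on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# step 3): feature processing - fit the scaler on the training data only
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

# step 4): build the model
model = LogisticRegression()
model.fit(X_train, y_train)

# step 5): evaluate; if not ideal, go back to steps 3)/4)
print(accuracy_score(y_test, model.predict(X_test)))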

2. The scikit-learn Library

scikit-learn is a machine learning toolkit based on the Python language:
  • Simple and efficient tools for data mining and data analysis
  • Reusable in various contexts
  • Built on NumPy, SciPy and Matplotlib
  • Open source, commercially usable – BSD license

Working through this library will gradually lift the veil on machine learning.

2.1 Feature Engineering

2.1.1 Feature Extraction

The scikit-learn library provides feature extraction APIs in the sklearn.feature_extraction module.

  • Dictionary feature extraction: sklearn.feature_extraction.DictVectorizer
from sklearn.feature_extraction import DictVectorizer

def dict_ex():
    """
    Dictionary feature extraction
    :return:
    """
    list_demo = [{'city': 'United States', 'new_add': 24500},
                 {'city': 'Russia', 'new_add': 4000},
                 {'city': 'UK', 'new_add': 5600}]

    dict_ins = DictVectorizer(sparse=False)  # sparse=False returns a dense ndarray

    data = dict_ins.fit_transform(list_demo)

    print(dict_ins.get_feature_names())  # feature names (get_feature_names_out in newer scikit-learn)
    print(data.astype('int'))  # the one-hot encoded matrix

if __name__ == '__main__':
    dict_ex()

result

['city=Russia', 'city=UK', 'city=United States', 'new_add']
[[    0     0     1 24500]
 [    1     0     0  4000]
 [    0     1     0  5600]]
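By default (sparse=True) DictVectorizer instead returns a scipy sparse matrix; whichever form you use, the encoding can be mapped back to dictionaries. A small sketch reusing dict_ins and data from above:

# inverse_transform recovers a dict per encoded row (non-zero entries only)
print(dict_ins.inverse_transform(data))
# e.g. [{'city=United States': 1.0, 'new_add': 24500.0}, ...]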
  • Text feature extraction: sklearn.feature_extraction.text.CountVectorizer
    This only works directly on English-like text; for Chinese, run a word segmentation tool first (see the jieba example below).
from sklearn.feature_extraction.text import CountVectorizer

def text_ex():
    """
    Text feature extraction
    :return:
    """
    text_demo = ["""
        Beautiful is better than ugly.
        Explicit is better than implicit.
        Simple is better than complex.
        Complex is better than complicated.
        Flat is better than nested.
        Sparse is better than dense.
        Readability counts.
        Special cases aren't special enough to break the rules.
        Although practicality beats purity.
        Errors should never pass silently.
        Unless explicitly silenced.
        In the face of ambiguity, refuse the temptation to guess.
        There should be one-- and preferably only one --obvious way to do it.
        Although that way may not be obvious at first unless you're Dutch.
        Now is better than never.
        Although never is often better than *right* now.
        If the implementation is hard to explain, it's a bad idea.
        If the implementation is easy to explain, it may be a good idea.
        Namespaces are one honking great idea -- let's do more of those!
    """]

    cv = CountVectorizer()
    data = cv.fit_transform(text_demo)
    print(cv.get_feature_names())  # the learned vocabulary
    print(data)  # a scipy sparse matrix by default
    print(data.toarray())  # convert to a dense array
  
if __name__ == '__main__':
    text_ex()

result

['although', 'ambiguity', 'and', 'are', 'aren', 'at', 'bad', 'be', 'beats', 'beautiful', 'better', 'break', 'cases', 'complex', 'complicated', 'counts', 'dense', 'do', 'dutch', 'easy', 'enough', 'errors', 'explain', 'explicit', 'explicitly', 'face', 'first', 'flat', 'good', 'great', 'guess', 'hard', 'honking', 'idea', 'if', 'implementation', 'implicit', 'in', 'is', 'it', 'let', 'may', 'more', 'namespaces', 'nested', 'never', 'not', 'now', 'obvious', 'of', 'often', 'one', 'only', 'pass', 'practicality', 'preferably', 'purity', 're', 'readability', 'refuse', 'right', 'rules', 'should', 'silenced', 'silently', 'simple', 'sparse', 'special', 'temptation', 'than', 'that', 'the', 'there', 'those', 'to', 'ugly', 'unless', 'way', 'you']
  (0, 9)    1
  (0, 38)    10
  (0, 10)    8
  :    :
  (0, 40)    1
  (0, 42)    1
  (0, 73)    1
[[ 3  1  1  1  1  1  1  3  1  1  8  1  1  2  1  1  1  2  1  1  1  1  2  1
   1  1  1  1  1  1  1  1  1  3  2  2  1  1 10  3  1  2  1  1  1  3  1  2
   2  2  1  3  1  1  1  1  1  1  1  1  1  1  2  1  1  1  1  2  1  8  1  5
   1  1  5  1  2  2  1]]
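Two details worth knowing: CountVectorizer lowercases the text, and its default token_pattern keeps only tokens of two or more word characters (a standalone 'a' is dropped). The fitted vocabulary_ attribute maps each word to its column index, which is how to read the matrices above. A small sketch reusing cv and data:

# vocabulary_ maps each token to its column in the count matrix
idx = cv.vocabulary_['better']
print(idx)                     # 10, matching the feature-name list above
print(data.toarray()[0][idx])  # 8, the count of 'better'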

Feature extraction example for Chinese text

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def zh_exc():
    """
    Chinese eigenvalue
    :return:
    """
    text_demo1 = """
     I haven't seen my father for more than two years. What I can't forget is his back. That winter, my grandmother died and my father gave up his job,
    It was a day when misfortunes never come singly. I went from Beijing to Xuzhou and planned to go home with my father. When I went to Xuzhou, I saw my father and the mess in the yard,
    When I think of my grandmother again, I can't help crying. The father said, "it's already so. Don't be sad. Fortunately, there's no way for people!
    """
    text_demo2 = """
    Along the lotus pond, there is a winding small coal road. This is a secluded road; Few people walk during the day, and the night is more lonely. There are many trees around the lotus pond,
    Lush. On one side of the road are some willows and some trees whose names are unknown. On a moonless night, the road is gloomy and scary.
    Tonight is very good, although the moonlight is still faint.
    """

    c1 = jieba.cut(text_demo1)  # jieba segments the text into words
    c2 = jieba.cut(text_demo2)

    # join the segmented words into space-separated strings, as CountVectorizer expects
    text1 = ' '.join(list(c1))
    text2 = ' '.join(list(c2))

    cv = CountVectorizer()
    data = cv.fit_transform([text1, text2])
    print(cv.get_feature_names())
    print(data.toarray())


if __name__ == '__main__':
    zh_exc()

result (the feature names below are English glosses of the segmented Chinese words)

['some', 'one side', 'one path', "don't have to", "can't help", "can't", 'things', 'things have been like this', 'two years', 'hand over', 'tonight', 'winter', 'Beijing', 'name', 'four sides', 'home', 'night', 'heaven has no unique way', 'running for mourning', 'lonely', 'few people', 'job', 'secluded', 'Xuzhou', 'forget', 'fear of others', 'think of', 'plan', 'day', 'night', 'twists and turns', 'more', 'moonlight', 'some', 'willows', 'exactly', 'no', 'along', 'down', 'light', 'full yard', 'coal dust', 'father', 'mess', 'daytime', 'meet', 'see', 'tears', 'know', 'grandmother', 'misfortune never comes singly', 'rustle', 'back', 'lotus pond', 'lush', 'although', 'many', 'follow', 'road', 'still', 'this is', 'that year', 'grow', 'gloomy', 'sad']
[[0 0 0 1 1 1 1 1 1 1 0 1 1 0 0 1 0 1 1 0 0 1 0 2 1 0 1 1 1 0 0 0 0 0 0 1
  0 0 1 0 1 0 5 1 0 1 1 1 0 2 1 1 1 0 0 0 0 1 0 0 0 1 0 0 1]
 [1 1 2 0 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1 1 0 1 0 0 1 0 0 0 1 1 1 2 1 1 0
  1 1 0 1 0 1 0 0 1 0 0 0 1 0 0 0 0 2 1 1 1 0 1 1 1 0 1 1 0]]
  • TF-IDF: its core idea is to evaluate how important a word is to one document within a collection
    Corresponding API: sklearn.feature_extraction.text.TfidfVectorizer
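For reference, tf-idf multiplies a term's frequency in a document by an inverse document frequency; scikit-learn's default (smooth_idf=True) computes

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$$

where n is the number of documents, tf(t, d) is the count of term t in document d, and df(t) is the number of documents containing t; each row is then L2-normalized.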
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba


def tfidf_ext():
    text_demo1 = """
         I haven't seen my father for more than two years. What I can't forget is his back. That winter, my grandmother died and my father gave up his job,
        It was a day when misfortunes never come singly. I went from Beijing to Xuzhou and planned to go home with my father. When I went to Xuzhou, I saw my father and the mess in the yard,
        When I think of my grandmother again, I can't help crying. The father said, "it's already so. Don't be sad. Fortunately, there's no way for people!
        """
    text_demo2 = """
        Along the lotus pond, there is a winding small coal road. This is a secluded road; Few people walk during the day, and the night is more lonely. There are many trees around the lotus pond,
        Lush. On one side of the road are some willows and some trees whose names are unknown. On a moonless night, the road is gloomy and scary.
        Tonight is very good, although the moonlight is still faint.
        """

    c1 = jieba.cut(text_demo1)
    c2 = jieba.cut(text_demo2)

    # join the segmented words into space-separated strings
    text1 = ' '.join(list(c1))
    text2 = ' '.join(list(c2))

    tf = TfidfVectorizer()
    data = tf.fit_transform([text1, text2])
    print(tf.get_feature_names())  # same vocabulary as with CountVectorizer
    print(data.toarray())  # tf-idf weights instead of raw counts


if __name__ == '__main__':
    tfidf_ext()

result (note how 'father', which appears five times and only in the first text, receives the largest weight, 0.6299)

['some', 'one side', 'one path', "don't have to", "can't help", "can't", 'things', 'things have been like this', 'two years', 'hand over', 'tonight', 'winter', 'Beijing', 'name', 'four sides', 'home', 'night', 'heaven has no unique way', 'running for mourning', 'lonely', 'few people', 'job', 'secluded', 'Xuzhou', 'forget', 'fear of others', 'think of', 'plan', 'day', 'night', 'twists and turns', 'more', 'moonlight', 'some', 'willows', 'exactly', 'no', 'along', 'down', 'light', 'full yard', 'coal dust', 'father', 'mess', 'daytime', 'meet', 'see', 'tears', 'know', 'grandmother', 'misfortune never comes singly', 'rustle', 'back', 'lotus pond', 'lush', 'although', 'many', 'follow', 'road', 'still', 'this is', 'that year', 'grow', 'gloomy', 'sad']
[[0.         0.         0.         0.12598816 0.12598816 0.12598816
  0.12598816 0.12598816 0.12598816 0.12598816 0.         0.12598816
  0.12598816 0.         0.         0.12598816 0.         0.12598816
  0.12598816 0.         0.         0.12598816 0.         0.25197632
  0.12598816 0.         0.12598816 0.12598816 0.12598816 0.
  0.         0.         0.         0.         0.         0.12598816
  0.         0.         0.12598816 0.         0.12598816 0.
  0.62994079 0.12598816 0.         0.12598816 0.12598816 0.12598816
  0.         0.25197632 0.12598816 0.12598816 0.12598816 0.
  0.         0.         0.         0.12598816 0.         0.
  0.         0.12598816 0.         0.         0.12598816]
 [0.15617376 0.15617376 0.31234752 0.         0.         0.
  0.         0.         0.         0.         0.15617376 0.
  0.         0.15617376 0.15617376 0.         0.15617376 0.
  0.         0.15617376 0.15617376 0.         0.15617376 0.
  0.         0.15617376 0.         0.         0.         0.15617376
  0.15617376 0.15617376 0.31234752 0.15617376 0.15617376 0.
  0.15617376 0.15617376 0.         0.15617376 0.         0.15617376
  0.         0.         0.15617376 0.         0.         0.
  0.15617376 0.         0.         0.         0.         0.31234752
  0.15617376 0.15617376 0.15617376 0.         0.15617376 0.15617376
  0.15617376 0.         0.15617376 0.15617376 0.        ]]

2.1.2 Feature Processing

What is feature processing? It means converting the data, through specific mathematical transformations, into the form our algorithm requires.

  • Normalization
    The min-max normalization formula is:

    $$X' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

    The mean normalization formula is:

    $$X' = \frac{x - \mathrm{mean}(x)}{\max(x) - \min(x)}$$

    where mean(x), min(x) and max(x) are the mean, minimum and maximum of the sample data respectively.
    Then compute X'':

$$X'' = X' \cdot (mx - mi) + mi$$

where mi and mx bound the target interval [mi, mx]; usually mx = 1 and mi = 0.

Purpose of normalization: map the data into a specified interval, usually [0, 1]. With the default interval the second step leaves X' unchanged, so the X'' calculation can be omitted.

The sklearn library provides a normalization API: sklearn.preprocessing.MinMaxScaler
Code example:

from sklearn.preprocessing import MinMaxScaler

def mm_ex():
    """
    Normalization example
    :return:
    """
    test_dict = [[99, 1, 18, 1002],
                 [88, 4, 18, 1400],
                 [89, 4, 25, 1201],
                 [97, 2, 19, 2800]]
    mm = MinMaxScaler()
    data = mm.fit_transform(test_dict)
    print(data)
    
if __name__ == '__main__':
    mm_ex()

result

[[1.         0.         0.         0.        ]
 [0.         1.         0.         0.22135706]
 [0.09090909 1.         1.         0.11067853]
 [0.81818182 0.33333333 0.14285714 1.        ]]
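The target interval [mi, mx] from the formula above is exposed as MinMaxScaler's feature_range parameter; a small sketch mapping the same data into [2, 3]:

from sklearn.preprocessing import MinMaxScaler

# feature_range sets [mi, mx]; the default is (0, 1)
mm = MinMaxScaler(feature_range=(2, 3))
data = mm.fit_transform([[99, 1, 18, 1002],
                         [88, 4, 18, 1400],
                         [89, 4, 25, 1201],
                         [97, 2, 19, 2800]])
print(data)  # the same pattern as before, shifted and scaled into [2, 3]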

The result of normalization depends on each column's maximum and minimum, so it is easily skewed by outliers, e.g.

    test_dict = [[99, 1, 18, 200000],
                 [88, 4, 18, 1400],
                 [89, 4, 25, 1201],
                 [97, 2, 19, 2800]]

The outlier 200000 dominates the last column: it maps to 1 while the other three values are squeezed into roughly [0, 0.008], so the differences between them are almost erased.

  • Standardization
    Formula:

    $$X' = \frac{x - \mu}{\sigma}$$

    where μ is the mean and σ is the standard deviation.

sklearn provides a standardization API: sklearn.preprocessing.StandardScaler
Code example

from sklearn.preprocessing import StandardScaler

def standard_ex():
    """
    Standardized API example
    :return:
    """
    test_dict = [[99, 1, 18, 1002],
                 [88, 4, 18, 1400],
                 [89, 4, 25, 1201],
                 [97, 2, 19, 2800]]
    
    ss = StandardScaler()
    data = ss.fit_transform(test_dict)  # each column now has mean 0 and variance 1
    print(data)


if __name__ == '__main__':
    standard_ex()

result

[[ 1.1941005  -1.34715063 -0.68599434 -0.84743801]
 [-1.09026568  0.96225045 -0.68599434 -0.28413057]
 [-0.88259602  0.96225045  1.71498585 -0.56578429]
 [ 0.7787612  -0.57735027 -0.34299717  1.69735287]]
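As a sanity check (a sketch reusing ss and data from above), the fitted scaler exposes the per-column statistics it learned, and the transformed columns should come out with mean 0 and standard deviation 1:

print(ss.mean_)           # per-column means learned during fit
print(data.mean(axis=0))  # ~0 for every column
print(data.std(axis=0))   # ~1 for every column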
