Some lesser-known Python libraries that are nonetheless really powerful



The following article is from Tencent Cloud, by Yu Liang.

In this article, we introduce some Python libraries for data science that are not as well known as pandas, scikit-learn and Matplotlib, but are just as practical. Feel free to share your own favourites in the comments.


Extracting data, especially from the web, is one of a data scientist's main tasks. Wget is a free utility for non-interactive download of files from the web. It supports the HTTP, HTTPS and FTP protocols, as well as retrieval through HTTP proxies. Because it is non-interactive, it can keep working in the background even when the user is not logged in. So the next time you need to download all the images from a website or page, Wget can help.


$ pip install wget


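import wget

url = "<URL of the file to download>"  # placeholder: the URL is missing from this copy
filename = wget.download(url)
# 100% [................................................] 3841532 / 3841532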
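For comparison, a similar non-interactive download can be sketched with the standard library alone. This is a rough equivalent under stated assumptions, not the wget package's implementation: `download` is a hypothetical helper name, and the fallback file name `download.bin` is an arbitrary choice.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def download(url, dest_dir="."):
    """Download url into dest_dir and return the path of the saved file.

    Hypothetical helper for illustration; not part of the wget package.
    """
    # Use the last path component of the URL as the local file name.
    name = os.path.basename(urlparse(url).path) or "download.bin"
    dest = os.path.join(dest_dir, name)
    # Stream the response to disk without any user interaction.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling `download(url)` with any reachable URL saves the file into the current directory and returns its path, much as `wget.download(url)` returns the file name.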




If you have ever been frustrated by date and time handling in Python, Pendulum is for you. It is a Python package that simplifies datetime operations, and it is a drop-in replacement for Python's native datetime class.


$ pip install pendulum



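import pendulum

# Midnight on the same calendar date in two different time zones
dt_toronto = pendulum.datetime(2012, 1, 1, tz="America/Toronto")
dt_vancouver = pendulum.datetime(2012, 1, 1, tz="America/Vancouver")

# Difference between the two instants, in whole hours
print(dt_vancouver.diff(dt_toronto).in_hours())  # 3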





Most classification algorithms work best when each class has roughly the same number of samples. In practice, however, most data sets are imbalanced, which can affect both the learning phase and the subsequent predictions of a machine learning algorithm. Fortunately, the imbalanced-learn library was created to address this problem. It is compatible with scikit-learn and is part of the scikit-learn-contrib project. Next time you encounter an imbalanced data set, give it a try.


pip install -U imbalanced-learn

# or

conda install -c conda-forge imbalanced-learn
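One of the simplest ways to rebalance a data set is random oversampling: minority-class samples are duplicated at random until every class is as large as the biggest one. The sketch below illustrates the idea in plain Python. `random_oversample` is a hypothetical helper written for this example and is not imbalanced-learn's API; the library itself provides more complete resamplers.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes
    are as large as the largest class (illustrative helper only)."""
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is grown to the size of the largest one.
    target = max(len(group) for group in by_class.values())

    new_samples, new_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            new_samples.append(sample)
            new_labels.append(label)
    return new_samples, new_labels


# Toy data: five samples of class "a", one of class "b".
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.9]]
y = ["a", "a", "a", "a", "a", "b"]

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))  # both classes now have 5 samples
```

Plain duplication can encourage overfitting to the repeated minority samples, which is one reason to look at the other resampling strategies a dedicated library offers.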



In natural language processing (NLP) tasks, cleaning text data often means replacing keywords in sentences or extracting keywords from them. Such operations can usually be done with regular expressions, but they become cumbersome once the number of terms to search for runs into the thousands.

Python's flashtext module, which is based on the FlashText algorithm, provides a suitable alternative for this situation. The best thing about flashtext is that its running time does not depend on the number of search terms.


$ pip install flashtext



1) Extracting keywords

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()

# keyword_processor.add_keyword(<unclean name>, <standardised name>)

keyword_processor.add_keyword("Big Apple", "New York")
keyword_processor.add_keyword("Bay Area")
keywords_found = keyword_processor.extract_keywords("I love Big Apple and Bay Area.")

keywords_found
# ['New York', 'Bay Area']


2) Replacing keywords

keyword_processor.add_keyword("New Delhi", "NCR region")

new_sentence = keyword_processor.replace_keywords("I love Big Apple and new delhi.")

new_sentence
# 'I love New York and NCR region.'
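To see why the running time need not grow with the number of keywords, here is a simplified word-level sketch of the underlying idea: the keywords are stored in a trie, and the sentence is scanned once, following trie branches as it goes. This is an illustration only, not flashtext's implementation (the real library works at the character level); `build_trie` and `extract_keywords` are hypothetical names.

```python
import re

# Sentinel key marking the end of a keyword; it can never equal a word token.
_END = object()


def build_trie(keywords):
    """Build a word-level trie from a {keyword: clean_name} mapping."""
    trie = {}
    for keyword, clean_name in keywords.items():
        node = trie
        for word in keyword.lower().split():
            node = node.setdefault(word, {})
        node[_END] = clean_name
    return trie


def extract_keywords(trie, sentence):
    """Scan the sentence once and return the clean names of the longest matches."""
    words = re.findall(r"\w+", sentence.lower())
    found = []
    i = 0
    while i < len(words):
        node, j = trie, i
        match, match_end = None, i
        # Follow the trie for as long as the next word continues a keyword.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if _END in node:
                match, match_end = node[_END], j
        if match is not None:
            found.append(match)
            i = match_end  # skip past the matched keyword
        else:
            i += 1
    return found


trie = build_trie({"Big Apple": "New York", "Bay Area": "Bay Area"})
result = extract_keywords(trie, "I love Big Apple and Bay Area.")
print(result)  # ['New York', 'Bay Area']
```

Each word of the sentence is looked up in a dictionary rather than compared against every keyword, so adding more keywords enlarges the trie but does not add passes over the text.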



The name sounds strange, but fuzzywuzzy is a very useful library for string matching. It makes operations such as computing string similarity ratios easy to implement. It is also handy for matching records that are stored in different databases.


$ pip install fuzzywuzzy



from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Simple Ratio

fuzz.ratio("this is a test", "this is a test!")

# Partial Ratio
fuzz.partial_ratio("this is a test", "this is a test!")
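As a rough illustration of what these two scores measure, the standard library's difflib can compute comparable values. The helpers `simple_ratio` and `partial_ratio` below are hypothetical and only approximate the idea; fuzzywuzzy's own implementation differs in detail.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity of the two whole strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long
    substring of the longer one (brute-force sliding window)."""
    shorter, longer = sorted((a, b), key=len)
    window = len(shorter)
    best = 0
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, simple_ratio(shorter, candidate))
    return best


print(simple_ratio("this is a test", "this is a test!"))   # 97
print(partial_ratio("this is a test", "this is a test!"))  # 100
```

The trailing "!" lowers the whole-string score slightly, while the partial score stays at 100 because the shorter string appears verbatim inside the longer one.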



Time series analysis is one of the most frequently encountered problems in machine learning. PyFlux is an open source Python library built specifically for working with time series. It offers a good range of modern time series models, including ARIMA, GARCH and VAR models. In short, PyFlux takes a probabilistic approach to time series modeling.


pip install pyflux
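To give a feel for the kind of model involved, the sketch below simulates and fits an AR(1) process, the simplest member of the ARIMA family, using only the standard library. It is a toy least-squares point estimate, not PyFlux's API and not a probabilistic fit; `simulate_ar1` and `fit_ar1` are hypothetical helper names, and the coefficient 0.7 is an arbitrary choice.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Simulate x_t = phi * x_{t-1} + noise, with standard normal noise."""
    rng = random.Random(seed)
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0.0, 1.0))
    return series


def fit_ar1(series):
    """Least-squares estimate of phi from consecutive pairs of values."""
    num = sum(prev * cur for prev, cur in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den


series = simulate_ar1(phi=0.7, n=2000)
phi_hat = fit_ar1(series)
print(round(phi_hat, 2))  # should land close to the true value 0.7
```

A probabilistic library goes further than this single number, for example by reporting the uncertainty around each estimated parameter.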



Communicating results is a very important part of data science, and being able to visualize them is a huge advantage. Ipyvolume is a Python library for visualizing 3D volumes and glyphs (such as 3D scatter plots) in Jupyter notebooks with minimal configuration.


Using pip
$ pip install ipyvolume

Using conda
$ conda install -c conda-forge ipyvolume




Dash is a productive Python framework for building web applications. It is built on top of Flask, Plotly.js and React.js. Instead of writing JavaScript, you tie UI elements such as drop-down menus and graphs directly to your Python analysis code. Dash is well suited to building data visualization applications that are rendered in the web browser.


pip install dash==0.29.0  # The core dash backend
pip install dash-html-components==0.13.2  # HTML components
pip install dash-core-components==0.36.0  # Supercharged components
pip install dash-table==3.1.3  # Interactive DataTable component (new!)



The following example shows a highly interactive graph with drop-down functionality. When the user selects a value from the drop-down menu, the application code dynamically pulls the data from Google Finance into a pandas DataFrame.


Gym is a toolkit for developing and comparing reinforcement learning algorithms. It is compatible with any numerical computation library, such as TensorFlow or Theano. It provides a collection of test problems, also called environments, that you can use to try out your reinforcement learning algorithms. These environments share a common interface, which allows users to write general algorithms.


pip install gym
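The value of a shared interface can be sketched without Gym itself. The toy environment below follows the classic reset/step convention, in which `step` returns an observation, a reward, a done flag and an info dictionary. `CoinFlipEnv` and `run_random_agent` are hypothetical names invented for this example; the environment is a stand-in and is not one of Gym's, and nothing is rendered.

```python
import random


class CoinFlipEnv:
    """Toy stand-in for an environment (hypothetical, not part of Gym)."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.steps = 0
        return self.rng.randint(0, 1)

    def step(self, action):
        """Apply an action and return (observation, reward, done, info)."""
        self.steps += 1
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # reward for guessing the coin
        done = self.steps >= self.episode_length
        return coin, reward, done, {}


def run_random_agent(env, total_steps=1000, seed=1):
    """Take random actions for total_steps, resetting whenever an episode ends."""
    rng = random.Random(seed)
    total_reward, episodes = 0.0, 0
    env.reset()
    for _ in range(total_steps):
        _, reward, done, _ = env.step(rng.randint(0, 1))
        total_reward += reward
        if done:
            episodes += 1
            env.reset()
    return total_reward, episodes


total_reward, episodes = run_random_agent(CoinFlipEnv())
print(total_reward, episodes)
```

Because `run_random_agent` relies only on `reset` and `step`, it would work unchanged with any other environment that exposes the same two methods, which is the point of a common interface.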



The following example runs an instance of the CartPole-v0 environment for 1000 timesteps, rendering the environment at each step.