The following article is from Tencent Cloud, by Yu Liang.
In this article, we introduce some Python libraries for data science that are not as well known as pandas, scikit-learn, and Matplotlib, but are just as practical.
Extracting data, especially from the Internet, is one of the main tasks of a data scientist. Wget is a free utility for non-interactive file downloads from the web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Because it is non-interactive, it can run in the background even when the user is not logged in. So if you need to download all the pictures from a website or page, wget can help.
$ pip install wget
import wget

url = "http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3"
filename = wget.download(url)
# 100% [................................................] 3841532 / 3841532

print(filename)
# razorback.mp3
If working with dates and times in Python still frustrates you, try Pendulum. It is a Python package that simplifies datetime operations, and it is a drop-in replacement for Python's native datetime class.
$ pip install pendulum
import pendulum

dt_toronto = pendulum.datetime(2012, 1, 1, tz="America/Toronto")
dt_vancouver = pendulum.datetime(2012, 1, 1, tz="America/Vancouver")

print(dt_vancouver.diff(dt_toronto).in_hours())
# 3
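For comparison, here is a rough standard-library version of the same calculation. It is only a sketch: the fixed offsets UTC-5 (Toronto) and UTC-8 (Vancouver) are assumptions that hold on 1 January, when neither city observes daylight saving time, whereas Pendulum resolves the named zones for you.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed winter offsets; illustrative stand-ins for the named zones
# "America/Toronto" and "America/Vancouver".
TORONTO_WINTER = timezone(timedelta(hours=-5))
VANCOUVER_WINTER = timezone(timedelta(hours=-8))

dt_toronto = datetime(2012, 1, 1, tzinfo=TORONTO_WINTER)
dt_vancouver = datetime(2012, 1, 1, tzinfo=VANCOUVER_WINTER)

# Subtracting two aware datetimes yields a timedelta; the conversion to
# hours has to be done by hand.
hours_apart = (dt_vancouver - dt_toronto).total_seconds() / 3600
print(hours_apart)
```

The manual offset bookkeeping and unit conversion are the kind of detail that Pendulum's `diff(...).in_hours()` hides.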
Most classification algorithms work best when each class has roughly the same number of samples. In practice, however, most data sets are imbalanced, which can affect both the learning phase and the subsequent predictions of a machine learning algorithm. Fortunately, the imbalanced-learn library was created to address this problem. It is compatible with scikit-learn and is part of the scikit-learn-contrib project. Next time you encounter an imbalanced data set, don't forget it.
$ pip install -U imbalanced-learn
# or
$ conda install -c conda-forge imbalanced-learn
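To make the idea of rebalancing concrete, the sketch below implements plain random oversampling with the standard library only: minority-class samples are duplicated at random until every class is as large as the biggest one. It illustrates the general technique and is not imbalanced-learn's implementation; the function name `random_oversample` and the toy data are made up for this example.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes are the same size."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Keep the originals, then draw the missing copies with replacement.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: 6 samples of class "a" and 2 samples of class "b".
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [2.0], [2.1]]
y = ["a"] * 6 + ["b"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y), "->", Counter(y_res))
```

imbalanced-learn packages this strategy, along with more refined ones such as SMOTE, behind scikit-learn-style sampler classes, so in real projects the library is the better choice.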
In natural language processing (NLP) tasks, cleaning up text data often requires replacing keywords in sentences or extracting keywords from them. Such operations can normally be done with regular expressions, but once the search vocabulary reaches thousands of terms, that approach becomes very cumbersome.
Python's flashtext module is based on the FlashText algorithm, which offers a suitable alternative for this situation. The best thing about FlashText is that its running time stays the same regardless of the number of search terms.
$ pip install flashtext
1) Keyword extraction
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor()
# keyword_processor.add_keyword(<unclean name>, <standardised name>)
keyword_processor.add_keyword("Big Apple", "New York")
keyword_processor.add_keyword("Bay Area")
keywords_found = keyword_processor.extract_keywords("I love Big Apple and Bay Area.")
print(keywords_found)
# ['New York', 'Bay Area']
2) Keyword replacement
keyword_processor.add_keyword("New Delhi", "NCR region")
new_sentence = keyword_processor.replace_keywords("I love Big Apple and new delhi.")
print(new_sentence)
# I love New York and NCR region.
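Why does the number of keywords not matter? FlashText stores its keywords in a trie and scans the sentence once. The toy class below is a simplified sketch of that idea, not FlashText's actual implementation; the class name `TinyKeywordMatcher` and its internals are hypothetical and only mirror the two calls used above.

```python
class TinyKeywordMatcher:
    """Toy trie-based keyword extractor (case-insensitive, longest match)."""

    _END = "_end_"  # marker key holding the clean name at a keyword's end

    def __init__(self):
        self.trie = {}

    def add_keyword(self, keyword, clean_name=None):
        # Walk or create one trie node per character of the keyword.
        node = self.trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node[self._END] = clean_name or keyword

    def extract_keywords(self, sentence):
        text = sentence.lower()
        found, i, n = [], 0, len(text)
        while i < n:
            # Only start matching at the beginning of a word.
            if i > 0 and text[i - 1].isalnum():
                i += 1
                continue
            node, j, best = self.trie, i, None
            while j < n and text[j] in node:
                node = node[text[j]]
                j += 1
                # Accept a match only if it ends on a word boundary.
                if self._END in node and (j == n or not text[j].isalnum()):
                    best = (node[self._END], j)
            if best:
                found.append(best[0])
                i = best[1]  # continue after the matched keyword
            else:
                i += 1
        return found


matcher = TinyKeywordMatcher()
matcher.add_keyword("Big Apple", "New York")
matcher.add_keyword("Bay Area")
print(matcher.extract_keywords("I love Big Apple and Bay Area."))
```

In this sketch the cost of a scan depends on the length of the sentence (and of the longest keyword), not on how many keywords are stored, because each character is looked up in a dictionary rather than compared against every search term.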
The name sounds strange, but fuzzywuzzy is a very useful library for string matching. It makes it easy to compute string similarity ratios, token ratios, and similar measures, and it is handy for matching records stored in different databases.
$ pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Simple Ratio
fuzz.ratio("this is a test", "this is a test!")
# 97

# Partial Ratio
fuzz.partial_ratio("this is a test", "this is a test!")
# 100
Time series analysis is one of the most common problems in machine learning. PyFlux is an open-source Python library built for working with time series. It offers a good selection of modern time series models, such as ARIMA, GARCH, and VAR models. In short, PyFlux provides a probabilistic approach to time series modeling.
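The two scores above can be approximated with the standard library alone. The sketch below assumes the `difflib`-style ratio 2*M/T (M matched characters, T total characters in both strings); the helper names `simple_ratio` and `naive_partial_ratio` are made up for this example, and the brute-force window search is a simplification rather than fuzzywuzzy's own algorithm.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Similarity from 0 to 100, based on difflib's 2*M/T ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def naive_partial_ratio(a, b):
    """Best simple_ratio of the shorter string against every equally long
    window of the longer string (a brute-force simplification)."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 0
    windows = (
        long_[i:i + len(short)]
        for i in range(len(long_) - len(short) + 1)
    )
    return max(simple_ratio(short, window) for window in windows)


print(simple_ratio("this is a test", "this is a test!"))         # 97
print(naive_partial_ratio("this is a test", "this is a test!"))  # 100
```

For the simple ratio, 14 of the 29 characters in total match on each side, giving 2*14/29, which rounds to 97. The partial ratio reaches 100 because the shorter string appears verbatim inside the longer one.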
$ pip install pyflux
A very important part of data science is communicating results, and visualizing them well gives you a big advantage. Ipyvolume is a Python library for visualizing 3D volumes and glyphs (such as 3D scatter plots) in Jupyter notebooks with minimal configuration.
Using pip:
$ pip install ipyvolume
Using Conda/Anaconda:
$ conda install -c conda-forge ipyvolume
$ pip install dash==0.29.0  # The core dash backend
$ pip install dash-html-components==0.13.2  # HTML components
$ pip install dash-core-components==0.36.0  # Supercharged components
$ pip install dash-table==3.1.3  # Interactive DataTable component (new!)
The following example shows a highly interactive graph with a drop-down menu. When the user selects a value from the drop-down, the application code dynamically loads the data from Google Finance into a pandas DataFrame.
Gym is a toolkit for developing and comparing reinforcement learning algorithms. It is compatible with any numerical computation library, such as TensorFlow or Theano. The library is a collection of test problems, also called environments, that you can use to work out your reinforcement learning algorithms. These environments share a common interface, which lets you write general algorithms.
$ pip install gym
The following example runs an instance of the CartPole-v0 environment for 1000 timesteps, rendering the environment at each step.
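The shared interface is what makes such a loop generic. The sketch below can be run without Gym installed: `CoinGuessEnv` is a hypothetical stand-in that imitates the classic Gym interface (`reset()` returns an observation, `step(action)` returns observation, reward, done flag, and info, and `action_space.sample()` draws a random action), and `run_random_agent` is an illustrative helper, not part of Gym. Rendering is left out.

```python
import random


class CoinGuessEnv:
    """Hypothetical stand-in that imitates the classic Gym-style interface."""

    class _ActionSpace:
        def __init__(self, rng):
            self._rng = rng

        def sample(self):
            # Two possible actions: guess 0 or guess 1.
            return self._rng.randint(0, 1)

    def __init__(self, episode_length=10, seed=0):
        self._rng = random.Random(seed)
        self.action_space = self._ActionSpace(self._rng)
        self.episode_length = episode_length
        self._t = 0

    def reset(self):
        self._t = 0
        return self._t  # observation: the current timestep

    def step(self, action):
        self._t += 1
        # Reward 1.0 when the action matches a freshly flipped coin.
        reward = 1.0 if action == self._rng.randint(0, 1) else 0.0
        done = self._t >= self.episode_length
        return self._t, reward, done, {}


def run_random_agent(env, n_steps):
    """Take n_steps random actions, resetting whenever an episode ends."""
    total_reward, episodes = 0.0, 0
    env.reset()
    for _ in range(n_steps):
        obs, reward, done, info = env.step(env.action_space.sample())
        total_reward += reward
        if done:
            episodes += 1
            env.reset()
    return total_reward, episodes


total, finished = run_random_agent(CoinGuessEnv(episode_length=10), n_steps=1000)
print(f"reward={total}, finished episodes={finished}")
```

With an older Gym release that uses this four-value `step` signature, passing `gym.make("CartPole-v0")` in place of the stand-in should give the loop described above, and a call to `env.render()` inside the loop would draw each step. Newer releases changed the `reset` and `step` signatures, so the helper would need adjusting there.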