Where to find machine learning datasets: an inventory of the best dataset sources


Abstract: Finding a dataset that fits a specific machine learning problem can be difficult and frustrating. The sites listed below offer not only large datasets for experiments but also descriptions and usage examples, and in some cases code for solving machine learning problems related to the datasets.

1 – Kaggle datasets:

Website: http://www.kaggle.com/datasets

This is one of my favorite dataset sites. Each dataset has a small community around it where you can discuss the data, find public code, or create your own projects in a kernel. The site hosts a large number of real-world datasets of different shapes, sizes and formats. You can also browse the kernels associated with each dataset, in which many data scientists share notebooks that analyze the data. For some datasets, these notebooks include an algorithm that solves the associated prediction problem.

2 – Amazon datasets:

Website: https://registry.opendata.aws

This source includes datasets from different fields, such as public transport, ecological resources and satellite imagery. It has a search box to help you find the dataset you are looking for, and each dataset comes with a description and usage examples, which makes the site simple and practical to use.

3 – UCI Machine Learning Repository:

Website: https://archive.ics.uci.edu/ml/datasets.html

This is a repository of more than 100 datasets from the School of Information and Computer Science at the University of California, Irvine. It classifies datasets by type of machine learning problem: you can find univariate, multivariate, classification, regression, or recommender-system datasets. Some of the UCI datasets have been updated and are ready to use.
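
Some files in the repository, such as the classic Iris data, are plain comma-separated text with no header row and the class label in the last column. As a minimal sketch of reading such a file with Python's standard library, assuming that layout holds for the file you picked (the helper name `split_features_and_labels` and the in-memory sample are made up for this illustration and stand in for a real download):

```python
import csv
import io


def split_features_and_labels(text):
    """Parse headerless CSV text whose last column is the label (assumed layout)."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines, which some files end with
            continue
        # Every column except the last is treated as a numeric feature.
        features.append([float(value) for value in row[:-1]])
        labels.append(row[-1])
    return features, labels


# Two sample rows in the Iris layout, used here instead of a real download.
SAMPLE = "5.1,3.5,1.4,0.2,Iris-setosa\n7.0,3.2,4.7,1.4,Iris-versicolor\n\n"

features, labels = split_features_and_labels(SAMPLE)
print(len(features), "rows,", len(features[0]), "features each")
print(sorted(set(labels)))
```

Files with a header row, missing values or non-numeric feature columns would need extra handling, so check the dataset's description page before relying on this layout.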

4 – Google’s Dataset Search engine:

Website: https://toolbox.google.com/datasetsearch

In late 2018, Google did what it does best and launched another great service: a tool for searching for datasets by name. Google’s goal is to unify thousands of different dataset repositories so that their datasets can be discovered.
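
Dataset Search discovers datasets through metadata that publishers embed in their pages using the schema.org `Dataset` vocabulary. As a rough sketch of what such a description might look like, the snippet below builds one as JSON-LD. The helper name `dataset_jsonld` and all of the values (name, description, URL, license) are placeholders invented for this example, and real pages usually carry more fields.

```python
import json


def dataset_jsonld(name, description, url, license_url):
    """Return a schema.org Dataset description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "license": license_url,
    }
    return json.dumps(record, indent=2)


# Placeholder values, not a real dataset.
markup = dataset_jsonld(
    name="Example city bike trips",
    description="Hypothetical trip records used to illustrate the markup.",
    url="https://example.org/datasets/bike-trips",
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(markup)
```

A publisher would typically place the resulting JSON inside a `<script type="application/ld+json">` element on the dataset's landing page.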

5 – Microsoft datasets:

Website: https://msropendata.com

In July 2018, Microsoft and the external research community jointly announced the launch of “Microsoft Research Open Data”.

It includes a data repository in the public cloud to facilitate collaboration among research communities worldwide. It also provides a set of curated datasets that were used in published studies.

6 – Awesome Public Datasets:

Website: https://github.com/awesomedata/awesome-public-datasets

This is a community-maintained list of datasets, organized by topic, such as biology, economics and education. Most of the datasets listed here are free, but you should check the license requirements before using any of them.

7 – Government datasets:

Government datasets are also easy to find. Many countries share a variety of datasets with the public in order to improve transparency. For example:

EU Open Data Portal: European government datasets.

New Zealand government datasets.

Government of India datasets.

8 – Computer vision datasets:

Website: https://www.visualdata.io

If you work in image processing, computer vision or deep learning, this should be one of your main sources of data for experiments.

This site contains some large datasets that can be used to build computer vision (CV) models. You can look for datasets by CV topic, such as semantic segmentation, image captioning or image generation, or by application (for example, self-driving car datasets).

To sum up, from what I have observed, datasets for machine learning research are becoming easier and easier to obtain, and the communities that maintain these datasets will keep growing, so that the computer science community can continue to innovate rapidly and bring more creative solutions to everyday life.

Author: [direction]

This is original content from the Yunqi Community and may not be reproduced without permission.