Important decisions should be based on data, and software development projects are no exception. If you do not look carefully at the data describing a project's evolution, you cannot understand the project's health or propose reasonable improvements. To analyze and mine this information, we can extract meaningful data from the Git repository itself and from the code hosting platform (such as GitHub or GitLab) where the project lives. However, getting data out of Git/GitHub is not trivial. This article introduces some open source Git/GitHub analysis tools for your reference.
First of all, GitHub's official API is the most direct way to get details about a GitHub repository. The API is easy to use: you can call it with curl or a wrapper library in any language and retrieve all of a repository's information (other public Git hosting platforms, as well as self-hosted GitLab instances, offer similar APIs). The annoying part is that GitHub rate-limits API calls: 60 requests per hour for anonymous users and 5,000 for authenticated users. If you want to analyze large projects (or many projects for a global analysis), the API alone is not a good solution. Dashboards that focus on a single project or an individual contributor, however, are usually not affected by these limits.
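As a minimal sketch of the API usage described above, the following stdlib-only Python snippet builds an authenticated request for a repository's metadata and reads back the rate-limit header. The endpoint path is GitHub's documented REST route; the helper names are my own.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"

def repo_request(owner, name, token=None):
    # Build a request for the repository metadata endpoint.
    # An authenticated request (with a token) raises the hourly
    # quota from 60 to 5000 calls.
    url = f"{API_ROOT}/repos/{owner}/{name}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"token {token}"
    return urllib.request.Request(url, headers=headers)

def fetch_repo(req):
    with urllib.request.urlopen(req) as resp:
        # X-RateLimit-Remaining reports how many calls are left this hour.
        remaining = resp.headers.get("X-RateLimit-Remaining")
        return json.load(resp), remaining
```

For heavier analysis you would loop over paginated endpoints (`?page=N&per_page=100`) and back off when the remaining quota approaches zero.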
Through the GitHub API you can get essentially all the information you see when browsing a project's repository on GitHub, but the Git-level information is limited (for example, which lines of code were modified in the last day). To get complete information you need to clone the repository and use Git commands.
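For that kind of Git-level question, a local clone plus `git log` is enough. A small sketch (function names are my own) that asks for per-file added/deleted line counts over the last day:

```python
import subprocess

def numstat_cmd(repo_path, since="1 day ago"):
    # --numstat prints added/deleted line counts per file for each commit;
    # --since restricts the log to recent history (here: the last day).
    return ["git", "-C", repo_path, "log",
            f"--since={since}", "--numstat", "--pretty=format:%H"]

def recent_line_changes(repo_path, since="1 day ago"):
    # Requires a local clone; the GitHub API alone cannot answer this.
    result = subprocess.run(numstat_cmd(repo_path, since),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The output interleaves commit hashes with `added<TAB>deleted<TAB>path` lines, which is easy to parse into per-file statistics.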
GHCrawler is a robust GitHub API crawler developed by Microsoft. It traverses GitHub entities and events, and can search and track them. GHCrawler is particularly useful if you want to analyze the activity of an organization or a project. It is also subject to GitHub's API rate limits, but it optimizes token usage through a token pool and rotation. GHCrawler supports command-line invocation as well as a web interface (ghcrawler-dashboard).
Official project repository: https://github.com/Microsoft/ghcrawler
GH Archive is an open source project that records and archives the public GitHub timeline and makes it accessible for further analysis. It captures all GitHub event information and stores it in a set of JSON files that can be downloaded and processed offline as needed.
In addition, GH Archive is available as a public dataset on Google BigQuery. The dataset is updated automatically every hour, and arbitrary SQL queries can be run over the entire dataset in a few seconds.
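As an illustration of querying that public dataset, the sketch below counts events by type for one archived day. The table naming follows GH Archive's published BigQuery layout; actually running the query requires the third-party google-cloud-bigquery package and a Google Cloud project of your own (both assumptions, not part of the article).

```python
# Count GitHub events by type for a single archived day.
DAY_QUERY = """
SELECT type, COUNT(*) AS events
FROM `githubarchive.day.20200101`
GROUP BY type
ORDER BY events DESC
"""

def count_events_by_type(project_id):
    # Third-party dependency: pip install google-cloud-bigquery
    from google.cloud import bigquery
    client = bigquery.Client(project=project_id)
    return [(row.type, row.events) for row in client.query(DAY_QUERY).result()]
```

Monthly and yearly tables exist as well, so the same query scales from one day to the whole archive by changing the table reference.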
Project official website: https://www.gharchive.org
Similar to GH Archive, the GHTorrent project monitors the public GitHub event timeline. For each event it retrieves the contents and their dependencies in detail, stores the resulting JSON in a MongoDB database, and extracts its structure into a MySQL database.
GHTorrent is thus somewhat similar to GH Archive. The difference is that GH Archive aims to provide a more detailed set of events, fetched hourly, while GHTorrent provides event data in a more structured way that makes it easier to query information across all events; its data dumps are released monthly.
Official project repository: https://github.com/ghtorrent
Apache Kibble is a set of tools for collecting, summarizing, and visualizing activity in software projects. The Kibble architecture consists of a central Kibble server and a set of scanner applications, each of which handles a specific type of resource (a Git repo, a mailing list, a JIRA instance, etc.) and pushes compiled data objects to the Kibble server.
Based on this data, you can customize a dashboard with widgets that display project data (language breakdown, top contributors, code evolution, etc.). In this sense, Kibble is more of a tool for building a web frontend that presents project data.
Project official website: https://kibble.apache.org/
CHAOSS is a Linux Foundation project dedicated to creating the analytics and metric definitions needed for healthy open source communities. The CHAOSS project maintains several tools for mining and computing the metric data projects need:
Augur is a Python library, Flask web application, and REST server that provides metrics on the health and sustainability of open source software development projects. Its goal is rapid prototyping of new metrics of interest to the CHAOSS community.
Cregit focuses on generating views that visualize the provenance of code changes.
GrimoireLab is Bitergia's most mature and ambitious tool to date. The purpose of GrimoireLab is to provide an open source platform for:
1. Automatic and incremental data collection from almost any tool (data source) related to open source development (source code management, issue tracking systems, forums, etc.)
2. Automatic data enrichment, to clean up and extend the collected data (merging duplicate identities, adding information about contributors' affiliations, computing delays, geographic data, etc.)
3. Data visualization, with filtering and searching by time range, project, repository, contributor, etc.
GrimoireLab uses Kibana to provide all these visualizations on top of the collected data.
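The collection step above can be sketched with Perceval, GrimoireLab's data-gathering component. The Git backend below is a third-party dependency (installed separately via pip) and the sketch assumes its documented API; the helper names are my own:

```python
from collections import Counter

def commit_authors(repo_url, clone_path):
    # Perceval's Git backend clones/updates the repository at clone_path
    # and iterates over its full commit history.
    from perceval.backends.core.git import Git  # third-party dependency
    repo = Git(uri=repo_url, gitpath=clone_path)
    # fetch() yields one dict per commit; the raw git fields sit under "data".
    return [item["data"]["Author"] for item in repo.fetch()]

def top_authors(authors, n=3):
    # Small pure helper: rank authors by commit count.
    return Counter(authors).most_common(n)
```

In a full GrimoireLab deployment, this raw data would then pass through the enrichment step (identity merging, affiliations) before landing in the Kibana dashboards.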
CHAOSS project official website: https://chaoss.community/
source{d} calls itself a data platform for the development life cycle. Compared with the previous tools, it focuses more on project code than on community collaboration. The source{d} projects use a universal abstract syntax tree to query details of a code base in a language-independent way.
Several interesting data analysis tools can be found in the source{d} GitHub organization, including:
go-git: a highly extensible Git implementation library written in pure Go.
Hercules: a Go tool for analyzing the full commit history of a repository.
gitbase: a SQL database interface to Git repositories, implemented in Go. For example, the following SQL statement aggregates commits by month, year, and committer:
SELECT YEAR, MONTH, repo_id, committer_email, COUNT(*) AS num_commits
FROM (
    SELECT YEAR(committer_when) AS YEAR,
           MONTH(committer_when) AS MONTH,
           repository_id AS repo_id,
           committer_email
    FROM ref_commits
    NATURAL JOIN commits
    WHERE ref_name = 'HEAD'
) AS t
GROUP BY committer_email, YEAR, MONTH, repo_id;
Official website of the project: https://sourced.tech/
GitHub project organization: https://github.com/src-d
Hubble visualizes the collaboration, usage, and health data of GitHub Enterprise. It aims to help large companies understand how their internal organizations, projects, and contributors are distributed and collaborate.
Hubble consists of two components. The updater component is a Python script that queries relevant data from a GitHub Enterprise appliance every day and stores the results in a Git repository. The docs component is a web application, hosted on GitHub Pages, that visualizes the collected data.
Official project hosting address: https://github.com/Autodesk/hubble
Finally, worth a mention is a very beautiful command-line tool for visualizing Git project information; it supports more than 50 languages and is written in the emerging Rust language.
In this article we listed some data mining tools and projects for Git/GitHub. Besides the open source software mentioned above, some commercial tools are also very good, such as snoot and waydev.