The popular trend of using Python to mine GitHub (I)

Time:2019-11-8


  • Source: content editor of chaindesk.cn

Reading time: 10min

In this article, we’ll explore how to take advantage of Python’s powerful capabilities to collect and process data from GitHub and prepare it for analysis.

GitHub takes the widely used approach of version control and lifts it to a new level by adding social-network features to the world of programming. GitHub lets you create code repositories and provides collaboration features such as bug tracking, feature requests, task management, and wikis. It has about 20 million users and 57 million repositories (source: Wikipedia). These figures alone show that it is the most representative platform for programmers. It also hosts many open source projects that have made great contributions to the field of software development. Assuming that GitHub projects use the latest programming tools and technologies, analyzing GitHub can help us detect the most popular ones. The popularity of a repository on GitHub is measured by the number of commits it receives from the community. In this article, we'll use the GitHub API to collect data from the repositories with the highest number of commits and then discover the most popular technologies.

Scope and process


The GitHub API allows us to get information about the public repositories that users submit. It covers many open source, educational, and personal projects. Our focus is to find the trending technologies and programming languages of the past few months and compare them with repositories from previous years. We will collect all the meta-information about each repository, such as:

  • Name: the name of the repository
  • Description: description of the repository
  • Watchers: users who follow the repository and are notified of its activity
  • Forks: users who have cloned the repository to their own account
  • Open issues: issues submitted about the repository

We will use these data, a combination of qualitative and quantitative information, to identify the latest trends and weak signals. This process can be represented by the steps shown in the following figure:
[Figure: the workflow from data collection through processing to analysis]

Get data


Before using the API, we need to set up authorization. The API gives access to all publicly available data, but some endpoints require user permission. You can create a new token with specific access scopes in the application settings. The scopes depend on what your application needs, such as access to user email or the ability to update user profiles. Password authorization is required only in some cases, for example when an application authorized by the user needs to access data on their behalf; in that case you supply a username or email together with a password.
All API access is over HTTPS from the https://api.github.com/ domain. All data is sent and received as JSON.
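
As a minimal sketch of an authenticated call (the token value is a placeholder, and reading it from an environment variable rather than from the source code is an assumption on our part), a request with the requests library might look like this:

import os
import requests

# hypothetical: the personal access token is assumed to be stored in an
# environment variable rather than hard-coded
token = os.environ.get('GITHUB_TOKEN', '')

headers = {'Authorization': 'token ' + token,
           'Accept': 'application/vnd.github.v3+json'}

# all data is sent and received as JSON over HTTPS
res = requests.get('https://api.github.com/user', headers=headers)
print(res.status_code)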

Rate limit


The GitHub Search API is designed to help you find specific items (repositories, users, and so on). The rate-limiting policy allows up to 1,000 results per search. For requests using basic authentication, OAuth, or a client ID and secret, you can make up to 30 requests per minute. For unauthenticated requests, the rate limit allows up to 10 requests per minute.
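
The remaining quota can be checked at any time through the rate-limit endpoint, which does not itself count against the limit; a small sketch with the requests library:

import requests

# report the remaining quota, including the search API limits
res = requests.get('https://api.github.com/rate_limit')
print(res.json()['resources']['search'])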

Connect to GitHub


GitHub provides a search endpoint that returns all repositories matching a query. As we progress, we will change the value of the variable q (the query) at different steps of the analysis. In the first part, we will retrieve all repositories created after January 1, 2017, and then compare them with the results from previous years.

First, we initialize an empty list, results, that stores all the data about the repositories. Second, we build the GET request with the parameters required by the API. We can only get 100 results per request, so we have to use pagination to build a complete dataset.

import requests

results = []
q = "created:>2017-01-01"

def search_repo_paging(q):
    url = 'https://api.github.com/search/repositories'
    params = {'q': q, 'sort': 'forks', 'order': 'desc', 'per_page': 100}
    while True:
        res = requests.get(url, params=params)
        result = res.json()
        results.extend(result['items'])
        # after the first request, the 'next' link already carries all
        # query parameters, so we clear the params dictionary
        params = {}
        try:
            url = res.links['next']['url']
        except KeyError:
            # no 'next' link means we have reached the last page
            break
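
With these definitions in place, the dataset is presumably built simply by calling the function with the query defined above, which appends every page of results to the results list:

search_repo_paging(q)
len(results)   # at most 1,000 repositories, the Search API ceiling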

In the first request, we have to pass all the parameters to the method. Then, for each next page, we take the full link to the resource from res.links['next']['url'], which already contains all the other parameters; that is why we empty the params dictionary. We repeat this until there is no 'next' key left in res.links. For the other datasets, we modify the search query to retrieve repositories from previous years. For example, to get data from 2015 we define the following query:

q = "created:2015-01-01..2015-12-31"

In order to find the right repository, the API provides a large number of query parameters. Use the qualifier system to search for repositories with high precision. Starting with the main search parameter q, we have the following options:

  • sort: set to forks, because we are interested in finding the repositories with the largest number of forks (you can also sort by number of stars or by update time)
  • order: set to descending
  • per_page: set to 100, the maximum number of repositories returned per request

Of course, the search parameter q can combine multiple qualifiers.
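
For example (an illustrative query, not one used later in this article), qualifiers can be chained inside q to narrow the search to Python repositories with at least 100 stars created after January 1, 2017:

q = "created:>2017-01-01 language:python stars:>100"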

Data pull


The amount of data we collect through the GitHub API is small enough to fit in memory, so we can process it directly in a pandas data frame. If more data were needed, we would recommend storing it in a database such as MongoDB.
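A minimal sketch of that alternative, assuming a local MongoDB instance and the pymongo driver (the database and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['github_trends']              # hypothetical database name
db['repositories'].insert_many(results)   # store the raw API results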
We use JSON utilities to convert the results into clean JSON and to create the data frame.

import json
import pandas as pd
from pandas.io.json import json_normalize   # in newer pandas: pd.json_normalize
import bson.json_util as json_util          # shipped with the pymongo package

# round-trip through BSON-safe JSON to sanitize the raw API results
sanitized = json.loads(json_util.dumps(results))
# flatten nested JSON objects (for example, owner) into separate columns
normalized = json_normalize(sanitized)
df = pd.DataFrame(normalized)

The df data frame contains columns for all the fields returned by the GitHub API. We can list them by entering:

df.columns

Index(['archive_url', 'assignees_url', 'blobs_url', 'branches_url',
       'clone_url', 'collaborators_url', 'comments_url', 'commits_url',
       'compare_url', 'contents_url', 'contributors_url', 'default_branch',
       'deployments_url', 'description', 'downloads_url', 'events_url',
       'fork', 'forks', 'forks_count', 'forks_url', 'full_name',
       'git_commits_url', 'git_refs_url', 'git_tags_url', 'git_url',
       'has_downloads', 'has_issues', 'has_pages', 'has_projects',
       'has_wiki', 'homepage', 'hooks_url', 'html_url', 'id',
       'issue_comment_url', 'issue_events_url', 'issues_url', 'keys_url',
       'labels_url', 'language', 'languages_url', 'merges_url',
       'milestones_url', 'mirror_url', 'name', 'notifications_url',
       'open_issues', 'open_issues_count', 'owner.avatar_url',
       'owner.events_url', 'owner.followers_url', 'owner.following_url',
       'owner.gists_url', 'owner.gravatar_id', 'owner.html_url', 'owner.id',
       'owner.login', 'owner.organizations_url', 'owner.received_events_url',
       'owner.repos_url', 'owner.site_admin', 'owner.starred_url',
       'owner.subscriptions_url', 'owner.type', 'owner.url', 'private',
       'pulls_url', 'pushed_at', 'releases_url', 'score', 'size', 'ssh_url',
       'stargazers_count', 'stargazers_url', 'statuses_url',
       'subscribers_url', 'subscription_url', 'svn_url', 'tags_url',
       'teams_url', 'trees_url', 'updated_at', 'url', 'watchers',
       'watchers_count', 'year'],
      dtype='object')

We then select the subset of variables that will be used for further analysis. We skip all technical variables related to URLs, owner information, or IDs. The remaining columns contain information that is likely to help us identify new technology trends:

  • description: the user's description of the repository
  • watchers_count: the number of watchers
  • size: the size of the repository in kilobytes
  • forks_count: the number of forks
  • open_issues: the number of open issues
  • language: the programming language the repository is written in

We chose watchers_count to measure the popularity of a repository; this number indicates how many people are interested in the project. We can also use forks_count, which gives slightly different information about popularity: it represents the number of people actually using the code, and so relates to a different group.
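
As a quick sketch of that selection (the column names are taken from the API output listed above), the data frame can be reduced with a plain column filter:

# keep only the descriptive and quantitative columns used in the analysis
df = df[['description', 'watchers_count', 'size',
         'forks_count', 'open_issues', 'language']]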

Data processing


In the previous step, we built the raw data, which can now be further analyzed. Our goal is to analyze two types of data:

  • Text data in description
  • Numerical data of other variables

Each of them requires a different preprocessing technique. Let’s look at each type in detail.

Text data


For the first, we have to create a new variable that contains the cleaned strings. We will do this in three steps, which were described in the previous chapters:

  • Selecting English descriptions
  • Tokenization
  • Stop-word removal

Since we only deal with English data, we should remove all descriptions written in other languages. The main reason is that each language requires a different processing and analysis pipeline. If we kept Russian or Chinese descriptions, we would get very noisy data that we cannot interpret. It is therefore fair to say that we are analyzing the trends of the English-speaking world.

First, we drop all rows with a missing description:

df = df.dropna(subset=['description'])

In order to remove non-English descriptions, we first have to detect the language of each text. To do this, we use a library called langdetect, which is based on Google's language-detection project.

from langdetect import detect

df['lang'] = df.apply(lambda x: detect(x['description']),axis=1)

We create a new column that contains all the predictions. We see different languages, such as en (English), zh-cn (Chinese), vi (Vietnamese), or ca (Catalan).

df['lang']

0        en
1        en
2        en
3        en
4        en
5     zh-cn
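
The share of each detected language can be checked with a simple frequency count (a small sketch, to be run before the English-only filter below):

# fraction of repositories per detected language
df['lang'].value_counts(normalize=True).head()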

In our data set, en accounts for 78.7% of all repositories. We will now select only those repositories with descriptions in English:

df = df[df['lang'] == 'en']

In the next step, we will create a new column, clean, containing the preprocessed text data. We execute the following code to tokenize the descriptions and remove stop words:

import string
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# nltk.download('punkt') and nltk.download('stopwords') may be needed the
# first time the tokenizer and the stop-word lists are used

def clean(text='', stopwords=[]):
    # tokenize
    tokens = word_tokenize(text.strip())
    # lowercase
    clean = [i.lower() for i in tokens]
    # remove stopwords
    clean = [i for i in clean if i not in stopwords]
    # remove punctuation
    punctuations = list(string.punctuation)
    clean = [i.strip(''.join(punctuations)) for i in clean if i not in punctuations]
    return " ".join(clean)

df['clean'] = df['description'].apply(str)  # make sure description is a string
df['clean'] = df['clean'].apply(lambda x: clean(text=x, stopwords=stopwords.words('english')))

Finally, we obtain a clean column which contains cleaned English descriptions, ready for analysis:

df['clean'].head(5)

0    roadmap becoming web developer 2017
1    base repository imad v2 course application ple…
2    decrypted content eqgrp-auction-file.tar.xz
3    shadow brokers lost translation leak
4    learn design large-scale systems prep system d...

Numerical data


For the numerical data, we check the distribution of the values and whether there are any missing values:

df[['watchers_count','size','forks_count','open_issues']].describe()

[Output: df.describe() summary table for watchers_count, size, forks_count, and open_issues]
We see that there are no missing values in any of the four variables: watchers_count, size, forks_count, and open_issues. The values of watchers_count range from 0 to 20,792, while the number of forks ranges from a minimum of 33 up to 2,589. A quarter of the repositories have no open issues at all, while the top 25% have more than 12. It is worth noting that one repository in our dataset has 458 open issues.
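
A quick way to confirm the absence of missing values (a sketch over the same four columns):

# count missing values per column; all zeros confirms the statement above
df[['watchers_count', 'size', 'forks_count', 'open_issues']].isnull().sum()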

Once we have finished preprocessing the data, our next step is to analyze it so that we can get operational insights from it.