Debug error, view table structure, transaction output quantity distribution – use Google big query API to process bitcoin data (3)

Time:2020-2-25

The first two articles:
Data set on bitcoin chain on kaggle – using Google big query API to process bitcoin data (1)
It summarizes the code of kernal on kaggle

Get bigquery bitcoin data with SQL – process bitcoin data with Google big query API (2)
Get raw data using the simplest SQL

Error reporting SQL

Write a query to count the input and output quantity of bitcoin transaction

from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT SUM(output_count),  COUNT(1) AS tx_count FROM `bigquery-public-data.bitcoin_blockchain.transactions` as tx GROUP BY output_count"
r = client.query(query)
df = r.to_dataframe()

Result error:

BadRequest: 400 Unrecognized name: output_count at [1:12]

Does that mean there is no such column?

I went back to the dataset home page and found this column.

Get table schema

Reference notebook

dataset_ref = client.dataset("bitcoin_blockchain", project="bigquery-public-data")
dataset = client.get_dataset(dataset_ref)

#Print all the tables in the dataset
for table in client.list_tables(dataset):
    print(table.table_id)

blocks
transactions

It’s wrong to see that there are four tables in this data set, but there are only two tables here. Continue to see the contents of transactions table

table_ref = dataset_ref.table("transactions")
table = client.get_table(table_ref)
for col in table.schema:
    print(col)
    print()

SchemaField(‘timestamp’, ‘INTEGER’, ‘NULLABLE’, None, ())

SchemaField(‘transaction_id’, ‘STRING’, ‘NULLABLE’, None, ())

SchemaField(‘inputs’, ‘RECORD’, ‘REPEATED’, None, (SchemaField(‘input_script_bytes’, ‘BYTES’, ‘NULLABLE’, None, ()), SchemaField(‘input_script_string’, ‘STRING’, ‘NULLABLE’, None, ()), SchemaField(‘input_script_string_error’, ‘STRING’, ‘NULLABLE’, None, ()), SchemaField(‘input_sequence_number’, ‘INTEGER’, ‘NULLABLE’, None, ()), SchemaField(‘input_pubkey_base58’, ‘STRING’, ‘NULLABLE’, None, ()), SchemaField(‘input_pubkey_base58_error’, ‘STRING’, ‘NULLABLE’, None, ())))

SchemaField(‘outputs’, ‘RECORD’, ‘REPEATED’, None, (SchemaField(‘output_satoshis’, ‘INTEGER’, ‘NULLABLE’, None, ()), SchemaField(‘output_script_bytes’, ‘BYTES’, ‘NULLABLE’, None, ()), SchemaField(‘output_script_string’, ‘STRING’, ‘NULLABLE’, None, ()), SchemaField(‘output_script_string_error’, ‘STRING’, ‘NULLABLE’, None, ()), SchemaField(‘output_pubkey_base58’, ‘STRING’, ‘NULLABLE’, None, ()), SchemaField(‘output_pubkey_base58_error’, ‘STRING’, ‘NULLABLE’, None, ())))

SchemaField(‘block_id’, ‘STRING’, ‘NULLABLE’, None, ())

SchemaField(‘previous_block’, ‘STRING’, ‘NULLABLE’, None, ())

SchemaField(‘merkle_root’, ‘STRING’, ‘NULLABLE’, None, ())

SchemaField(‘nonce’, ‘INTEGER’, ‘NULLABLE’, None, ())

SchemaField(‘version’, ‘INTEGER’, ‘NULLABLE’, None, ())

SchemaField(‘work_terahash’, ‘INTEGER’, ‘NULLABLE’, None, ())

SchemaField(‘work_error’, ‘STRING’, ‘NULLABLE’, None, ())

There is really no output “count

Get a piece of data

r1 = client('SELECT * FROM `bigquery-public-data.bitcoin_blockchain.transactions` WHERE transaction_id = "4a5e1e4baab89f3a32518a88c31bc87f618f76673e2cc77ab2127b7afdeda33b"')
df = r1.to_dataframe()
df.shape

(1, 11)

The data returned is only 11 columns, which is different from the result of the previous article.

It turns out that bigquery has two different bitcoin datasets

In the previous article:
bigquery-public-data.crypto_bitcoin.transactions
This is mentioned on the dataset home page

The error code above uses
bigquery-public-data.bitcoin_blockchain.transactions
It is used in the official kernal recommended from the dataset home page.

The two datasets are different in both the number of tables and the structure of the tables.

So it’s better to use the first one

Final code and visualization

from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT output_count,  COUNT(1) AS tx_count FROM `bigquery-public-data.crypto_bitcoin.transactions` as tx GROUP BY output_count ORDER BY output_count"
r = client.query(query)
df = r.to_dataframe()
r.total_bytes_processed / (1024 ** 3)

3.7562014535069466

df

Debug error, view table structure, transaction output quantity distribution - use Google big query API to process bitcoin data (3)

s = df['tx_count'].sum()
s

504172778

500 million transactions in this dataset

df['tx_count'][0] / s * 100, df['tx_count'][1] / s * 100 , df['tx_count'][2] / s * 100, 

(11.770812822424935, 78.09005467566121, 5.974420340480977)

11.77%, 78.09% and 5.97% of the transactions with output count of 1, 2 and 3 respectively

The long tail of the data is clear, with more than 90 transactions focused on 1 – 3. So just draw the first 10 lines of the bar chart

df[:10].set_index('output_count').plot.bar()

Debug error, view table structure, transaction output quantity distribution - use Google big query API to process bitcoin data (3)

Welcome to my blog: https://codeplot.top/
My blog bitcoin category: https://codeplot.top/tags/bitcoin/