Data processing [Swiss Army knife pandas guide]: 1. Data creation, reading and writing


Reference source:

Data creation, reading and writing guide

The first step in any data analysis project can be to read data from a file, so we need to consider this first. In this section, we will cover the exercise of creating pandas series and dataframe objects

The IO tools section of the official pandas documentation provides a comprehensive overview of this topic

First let’s import the pandas Library

import pandas as pd

There are two core objects in creating data pandas: dataframe and series.

Dataframe is a table. It contains a series of individual entries, each with a certain value. Each entry corresponds to a row (or record) and a column.

For example, consider the following simple dataframe:

pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

In this example, the value of the “0, no” entry is 131. The value of the “0, yes” entry is 50, and so on.

Dataframe entries are not limited to integers. For example, this is a dataframe with the value STR string:

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})

We use the PD. Dataframe constructor to generate these dataframe objects. The syntax for creating a new object is to initialize a dictionary whose key is the column name (in this case, Bob and Sue) and whose value is the list of entries. This is the standard way to build a new dataframe, and the one you’re most likely to encounter.

The constructor for the dictionary list assigns a value to the column label, but uses only an incremental count of 0 (0,1,2,3,…) as the row label. Sometimes it doesn’t matter, but we often want to assign the row labels ourselves.

The list of row labels used in the dataframe is called an index. We can use index parameters in the constructor to assign values to it:

pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])

Series, by contrast, is a series of data values. If dataframe is a table, series is a list. In fact, you can create a series with only one list:

pd.Series([1, 2, 3, 4, 5])

Series is essentially a single column of a dataframe. Therefore, you can use index parameters to assign column values to series in the same way as before. However, series does not have a column name, it has only one total name:

pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')

Series is closely related to dataframe. It’s helpful to think of a dataframe as actually just a bunch of series “glued together.”. We’ll see more in the next part of this tutorial.

Read common file CSV

It is very convenient to create dataframe and series manually. However, in most cases, we will not actually create our own data manually, we will use the existing data.

Data can be stored in many different forms and formats. So far, the most basic one is a simple CSV file. When you open a CSV file, you will see the following:

Product A,Product B,Product C,

CSV files are comma separated tables

Now let’s put our toy dataset aside and see what the real dataset looks like when we read it into the dataframe.

In this case, we will look at the price data of Melbourne, Australia.

Click here to download the dataset

Let’s upload the CSV file to VITU’s dataset space

Data processing [Swiss Army knife pandas guide]: 1. Data creation, reading and writing

We will use the read ﹣ CSV function to read the data into the dataframe. Do it in this way.

melbourne_data = pd.read_csv('melb_data.csv') 

We can use the shape property to check the size of the generated dataframe:


As a result, our new dataframe has more than 13000 records, divided into 21 different columns.

We can use the head command to check the content of the dataframe. The command grabs the first five lines:


The pandas read ﹣ CSV function has rich functions and can specify more than 30 optional parameters. For example, you can use a column of a CSV file as a built-in index when creating a dataset, and pandas does not get it automatically by default. In order for pandas to use the column for indexing (rather than creating a new index from scratch), we can specify and use index col.

melbourne_data = pd.read_csv('melb_data.csv', index_col=0)

Reading common files Excel

Let’s look at more data types you may encounter.

The first is the old Excel spreadsheet. Excel files (XLS or XLST) organize themselves into a series of named worksheets. Each sheet is basically a form. Therefore, to load data into pandas, we need an additional parameter: the name of the worksheet of interest.

Click here to download the dataset

Open locally as follows

Data processing [Swiss Army knife pandas guide]: 1. Data creation, reading and writing

Let’s upload the XLS file to VITU’s dataset space

Data processing [Swiss Army knife pandas guide]: 1. Data creation, reading and writing

Read with pandas’s read ﹣ Excel:

wic = pd.read_excel("WICAgencies2013ytd.xls", 
                    sheet_name='Total Women')

As you can see in this example, Excel files are usually not formatted like CSV files. Spreadsheets allow (and encourage) the creation of human readable but not machine-readable comments and fields.

So before we use this particular data set, we need to clean up a little bit. We’ll see how to do this in the next section.

Read the common file SQLite

Now, let’s move to another common data format: SQLite.

Click here to download the SQLite dataset

Let’s upload the SQLite file to VITU’s dataset space

Data processing [Swiss Army knife pandas guide]: 1. Data creation, reading and writing

Connecting to a SQL database requires more thinking than reading from an excel file. First, you need to create a connector

Pandas will not automatically perform this operation for you because there are many different types of SQL databases, each with its own connector. Therefore, for SQLite database (the only type supported by VITU), you need to do the following first (use SQLite3 library included with Python):

import sqlite3
conn = sqlite3.connect("Salaries.sqlite")

Another thing you need to do is write SQL statements. Internally, SQL databases operate in a completely different way. Externally, however, they all provide the same API, the “structured query language” (or SQL…).

So, needless to say, here is how many tables are in the database from SQLite

tables = pd.read_sql_query("SELECT * FROM sqlite_master where type='table'", conn)

Here’s a look at all the data in the sales table from SQLite

salaries = pd.read_sql_query("SELECT * FROM Salaries", conn)

How to write common files

It’s usually easier to write data to a file than to read it from a file, because pandas can handle the conversion for you.

We will use the CSV file again. The opposite of “read” CSV (read our data) is “to” CSV, write it. Using a CSV file is simple:


To write back to an excel file, you need to use to excel and sheet name again:

wic.head().to_excel('wic_head.xls', sheet_name='Total Women')

Finally, to write to the SQLite database, we need to provide the name of the table in the data input database, as well as the connector:

conn = sqlite3.connect("salaries_head.sqlite")
salaries.head().to_sql("salaries", conn)

Painless reading and writing!

Original address: data processing [Swiss Army knife pandas guide]: 1. Data creation, reading and writing

Recommended Today

Laravel service container must know

The article was forwarded from the professional laravel developer community. Original link: To learn how to build an application with laravel is not only to learn how to use different classes and components in the framework, but also to remember allartisanCommand or all helper functions (we have Google). Learning to code with laravel is […]