Python text data processing: detailed study notes

Time:2020-1-10

Recently I have come to feel that what most limits me when using Python and reading other people’s code is the ability to process data.

In fact, programming is essentially data processing: reading text or image data in Python, splitting it, and turning it into an n-dimensional matrix that can be fed into someone else’s model, and bingo, out comes a result. The result is, of course, also a matrix or a vector.

So the reason we feel helpless in front of many models and code bases is that we have not mastered the “dragon-slaying sword” of data processing and cannot handle massive data in one go. I therefore want to take a piece of someone else’s code as an example, study the subtleties of its text data processing carefully, and deepen my understanding of this area.

1) Problem description

Data: visitor data for one area over 181 days. Each line has two columns: the first is the name of the visitor, and the second lists the times at which that visitor was in the area during those 181 days.

Objective: aggregate the visitor data and discretize it into a three-dimensional 7 × 26 × 24 matrix indexed by day of the week / week / hour.
That is, each value in the matrix is the number of visitors to the area on a given day of the week, in a given week, at a given hour. For example,
[1, 5, 19] = 100 means that 100 visitors were in the area at 7:00 p.m. on day 1 of week 5. Note that the code below uses zero-based indices counted from Monday, 1 October 2018, so in the code index 1 is a Tuesday and index 5 is the sixth week.
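
The index arithmetic behind this layout can be sketched as follows. This is only an illustration, assuming zero-based indices and the start date 2018-10-01 used by the code in section 3; the names START and slot_index are hypothetical and do not appear in the original code.

```python
import datetime

# Assumed first day of the data (a Monday), taken from the code in section 3
START = datetime.date(2018, 10, 1)

def slot_index(day, hour):
    """Return the zero-based (day of week, week, hour) position of one visit."""
    offset = (day - START).days        # days elapsed since the start date
    week, weekday = divmod(offset, 7)  # week index and day of week (0 = Monday)
    return weekday, week, hour

# Tuesday, 6 November 2018 at 19:00
print(slot_index(datetime.date(2018, 11, 6), 19))  # (1, 5, 19)
```

With this convention the last covered day, 31 March 2019, lands in week index 25, which is why the week axis has 26 entries.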

2) Difficulties

At least, these were the difficult points for me.

2.1) How to compute the statistics line by line

2.2) How to discretize time (stored as a matrix of day of the week, week and hour)

3) Code

import datetime

import numpy as np
import pandas as pd

# Dictionary lookups replace repeated type conversions, which saves some computation time.
date2position = {}    # date as int (e.g. 20181001) -> [day of the week, week index]
datestr2dateint = {}  # date as str (e.g. "20181001") -> date as int
str2int = {}          # hour as zero-padded str (e.g. "09") -> hour as int

# 182 days = 26 full weeks, starting on Monday 2018-10-01
for i in range(182):
    date = datetime.date(day=1, month=10, year=2018) + datetime.timedelta(days=i)
    # print(i, ":", date)
    date_int = int(str(date).replace("-", ""))
    date2position[date_int] = [i % 7, i // 7]
    datestr2dateint[str(date_int)] = date_int
# print(datestr2dateint)

for i in range(24):
    str2int[str(i).zfill(2)] = i

# Raw string, so that "\t" and "\0" in the Windows path are not read as escape sequences
f = open(r"D:\BaiDuBigData19-URFC-master\UrbanRegionFunctionClassification-master\data\train_visit\000000_008.txt")
# table = pd.read_csv(f, header=None, error_bad_lines=False)
table = pd.read_csv(f, header=None, sep='\t')

# print(table.shape)
# print(table.iloc[1])
strings = table[1]
# print(strings)
init = np.zeros((7, 26, 24))
for string in strings:
    temp = []
    for item in string.split(','):
        temp.append([item[0:8], item[9:].split("|")])
    for date, visit_lst in temp:
        # x - day of the week (0 = Monday)
        # y - week index (0 to 25)
        # z - hour of the day (0 to 23)
        # value - total number of visitors
        # print(visit_lst)
        # print(date)
        x, y = date2position[datestr2dateint[date]]
        for visit in visit_lst:
            init[x][y][str2int[visit]] += 1
            # print(init[x][y][str2int[visit]])

3.1) Dictionary creation: discretizing time and saving computation

Three dictionaries are created here. Let’s look at the code and at what each dictionary contains:


date2position = {}
datestr2dateint = {}
str2int = {}
for i in range(182):
    date = datetime.date(day=1, month=10, year=2018) + datetime.timedelta(days=i)
    # print(i, ":", date)
    date_int = int(str(date).replace("-", ""))
    date2position[date_int] = [i % 7, i // 7]
    datestr2dateint[str(date_int)] = date_int
for i in range(24):
    str2int[str(i).zfill(2)] = i

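Printing date2position shows entries such as 20181001: [0, 0], 20181002: [1, 0] and 20181008: [0, 1].

Printing datestr2dateint shows entries such as '20181001': 20181001.

Printing str2int shows '00': 0, '01': 1, and so on up to '23': 23.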

As you can see, datestr2dateint maps a date string to the same date as an int.
date2position maps that integer date to its position: the day of the week (i % 7) and the week index (i // 7).
str2int maps the 24 zero-padded hour strings of the day to integers.
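
To make clear what the dictionaries replace, here is a small sketch comparing a precomputed lookup with parsing the date string on every call. It is a simplification, not the original code: it merges the two date dictionaries into a single one keyed by the date string, and the names START, lookup and position_by_parsing are hypothetical.

```python
import datetime

START = datetime.date(2018, 10, 1)  # start date used in the article's code

# Precomputed table: date string -> (day of week, week index)
lookup = {}
for i in range(182):
    day = START + datetime.timedelta(days=i)
    lookup[day.strftime("%Y%m%d")] = (i % 7, i // 7)

def position_by_parsing(date_str):
    """Compute the same position by parsing the string each time."""
    day = datetime.datetime.strptime(date_str, "%Y%m%d").date()
    offset = (day - START).days
    return offset % 7, offset // 7

# Both approaches agree for every covered date
assert all(lookup[s] == position_by_parsing(s) for s in lookup)
print(lookup["20181221"])  # (4, 11): a Friday, week index 11
```

The lookup does the date arithmetic once per calendar day instead of once per visit record, which is presumably the saving the comment in the code refers to.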

3.2) Read the file and get the strings line by line

Note that the columns of the text file are separated by a tab, \t (which separates the user name from the visit information), so the file is read with sep='\t':


# Raw string, so that "\t" and "\0" in the Windows path are not read as escape sequences
f = open(r"D:\BaiDuBigData19-URFC-master\UrbanRegionFunctionClassification-master\data\train_visit\000000_008.txt")
# table = pd.read_csv(f, header=None, error_bad_lines=False)
table = pd.read_csv(f, header=None, sep='\t')

Then strings takes the visit information, which is the second column of the table:


strings = table[1]
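
A minimal sketch of this reading step, using two made-up lines held in memory instead of the real file. The user names and visit records are invented for illustration; the record layout (date, "&", hours joined by "|", dates joined by ",") is the one described in section 3.3.

```python
import io

import pandas as pd

# Two made-up lines: user name, a tab, then the visit record
sample = (
    "user_a\t20181221&09|10|11,20181222&14\n"
    "user_b\t20181001&00|23\n"
)

# sep="\t" keeps the commas inside the visit record in a single column
table = pd.read_csv(io.StringIO(sample), header=None, sep="\t")
strings = table[1]

print(table.shape)  # (2, 2)
print(strings[0])   # 20181221&09|10|11,20181222&14
```

Because the separator is a tab, the commas between dates do not split the record into extra columns, which would happen with the default comma separator.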

3.3) String segmentation

Each element of strings is one line of the file, that is, the full visit record of one user, and the elements are processed in a loop. Within a record, visits on different dates are separated by “,”, so use:


for string in strings:
    temp = []
    for item in string.split(','):

Each item is then the visit record for a single date.

After that, the temp list stores the date and the hours of each item.
For example, the first item is 20181221&09|10|11|12|13|14|15.
The date is item[0:8].
The hours follow the “&” at index 8 and are separated by “|”, so they are obtained with item[9:].split("|").


temp.append([item[0:8], item[9:].split("|")])

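For an item such as 20181221&09|10|11|12|13|14|15, the entry appended to temp is ['20181221', ['09', '10', '11', '12', '13', '14', '15']].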
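
The splitting step can be tried on its own with a short made-up record. This is a sketch under the format assumptions above; the function name parse_record is hypothetical, and the slicing is the same as in the article's code.

```python
def parse_record(record):
    """Split one user's visit record into [date, list of hour strings] pairs."""
    temp = []
    for item in record.split(","):           # one item per date
        date = item[0:8]                     # first 8 characters: YYYYMMDD
        hours = item[9:].split("|")          # skip the "&" at index 8
        temp.append([date, hours])
    return temp

# Made-up record covering two dates
print(parse_record("20181221&09|10|11,20181222&14"))
# [['20181221', ['09', '10', '11']], ['20181222', ['14']]]
```

The fixed slice positions rely on every date being exactly eight characters long, which holds for the YYYYMMDD form used here.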

So each entry of temp holds two pieces of data: the date and the list of hours.
The date is converted into a position (day of the week, week) in the 7 × 26 × 24 matrix using the dictionaries built earlier, and each hour gives the third index.
The matrix cell at that position is then incremented, which counts the visitors for each slot:

for date, visit_lst in temp:
    # x - day of the week (0 = Monday)
    # y - week index (0 to 25)
    # z - hour of the day (0 to 23)
    # value - total number of visitors
    # print(visit_lst)
    # print(date)
    x, y = date2position[datestr2dateint[date]]
    for visit in visit_lst:
        init[x][y][str2int[visit]] += 1

This code is very short, but it is the essence of the whole time discretization.
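
Putting the steps together, the whole discretization can be sketched as one self-contained function that runs on made-up records. This is a variation on the article's code rather than a copy: it uses a single dictionary keyed by the date string, converts hours with int(), and stores integer counts. The names START, date2pos and build_visit_matrix are hypothetical.

```python
import datetime

import numpy as np

START = datetime.date(2018, 10, 1)  # start date used in the article's code
N_DAYS = 182                        # 26 full weeks

# date string -> (day of week, week index)
date2pos = {
    (START + datetime.timedelta(days=i)).strftime("%Y%m%d"): (i % 7, i // 7)
    for i in range(N_DAYS)
}

def build_visit_matrix(records):
    """Count visits per (day of week, week, hour) slot."""
    counts = np.zeros((7, 26, 24), dtype=np.int64)
    for record in records:                   # one record per user
        for item in record.split(","):       # one item per date
            x, y = date2pos[item[0:8]]
            for hour in item[9:].split("|"):
                counts[x, y, int(hour)] += 1
    return counts

# Two made-up user records
records = ["20181221&09|10|11,20181222&14", "20181221&09"]
matrix = build_visit_matrix(records)

print(matrix.shape)      # (7, 26, 24)
print(matrix[4, 11, 9])  # 2: both users were present on Friday 2018-12-21 at 09:00
print(matrix.sum())      # 5
```

A date outside the 182 covered days would raise a KeyError here, just as it would in the original dictionary lookup.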
