A problem of Python data processing and dictionary generation
sallyheros asked 4 weeks ago

Problem description
There are two dictionary files on hand, file1 and file2.
A new file needs to be generated from these two dictionary files.
The contents of file1 are:

zhangwei
wangwei
wangfang
liwei
lina
zhangmin
lijing
wangjing
liuwei
wangxiuying
zhangli
lixiuying
wangli
zhangjing
zhangxiuying
liqiang
wangmin
limin
wanglei
liuyang
wangyan
wangyong
lijun
zhangyong
lijie
zhangjie
zhanglei
wangqiang
lijuan
wangjun
zhangyan
zhangtao
wangtao
liyan
wangchao
liming
liyong
wangjuan
liujie
liumin
lixia
lili
......

The contents of file2 are:

123
123456
@123
888
999
666
2015
2016
521

Combining file1 + file2 is required to generate something similar to

zhangwei123
zhangwei123456
[email protected]
zhangwei888
zhangwei999
zhangwei666
zhangwei2015
zhangwei2016
zhangwei521
wangwei123
wangwei123456
[email protected]
wangwei888
wangwei999
wangwei666
wangwei2015
wangwei2016
wangwei521
wangfang123
wangfang123456
[email protected]
wangfang888
wangfang999
wangfang666
wangfang2015
wangfang2016
wangfang521

as the new dictionary file.
So far, I have written this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

f = open('zidian.txt','w')
with open('file1.txt','r') as username:
    for user in username:
        print user
        with open('file2.txt','r') as dict:
            for dic in dict.readlines():
                f.write(user.strip()+dic.strip('\r')+'\n')
               

But this approach has a drawback: the generated dictionary file is too large.
What I want now is to generate one file from file1 + lines 1 to 5 of file2, then another file from file1 + lines 6 to 10 of file2 in the next cycle, and so on until file2 is exhausted.
Could some expert advise how to improve this?

5 Answers
Best Answer
dokelung answered 4 weeks ago

There is no need to go as far as splitting files yet; itertools.product can help you finish this more succinctly:

import itertools

with open('zidian.txt', 'w') as z:
    with open('file1.txt') as f1, open('file2.txt') as f2:
        for a, b in itertools.product(f1, f2):
            a, b = a.strip(), b.strip()
            print(a+b, file=z)

And a version that splits the output:

import itertools

with open('file2.txt') as f2:
    for key, group in itertools.groupby(enumerate(f2), lambda t: t[0]//5):
        with open('file1.txt') as f1, open('zidian-{}.txt'.format(key), 'w') as z:
            for a, (_, b) in itertools.product(f1, group):
                a, b = a.strip(), b.strip()
                print(a+b, file=z)

Now let’s talk about some problems in your original code:

  • In f = open('zidian.txt','w') you open the file but never close it. It’s better to use with to read and write files.
  • dict.readlines(): do not use readlines unless absolutely necessary. Remember! Please refer to the article “Text format conversion code optimization”.
  • In addition, the names dic and dict are misleading: dict is the built-in dictionary type in Python, so any slightly experienced Python programmer will assume these variables are dictionaries, which easily causes misunderstanding.
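To illustrate the readlines point with a small sketch (sample.txt is a made-up stand-in): iterating the file object directly yields one line at a time, so memory use stays flat no matter how big the file is.

```python
# Create a small sample file, then stream it line by line.
with open("sample.txt", "w") as f:
    f.write("zhangwei\nwangwei\nwangfang\n")

names = []
with open("sample.txt") as f:   # a file object is already an iterator
    for line in f:              # one line per step; no readlines() needed
        names.append(line.strip())

print(names)  # ['zhangwei', 'wangwei', 'wangfang']
```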

Questions I answered: Python-QA

MichaelXoX replied 4 weeks ago

Did you actually run this? Does it meet the requirement? It doesn’t seem to match the sample answer in your question. The requirement in your question is that every name in file1 gets every string in file2 appended, so the result should be m * n entries, but that is not what this execution produces!

ferstar answered 4 weeks ago

Well, I misread the question, so I have rewritten the code. I admit that using filehandler.readlines() was embarrassing~
In fact, if you just think the generated file is a bit too big, *nix has a little tool called split that is perfect for this: it can freely cut a large file into several smaller ones.
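For reference, a quick sketch of the split approach (GNU coreutils options; zidian.txt here is a tiny stand-in for the big generated file):

```shell
# Make a stand-in wordlist, then cut it into 5-line chunks.
printf 'a\nb\nc\nd\ne\nf\ng\n' > zidian.txt
split -l 5 -d zidian.txt zidian-   # -l: lines per chunk, -d: numeric suffixes
ls zidian-*                        # zidian-00  zidian-01
```

Note that -d (numeric suffixes) is a GNU extension; BSD/macOS split uses alphabetic suffixes by default.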
If you don’t need the result split up, only a simple change to the write2file function below is needed, and the id_generator function and its related modules (random, string) can be deleted:

def write2file(item):
    with open("dict.txt", "a") as fh, open("file1.txt", "r") as f1:
        for i in f1.readlines():
            for j in item:
                fh.write("{}{}\n".format(i.strip(), j))
       
import random
import string
from multiprocessing.dummy import Pool


def id_generator(size=8, chars=string.ascii_letters + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))


def generate_index(n, step=5):
    for i in range(0, n, step):
        if i + step < n:
            yield i, i+step
        else:
            yield i, None


def write2file(item):
    ext_id = id_generator()
    with open("dict_{}.txt".format(ext_id), "w") as fh, open("file1.txt", "r") as f1:
        for i in f1.readlines():
            for j in item:
                fh.write("{}{}\n".format(i.strip(), j))


def multi_process(lst):
    pool = Pool()
    pool.map(write2file, lst)
    pool.close()
    pool.join()


if __name__ == "__main__":
    with open("file2.txt") as f2:
        _b_lst = [_.strip() for _ in f2.readlines()]
        b_lst = (_b_lst[i: j] for i, j in generate_index(len(_b_lst), 5))
    multi_process(b_lst)

The result is several text files named dict_ plus an 8-character random string (screenshot omitted).
One of them, dict_3txVnToL.txt, contains:

zhangwei123
zhangwei123456
[email protected]
zhangwei888
zhangwei999
wangwei123
wangwei123456
[email protected]
wangwei888
wangwei999
...

That was the new version; what follows is my old answer.
To meet your original requirement:

with open("file1") as f1, open("file2") as f2, open("new", "w") as new:
    b = f2.readline().strip()
    while b:
        a = f1.readline().strip()
        for i in range(5):
            if b:
                new.write("{}{}\n".format(a, b))
            else: break
            b = f2.readline().strip()

This reads only one line at a time, so it can cope with files of any size. Energy-saving and environmentally friendly. The results are as follows:

$ head new
zhangwei123
zhangwei123456
[email protected]
zhangwei888
zhangwei999
wangwei666
wangwei2015
wangwei2016
wangwei521
wangwei123

PS: as mentioned above, try to avoid the readlines method; with limited memory it will be a disaster on large files.

manong answered 4 weeks ago

Save each line of file2 into a list, then take five items from the list at a time.
I don’t have Python at hand and wrote this by hand, so there are probably mistakes; just take the idea.

names = []
with open('file1.txt', 'r') as username:
    for line in username:
        names.append(line.strip())

passwords = []
with open('file2.txt', 'r') as f2:
    for line in f2:
        passwords.append(line.strip())

for i in range(len(passwords) // 5):
    f = open('zidian' + str(i + 1) + '.txt', 'w')
    for name in names:
        for j in range(5):
            f.write(name + passwords[i * 5 + j] + '\n')
    f.close()
# For the remainder after dividing by 5, write the last few lines to one more file; that code is left out
libxd answered 4 weeks ago

@dokelung itertools.cycle would be a clever fit here. I have a better way:

with open('file2') as file2_handle:
    passwords = [line.strip() for line in file2_handle.readlines()]
    # As said above, readlines is not great, but that's not absolute: when the file
    # is not too large for memory, reading it all at once can noticeably improve performance.
    # A few million lines is no problem; I often read 10GB+ files with Python.
    # Still, try to avoid readlines in general; it is only used here for convenience.

with open('file1') as file1_handle:
    name_password_list = ['%s%s' % (line.rstrip(), passwords[i % len(passwords)])
                          for i, line in enumerate(file1_handle)]

# With this name+password list in hand, I can build whatever output files come next
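For comparison, the same round-robin pairing can be written with itertools.cycle instead of index arithmetic; the inline lists below are stand-ins for the two files:

```python
import itertools

names = ["zhangwei", "wangwei", "wangfang", "liwei"]
passwords = ["123", "888", "999"]

# cycle() repeats the passwords endlessly; zip stops when the names run out,
# so each name is paired with the next password in round-robin order.
pairs = [name + pw for name, pw in zip(names, itertools.cycle(passwords))]
print(pairs)  # ['zhangwei123', 'wangwei888', 'wangfang999', 'liwei123']
```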
Recall first answered 4 weeks ago

Simply put, add a line counter: increment it for each line, and when it reaches 5, close the current file, open a new one, and reset the counter to 0.
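A minimal sketch of that counter idea, chunking file2 five lines at a time (file names follow the question; the flush helper and the sample data are made up for illustration):

```python
# Build small sample inputs (stand-ins for the real file1/file2).
with open("file1.txt", "w") as f:
    f.write("zhangwei\nwangwei\n")
with open("file2.txt", "w") as f:
    f.write("123\n888\n999\n666\n2015\n2016\n521\n")

def flush(chunk, index):
    # Hypothetical helper: combine every name with this chunk of passwords.
    with open("file1.txt") as f1, open("zidian{}.txt".format(index), "w") as out:
        for name in f1:
            for pw in chunk:
                out.write(name.strip() + pw + "\n")

chunk, out_index = [], 0
with open("file2.txt") as f2:
    for line in f2:
        chunk.append(line.strip())
        if len(chunk) == 5:        # the counter hit 5: write and start a new file
            flush(chunk, out_index)
            out_index += 1
            chunk = []
if chunk:                          # leftover lines (fewer than 5) get a final file
    flush(chunk, out_index)
```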