Shell script: several methods for shuffling the contents of a file (the shuffle problem)

Time: 2022-6-19

Shuffle: what is a good way to shuffle a deck of playing cards, so that the result is evenly mixed and the shuffle itself is fast? Applied to a file, the question becomes: how do we efficiently put its lines into random order?

ChinaUnix really is a place where shell experts gather; for almost any question you can think of, an answer can be found there. r2007 gave an ingenious method: use the shell's $RANDOM variable to append a random number to each line of the original file, sort by that random number, and then filter the temporary number back out. The resulting file is the original, randomly "shuffled" once:

The code is as follows:
while read i; do echo "$i $RANDOM"; done < file | sort -k2n | cut -d" " -f1

Of course, if the lines of your source file are more complex (for example, if they contain spaces), you will need to adapt this code, but once you know the key trick the remaining details are easy to work out.
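As one possible adaptation (a sketch, assuming the only complication is that lines contain spaces, and keeping the input name file from the code above), you can prepend the random key instead of appending it, sort on the first field, and then keep everything from the second field onward:

while IFS= read -r line; do
  printf '%s %s\n' "$RANDOM" "$line"
done < file | sort -k1,1n | cut -d" " -f2-

Putting the key in front keeps its field position fixed no matter what the line contains, which is why the final cut switches from -f1 to -f2-.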

Another article, by Undstand, "Using awk to achieve the shuffle effect of random file sorting: code analysis", together with its follow-up discussion (if you don't have a forum account, you can read the copy in the essence area instead), goes into more detail:
——————————————————————–
In fact, there is a good shell solution to the shuffle problem. Here are three other methods based on awk. Please point out any mistakes.

Method 1: exhaustion

This is essentially brute force: a hash records how many times each randomly drawn line number has come up, and a line is printed only the first time its number is drawn. That prevents duplicates, but the repeated draws of already-printed line numbers add overhead (on average roughly N ln N draws are needed to hit all N lines).

The code is as follows:
awk -v N=`sed -n '$=' data` '
BEGIN{
  FS="\n";   # treat each line as a field
  RS=""      # paragraph mode: with no blank lines, the whole file is one record
}
{
  srand();
  while(t!=N){
    x=int(N*rand()+1);   # draw a random line number between 1 and N
    a[x]++;
    if(a[x]==1){         # print the line only the first time its number is drawn
      print $x; t++
    }
  }
}
' data

Method 2: transformation

This method is based on swapping array elements: each line is stored in an array, and the array contents are then shuffled by exchanging elements at randomly chosen indices. It is more efficient than method 1.

The code is as follows:
#!/usr/bin/awk -f

BEGIN{
  srand();
}

{
  b[NR]=$0;        # store every line, indexed by its line number
}

END{
  C(b,NR);         # shuffle the array in place
  for(x in b){
    print b[x];
  }
}

# Swap a randomly chosen pair of elements once per element in the array.
# i, j, t and x are listed as extra parameters so they stay local.
function C(arr,len,i,j,t,x){
  for(x in arr){
    i=int(len*rand())+1;
    j=int(len*rand())+1;
    t=arr[i];
    arr[i]=arr[j];
    arr[j]=t;
  }
}
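Assuming the script above is saved as awkfile (the same name used in the timing test further down; both file names are just placeholders), a shuffle run would look like this, writing to a hypothetical output file shuffled:

awk -f awkfile data > shuffled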

Method 3: hash

This is the best of the three methods.
Using awk's hash (associative array) feature (see section 7.x of info gawk for details), you only need to construct a random, non-repeating key for each line. Since the line number of each line of a file is unique, the mapping

random number + line number of the line  ------->  contents of that line

is exactly the random function we need.
Thus:

The code is as follows:
awk 'BEGIN{srand()}{b[rand()NR]=$0}END{for(x in b)print b[x]}' data

In fact, you don't need to worry too much about memory consumption. Here is a test:

Test environment:

PM 1.4 GHz CPU, 40 GB hard disk, 256 MB memory (laptop)
SUSE 9.3, GNU bash 3.00.16, GNU awk 3.1.4

Generate a random test file of more than 500,000 lines, about 38 MB:

The code is as follows:
od /dev/urandom | dd count=75000 > data    # 75000 blocks x 512 bytes (default bs) is roughly 38 MB

Take the less efficient method 1 as an example.

Time for one shuffle:

The code is as follows:
time awk -v N=`sed -n '$=' data` '
BEGIN{
  FS="\n";
  RS=""
}
{
  srand();
  while(t!=N){
    x=int(N*rand()+1);
    a[x]++;
    if(a[x]==1){
      print $x; t++
    }
  }
}
' data

Results (file contents omitted):

The code is as follows:
real    3m41.864s
user    0m34.224s
sys     0m2.102s

So the efficiency is just acceptable.

Test of method 2:

The code is as follows:
time awk -f awkfile datafile

Results (file contents omitted):

The code is as follows:
real    2m26.487s
user    0m7.044s
sys     0m1.371s

The efficiency is clearly better than that of method 1.

Next, consider the efficiency of method 3:

The code is as follows:
time awk 'BEGIN{srand()}{b[rand()NR]=$0}END{for(x in b)print b[x]}' data

Results (file contents omitted):

The code is as follows:
real    0m49.195s
user    0m5.318s
sys     0m1.301s

That is quite good for a 38 MB file.
——————————————————————–
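Whichever method you use, a quick sanity check is worthwhile: a correct shuffle is a pure permutation of the input, so sorting the original and the shuffled output must give identical results (with method 3, for instance, two lines could in rare cases produce the same concatenated key, in which case one would silently overwrite the other). A sketch, assuming bash for the process substitution and shuffled as a hypothetical output name:

awk 'BEGIN{srand()}{b[rand()NR]=$0}END{for(x in b)print b[x]}' data > shuffled
diff <(sort data) <(sort shuffled) && echo "same set of lines"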

Attached is a Python version of the shuffle code, written by flyfly:

The code is as follows:
#coding:gb2312
import sys
import random

def usage():
    print "usage: program srcfilename dstfilename"

filename = ""
try:
    filename = sys.argv[1]
except IndexError:
    usage()
    raise

# open the phonebook file
f = open(filename, 'r')
phonebook = f.readlines()
print phonebook
f.close()

# write the lines back out in random order
try:
    filename = sys.argv[2]
except IndexError:
    usage()
    raise

f = open(filename, 'w')
random.shuffle(phonebook)
f.writelines(phonebook)
f.close()
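Assuming the script is saved as shuffle.py (a hypothetical name; the script itself only looks at sys.argv[1] and sys.argv[2]), it is run like this:

python shuffle.py phonebook.txt phonebook_shuffled.txt    # requires Python 2 (print statement syntax)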