Shuffling: what is a good way to shuffle a deck of playing cards? Can it be done evenly and quickly? Put in terms of a file: how do you efficiently put its lines into random order?
ChinaUnix is indeed a place where shell experts gather; whatever question you have, you can usually find the answer there. r2007 gave an ingenious method: use the shell's $RANDOM variable to prepend a random number to each line of the original file, sort by that random number, and then strip the temporary number off again. The resulting file is equivalent to the original having been randomly shuffled once:
Of course, if the lines of your source file are more complex, you will have to adapt this code, but once you know the key technique, the remaining problems are not hard to solve.
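The tag, sort, strip idea can also be sketched in Python. This is not r2007's shell code, only an illustration of the same technique; the function name `shuffle_lines` and the optional `rng` parameter are hypothetical, and `random.random()` stands in for `$RANDOM`.

```python
import random


def shuffle_lines(lines, rng=random):
    """Shuffle by tagging each line with a random key, sorting, then dropping the key."""
    # Decorate: pair every line with a random number (the stand-in for $RANDOM).
    tagged = [(rng.random(), line) for line in lines]
    # Sort on the random key only, so the line content never influences the order.
    tagged.sort(key=lambda pair: pair[0])
    # Undecorate: strip the temporary key again.
    return [line for _, line in tagged]
```

Sorting on the key alone is the part that matters when lines are complex: the content is carried along untouched, so it never has to be parsed back out of a combined string.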
Another article, from Undstand, analyzes awk code that shuffles the lines of a file into random order, and there is a follow-up discussion as well (if you don't have a login account, you can read the article in the essence area). It is quoted below for more details:
——————————————————————–
In fact, there is a good shell solution to the shuffle problem. Here are three other methods based on awk. Please point out any mistakes.
Method 1: exhaustive
This works like an exhaustive search: a hash records how many times each line number has been drawn. A line is printed only the first time its number is drawn, and every later draw of the same number is skipped. This prevents duplicates, but the wasted draws add overhead.
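# N must hold the number of lines in the file; computing it with wc -l is an assumption.
awk -v N="$(wc -l < data)" '
BEGIN {
    FS = "\n"   # each line of the file becomes one field
    RS = ""     # paragraph mode: the file is one record (assumes no blank lines)
    srand()     # seed the generator once
}
{
    # Keep drawing line numbers until all N lines have been printed.
    while (t != N) {
        x = int(N * rand() + 1)
        a[x]++
        if (a[x] == 1) {    # first time this line number is drawn
            print $x
            t++
        }
    }
}
' data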
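The overhead of method 1 can be estimated, assuming every draw is uniform and independent. Once k lines have been printed, a draw hits a new line with probability (N-k)/N, so the expected total number of draws T is:

```latex
\[
E[T] \;=\; \sum_{k=0}^{N-1} \frac{N}{N-k}
      \;=\; N \sum_{j=1}^{N} \frac{1}{j}
      \;=\; N H_N
      \;\approx\; N \,(\ln N + 0.577)
\]
```

For N = 500000 this is roughly 6.8 million draws to print 500000 lines, which is why this method is the slowest of the three.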
Method 2: Transformation
This method is based on swapping array subscripts: each line is stored in an array, and the array contents are exchanged by repeatedly picking two random subscripts and swapping the elements at them. It is more efficient than method 1.
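BEGIN {
    srand()
}
{
    b[NR] = $0          # store every line under its line number
}
END {
    C(b, NR)
    # Print by index; "for (x in b)" would visit the keys in an unspecified order.
    for (x = 1; x <= NR; x++)
        print b[x]
}
# Swap len randomly chosen pairs of elements; i, j, t and x are local variables.
function C(arr, len,    i, j, t, x) {
    for (x = 1; x <= len; x++) {
        i = int(len * rand()) + 1
        j = int(len * rand()) + 1
        t = arr[i]
        arr[i] = arr[j]
        arr[j] = t
    }
}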
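Swapping randomly chosen pairs always yields a permutation, but for three or more lines the possible orderings are not all equally likely. If a uniform shuffle matters, the Fisher-Yates algorithm is a common alternative that uses the same swap-by-subscript idea. The following is a minimal Python sketch of that swapped-in technique, not the author's method; the name `fisher_yates` and the `rng` parameter are hypothetical.

```python
import random


def fisher_yates(items, rng=random):
    """Return a uniformly shuffled copy of items using the Fisher-Yates algorithm."""
    result = list(items)
    # Walk from the last slot down to the second one.
    for i in range(len(result) - 1, 0, -1):
        # Pick a slot at or before i (randint is inclusive on both ends) and swap.
        j = rng.randint(0, i)
        result[i], result[j] = result[j], result[i]
    return result
```

Each slot is swapped exactly once with a slot that has not been fixed yet, so the work stays linear in the number of lines, just like method 2.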
Method 3: hash
The best of the three methods.
Using awk's hash (associative array) feature (see 7.x in info gawk for details), you only need to construct a random, non-repeating hash key. Because the line number of each line in a file is unique, you can use:
random number + line number of each line  ->  the contents of that line
as the constructed random mapping.
Thus:
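The mapping described above can be sketched in Python as follows. This is an illustration of the idea rather than the original awk code; the name `hash_shuffle` and the `rng` parameter are hypothetical. One difference is an assumption worth stating: awk visits hash keys in an unspecified order, whereas a Python dict keeps insertion order, so the sketch sorts the random keys to obtain the shuffled order.

```python
import random


def hash_shuffle(lines, rng=random):
    """Shuffle lines by storing them under unique random keys and reading the keys back."""
    table = {}
    for number, line in enumerate(lines, start=1):
        # Key = random number + line number; the line number keeps every key unique
        # even if the same random number is drawn twice.
        key = (rng.random(), number)
        table[key] = line
    # Assumption: sort the keys, since a Python dict would otherwise
    # return the lines in their original (insertion) order.
    return [table[key] for key in sorted(table)]
```

Because the line number is part of the key, identical lines in the input are kept as separate entries rather than overwriting each other.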
In fact, you don't need to worry much about memory usage. You can run a test:
Test environment:
PM 1.4 GHz CPU, 40 GB hard disk, 256 MB memory (laptop)
SUSE 9.3, GNU bash version 3.00.16, GNU Awk 3.1.4
Generate a random file with more than 500000 lines, about 38 MB:
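One way to produce a comparable test file is sketched below in Python. It is not the command used for the original test; the function name `write_random_file`, the alphabet, and the line length of 75 characters (chosen so that 500000 lines come to about 38 MB) are all assumptions.

```python
import random
import string


def write_random_file(path, line_count, line_length=75, seed=None):
    """Write line_count lines of random letters and digits; return the bytes written."""
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits
    written = 0
    # newline="\n" keeps the line ending at one byte on every platform.
    with open(path, "w", encoding="ascii", newline="\n") as out:
        for _ in range(line_count):
            line = "".join(rng.choices(alphabet, k=line_length)) + "\n"
            out.write(line)
            written += len(line)
    return written
```

Calling `write_random_file("data", 500000)` writes 500000 lines of 76 bytes each, that is 38,000,000 bytes, to a file named `data`.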
Take the less efficient method 1 for example:
Time for one shuffle:
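# N must hold the number of lines in the file; computing it with wc -l is an assumption.
time awk -v N="$(wc -l < data)" '
BEGIN {
    FS = "\n"
    RS = ""
    srand()
}
{
    while (t != N) {
        x = int(N * rand() + 1)
        a[x]++
        if (a[x] == 1) {
            print $x
            t++
        }
    }
}
' data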
Results (file contents omitted):
user 0m34.224s
sys 0m2.102s
So the efficiency is only barely acceptable.
Test of method 2:
Results (file contents omitted):
user 0m7.044s
sys 0m1.371s
The efficiency is obviously better than the first one.
Next, consider the efficiency of method 3:
Results (file contents omitted):
user 0m5.318s
sys 0m1.301s
That is quite good for a 38 MB file.
——————————————————————–
A Python version of the shuffle, written by flyfly, is attached:
import random
import sys


def usage():
    print("usage: program srcfilename dstfilename")


def main():
    # Both the source and the destination file name are required.
    if len(sys.argv) < 3:
        usage()
        sys.exit(1)
    src, dst = sys.argv[1], sys.argv[2]

    # Read the phonebook file into a list of lines.
    with open(src, "r") as f:
        phonebook = f.readlines()

    # Shuffle the lines in place, then write them to the destination file.
    random.shuffle(phonebook)
    with open(dst, "w") as f:
        f.writelines(phonebook)


if __name__ == "__main__":
    main()