foreachPartition() under PySpark writes data to HBase, but not all of the data is written
Kaka asked 3 months ago

1. Problem description

While using PySpark, data written to HBase goes missing: when happybase writes each partition to HBase inside the foreachPartition() method, only a small part of the data ends up in HBase rather than all of it.

2. The business code

articleVector holds the article vectors and similar holds the pairwise similarity between articles. The structure of the article_vector table is as follows:

create temporary table article.article_vector
(
    id            string comment 'id',
    major_id      int comment 'major_id',
    vector array<string> comment 'keyword vector'
);

The code that computes similar:

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import BucketedRandomProjectionLSH

articleVector = spark.sql("select * from article_vector")

def toVector(row):
    # Turn the array<string> keyword vector into a dense numeric vector.
    return row.id, Vectors.dense(row.vector)

train = articleVector.rdd.map(toVector).toDF(["id", "vector"])

# Hash vectors into buckets; only vectors sharing a bucket become candidate pairs.
brp = BucketedRandomProjectionLSH(inputCol='vector', outputCol='hashes',
                                  seed=12345, bucketLength=1.0)
model = brp.fit(train)

# Self-join: keep candidate pairs whose Euclidean distance is at most 2.0.
similar = model.approxSimilarityJoin(train, train, 2.0, distCol='EuclideanDistance')
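
approxSimilarityJoin returns a DataFrame whose datasetA and datasetB columns are structs holding the original rows (plus the model's hashes column), with the distance in EuclideanDistance; the write code below relies on that shape. A quick check (a minimal sketch, showing nothing beyond what the join above already produces):

# Inspect the join output before writing it anywhere.
similar.printSchema()
# root
#  |-- datasetA: struct (id, vector, hashes)
#  |-- datasetB: struct (id, vector, hashes)
#  |-- EuclideanDistance: double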

The code that stores similar into HBase:

import happybase

def save_hbase(partition):
    # Build the pool inside the function: it runs on the executors and
    # cannot share a driver-side connection object.
    pool = happybase.ConnectionPool(size=10, host='hbase-url')

    with pool.connection() as conn:
        article_similar = conn.table('article_similar')
        for row in partition:
            # Row key: the article id; column: similar:<other id> -> distance.
            article_similar.put(
                str(row.datasetA.id).encode(),
                {'similar:{}'.format(row.datasetB.id).encode():
                 b'%0.4f' % row.EuclideanDistance})
        # No explicit conn.close() here: the context manager returns the
        # connection to the pool when the block exits.

similar.foreachPartition(save_hbase)

3. The specific issue

The article_vector table holds about 1.2 million rows. The similarities computed from it form similar, but save_hbase() then misbehaves: the program runs without reporting any error and the Spark logs show nothing abnormal, yet the article_similar table in HBase ends up with only about 60,000 records. Logically, the number of IDs stored in HBase should match the row count of article_vector, and each ID's similar articles should be retrievable from HBase. In fact only about 60,000 IDs are stored, and similarity information can be found for only those IDs. Why is this? Is something going wrong inside happybase during the write?
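
Before blaming the write path, it is worth counting what foreachPartition is actually asked to write. A hedged diagnostic sketch, using only names from the code above: note that put() calls sharing a row key merge into the same HBase row, so the table's row count tracks the number of distinct datasetA.id values, not the number of pairs.

# If the pair count is already small, the loss happens in the LSH join,
# not in happybase.
similar.cache()
print("pairs from approxSimilarityJoin:", similar.count())
print("distinct row keys (datasetA.id):",
      similar.select("datasetA.id").distinct().count())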

1 Answer
Kaka answered 3 months ago

This is unrelated to happybase. The LSH bucket length is set too small: increase bucketLength in BucketedRandomProjectionLSH, and also raise the Euclidean distance threshold passed to approxSimilarityJoin. For details, see the source of the BucketedRandomProjectionLSH class in pyspark.ml.feature.
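
A minimal sketch of that suggestion (the values 10.0 and 5.0 are illustrative, not from the answer; tune them against your data):

# Larger buckets -> more vectors land in the same bucket, so more
# candidate pairs survive the LSH stage.
brp = BucketedRandomProjectionLSH(inputCol='vector', outputCol='hashes',
                                  seed=12345, bucketLength=10.0)  # was 1.0
model = brp.fit(train)

# A looser distance threshold keeps more of those candidates in the result.
similar = model.approxSimilarityJoin(train, train, 5.0,  # was 2.0
                                     distCol='EuclideanDistance')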