Interview Questions for Big Data Engineers (1)

  1. Tell me about the first project.
  1. Optimization of shuffle in hive

    1. compress
      Compression can reduce the amount of data stored on disk and improve query speed by reducing I/O.

      Enable compression for a series of MR intermediate processes generated by hive

      set hive.exec.compress.intermediate=true;

      Compression of the final output (files written to hdfs, local disk)

      set hive.exec.compress.output=true;
    2. Join optimization

      1. map join
If one of the two tables in the associated query has a small table default map join, add the small table to memory
    Hive. mapjoin. smalltable. filesize = 25000000 default size
    Hive. auto. convert. join = true is turned on by default
    If using map join is not enabled, use statements to make tabs and use map join
    select /*+ MAPJOIN(time_dim) */ count(1) from
    store_sales join time_dim on (ss_sold_time_sk = t_time_sk)

    2. smb join
    Sort-Merge-Bucket join
    Solving the problem of slow joining speed between large tables and large tables
    The number of barrels is divided by the hash value of the barrel field.

    3. Tilt connection
    <! -- hive. optimize. skewjoin: Whether to create a separate execution plan for the tilt keys in the join table. It is based on skewed keys stored in metadata. At compile time, Hive generates its own query plan for tilt keys and other key values. >
    <property> <name>hive.optimize.skewjoin</name> 
    </property> <property>

    <! - hive. skewjoin. key: Decides how to determine the tilt key in the connection. In connection operation, if the number of data rows corresponding to the same key value exceeds the parameter value, the key is considered to be an inclined connection key. >

    <! - hive. skewjoin. mapjoin. map. tasks: Specifies the number of tasks used for Map connection jobs in a tilted connection. This parameter should be used with hive. skewjoin. mapjoin. min. split to perform fine-grained control. >
    <property> <name></name> 

    <! - hive. skewjoin. mapjoin. min. split: Determines the number of tasks for a Map connection job by specifying the minimum split size. This parameter should be used with hive. skewjoin. mapjoin. map. tasks to perform fine-grained control. >
  1. How Hive Solves Data Tilt in Clustering Process

Essential Cause: Uneven Key Distribution

Map end part aggregation, equivalent to Combiner

Load balancing with data skew


When the selection is set to true, the generated query plan will have two MR Jobs. In the first MR Job, the output result set of Map is randomly distributed to Reduce. Each Reduce performs a partial aggregation operation and outputs the result, so that the same Group By Key may be distributed to different Reduces to achieve the purpose of load balancing. The second MR Job distributes to Reduce according to the preprocessed data result according to Group By Key. This process ensures that the same Group By Key is distributed in the same Reduce, and finally completes the final aggregation operation.

  1. Sqoop will import all tables in the database, how to operate? What parameters? Incremental import?

Full scale import

[[email protected] sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
> --connect jdbc:mysql:// \
> --username root \
> --password 123456 \
> --table user

Incremental import

bin/sqoop import \
--connect jdbc:mysql:// \
--username root \
--password 123456 \
--table user \
--fields-terminated-by '\t' \
--target-dir /sqoop/incremental \
-m 1 \
--direct \
--check-column id \
--incremental append \
--last-value 3
  1. The possibility of data skew caused by hive (which operations lead to) > How to solve the problem of uneven distribution of barrel join keys with a large number of empty values?

    Data skew may be caused by uneven distribution of results according to key operation, such as group by key

    Order by uses global sorting, which ultimately only runs all data on a reducer, resulting in data skewing

    Large number of NULL

    Hive’s NULL is sometimes necessary:

    1) The insert statement in hive must match the number of columns. It does not support not writing. Columns without values must use null placement.

    2) Hive tables are divided into columns by delimiters in their data files. An empty column saves NULL (n) to keep the column location. However, when external tables load some data, if there are not enough columns, such as table 13 and only two columns of file data, the remaining columns at the end of the table do not correspond to the data and are automatically displayed as NULL.

    Therefore, NULL can be converted into empty strings, which can save disk space. There are several ways to achieve this.
    1) Directly specify when building tables (in two ways)

    A. Sentences
    ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’
    with serdeproperties('serialization.null.format' = '')
        To achieve this, note that both must be used together, such as
    CREATE TABLE hive_tb (id int,name STRING)
    PARTITIONED BY ( `day` string,`type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', `hour` tinyint)
    ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’
    CREATE TABLE hive_tb (id int,name STRING)
    PARTITIONED BY ( `day` string,`type` tinyint COMMENT '0 as bid, 1 as win, 2 as ck', `hour` tinyint)
            NULL DEFINED AS '' 

    2) Modify existing tables

    alter table hive_tb set serdeproperties('serialization.null.format' = '');
  2. How to add a column of data in hive?
    Add a new column

    hive > alter table log_messages add coloumns(
    app_name string comment 'Application name',
    session_id long comment 'The current session id'
    -- After adding the last field of the column table, add it before partitioning the field.

    If a column of new_column is added to the table, it is not feasible to insert the column of new_column directly into the original table.

    If a new column is partitioned, additional data can be added to the partition

    insert into table clear partition(date='20150828',hour='18')   select id,url,guid from tracklogs where date='20150828' and hour='18'; 
  3. Run spark
  4. Has hive ever handled json? What functions are there?

    1. Formulate jar packages to process JSON data when creating tables

      1. First add the jar package
ADD JAR hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.14.2.jar;

    2. build tables
    hive (default)> ADD JAR hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.14.2.jar;
    Added [hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.14.2.jar] to class path
    Added resources: [hcatalog/share/hcatalog/hive-hcatalog-core-1.1.0-cdh5.14.2.jar]
    hive (default)> create table spark_people_json( 
                > `name` string,
                > `age`   int)
                > ROW FORMAT SERDE ''
                > STORED AS TEXTFILE;
    Time taken: 4.445 seconds        

2. Record how you want to get a value inside if only a field is json.
    1. get_json_object()

    Only one field can be obtained
    select get_json_object('{"shop":{"book":[{"price":43.3,"type":"art"},{"price":30,"type":"technology"}],"clothes":{"price":19.951,"type":"shirt"}},"name":"jane","age":"23"}', '$[0].type');

    2. json_tuple() 

    You can get multiple fields
    select json_tuple('{"name":"jack","server":""}','server','name')

    3. Write UDF by yourself
  1. How can a sparkstreaming running program stop? How can it stop safely? How can the code be updated, and how can the running and updated code be alternated?

    Upgrade application code

    If you need to upgrade a running Spark Streaming application with new application code, there are two possible mechanisms.

    1. The upgraded Park Streaming application starts and runs in parallel with existing applications. Once the new (receiving the same data as the old) has been preheated and ready for prime time, the old can be put down. Note that this can be used to support data sources that send data to two targets, early and upgraded applications.
    2. Existing applications are gracefully closed (see StreamingContext. stop (…) or Java StreamingContext. stop (…) for the graceful closure option) to ensure that the received data is processed completely before closing. Then you can start the upgraded application, which will start processing at the same point that the earlier application stopped. Note that this can only be done by input sources that support source-side buffering, such as Kafka and Flume, because data needs to be buffered when the previous application shuts down and the upgraded application has not yet started. And it is not possible to restart the information of the pre-upgrade code from the early checkpoint. Checkpoint information basically contains serialized Scala / Java / Python objects, and attempts to use new modified classes to serialize objects may lead to errors. In this case, either start the upgraded application with a different checkpoint directory or delete the previous checkpoint directory.
  1. How to achieve accurate data consumption in sparkstreaming and Kafka integration?

    Using Direct Connection

  2. How many kinds of message semantics are there?

    1. At least once — Messages are never lost, but may be transmitted repeatedly
    2. At most once — Messages may be lost, but never transmitted repeatedly
    3. Just once — Every message must be transmitted once and only once, which is what users want most of the time.
  3. How can consumers of Kafka ensure accurate consumption?
  1. How many ways do sparkstreaming and Kafka integrate?

    1. Receiver-based Approach
    2. Direct Approach (No Receivers) native Offsets
  2. How to implement sparkstreaming?

    Initialize StreamingContext

    Define the input source by creating the input DStreams.

    Stream computing is defined by applying conversion and output operations to DStream.

    Start receiving data and use it to process streamingContext. start ().

    Waiting for processing to stop (manually or due to any error) using streamingContext. awaitTermination ().

    You can use manual stop processing streamingContext. stop ().

  3. How many people are allocated in the project?
  4. Storage architecture in your project?
  5. What are the development tools? (what tools do you use? /xshell/idea)
  6. How does the code do management (git)?
  7. Analysis function in hive?

    1. Window function
    You can choose to specify the number of rows to boot. If the number of rows to boot is not specified, the lead is a row.
    Returns null when the leading of the current line exceeds the end of the window.
    hive (default)> desc function lead;
    LEAD (scalar_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause); The LEAD function is used to return data from the next row. 
    You can choose to specify the number of rows that are delayed. If the number of lagged rows is not specified, the lag is one row.
    When the delay of the current row extends before the window starts, null is returned.
    hive (default)> desc function lag;
    LAG  (scalar_expression [,offset] [,default]) OVER ([query_partition_clause] order_by_clause); The LAG function is used to access data from a previous row.
    This requires at most two parameters. The first parameter is the column you want the first value, and the second (optional) parameter must be the default Boolean value of false. If set to true, the null value is skipped.
    This requires at most two parameters. The first parameter is the column of the last value you want, and the second (optional) parameter must be the default Boolean value of false. If set to true, the null value is skipped.

2. OVER Sentences
OVER Standard Polymerization:


Use the PARTITION BY statement with one or more partition columns of any original data type.

Use PARTITION BY and ORDER BY with one or more partitions and/or sorted columns of any data type.

Over with windows is specified. Windows can be defined separately in the WINDOW clause. The window specification supports the following formats:
The WINDOW clause is missing when ORDER BY is specified, and the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING and CURRENT ROW.

When ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING.

The OVER clause supports the following functions, but it does not support windows with them (see HIVE-4797):

Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.

Lead and Lag functions.

3. Analytical function
CUME_DIST: CUME_DIST is less than or equal to the number of rows in the current value / total number of rows in the group
NTILE(2) over(partition by c_id order by s_score)
from score

4. Distinct is supported in Hive 2.1.0 and later (see HIVE-9534)

Aggregation functions support Distinct, including SUM, COUNT and AVG, which aggregate on different values within each partition. Current implementations have the following limitations: for performance reasons, ORDER BY or window specifications cannot be supported in partitioning clauses. The supported grammar is as follows.
ORDER BY and window specification are supported in Hive 2.2.0 (see HIVE-13453). An example is as follows.

5. Aggregation functions in OVER clause support in Hive 2.1.0 and later (see HIVE-13475)

Added support for referencing aggregation functions in OVER clauses. For example, we can now use SUM aggregation functions in the OVER clause, as shown below.

SELECT rank() OVER (ORDER BY sum(b))
  1. What functions do common strings use?

    1. String length function: length
    2. String inversion function: reverse
    3. String connection function: concat
    4. String join function with delimiter: concat_ws
    hive (practice)> select concat_ws('|','abc','def','gh');
    1. String interception function: substr, substring

      substr(string A, int start),substring(string A, int start)

    2. String interception function: substr, substring

      substr(string A, int start, int len),substring(string A, int start, int len)

    3. String capitalization function: upper, ucase
    4. String to lowercase function: lower, lcase
    5. Despace function: trim
    6. Left de-space function: ltrim
    7. Right-side de-space function: rtrim
    8. Regular expression replacement function: regexp_replace

      Syntax: regexp_replace (string A, string B, string C)

      Return value: string

      Description: Replace the Java regular expression B part of string A with C. Note that in some cases escape characters are used, similar to the regexp_replace function in oracle.

    9. Regular expression analytic function: regexp_extract

      Syntax: regexp_extract (string subject, string pattern, int index)

      Return value: string

      Description: Split the string subject according to the rule of pattern regular expression and return the character specified by index.

    10. URL analytic function: parse_url
    11. JSON analytic function: get_json_object
    12. Space string function:
    13. Repeat string function: repeat
    hive (practice)> select repeat('abc',5);
    1. The first character ASCII function: ASCII
    2. Left complement function: lpad
    3. Right complement function: rpad
    4. Split string function: split
    5. Set lookup function: find_in_set
  1. How to count the PV on the first day of a month every Monday?

    Gets the first day of the specified date, month, and year

    select trunc('2019-02-24', 'YYYY');
    select trunc('2019-02-24', 'MM');

    Designated Date The Designated Week of Next Week

    select next_day('2019-02-24', 'TU');

    Returns the date of the specified date in the specified format after several months

    select add_months('2019-02-28', 1);
    select add_months('2019-02-24 21:15:16', 2, 'YYYY-MM-dd HH:mm:ss');

    select count(guid) from table group by trunc(date, ‘MM’)

    select count(guid) from table group by next_day(‘2019-06-08’, ‘MONDAY’);

  2. What is the difference between dataset and dataframe?

    After Spark 2.x, the official unification of the DataFrame / Dataset API has been implemented. DataFrame is only when every element in Dataset is of Row type.

    The difference is that Dataset is strongly typed, while dataframe is untypedrel.

  3. Where is the metadata of hive stored in the project?

    1. If there is no plan to save the Metastore to the specified database, the Metastore will be stored in the Deybe database of hive by default, which is the embedding mode of installing hive.
    2. If an external database is created in the settings, it is saved in the database, and Metastore is saved in local and remote modes in this way.
  4. How does metadata ensure its security?

    Username and password used to modify metadata


    Setting access rights to Metastore database on MySQL side

  5. How many ways do sqoop import and export? Incremental export?


    Full scale import
    [[email protected] sqoop-1.4.5-cdh5.3.6]$ bin/sqoop import \
    > --connect jdbc:mysql:// \
    > --username root \
    > --password root \
    > --table toHdfs \
    > --target-dir /toHdfs \
    > --direct \
    > --delete-target-dir \
    > --fields-terminated-by '\t' \
    > -m 1
    Incremental import append
    bin/sqoop import \
    --connect jdbc:mysql:// \
    --username root \
    --password 123456 \
    --table user \
    --fields-terminated-by '\t' \
    --target-dir /sqoop/incremental \
    -m 1 \
    --direct \
    --check-column id \
    --incremental append \
    --last-value 3
    Incrementally import LastModified
    There must be a list of indicated times in the table
    sqoop import \
    --connect jdbc:mysql://master:3306/test \
    --username hive \
    --password 123456 \
    --table customertest \
    --check-column last_mod \
    --incremental lastmodified \
    --last-value "2016-12-15 15:47:29" \
    -m 1 \


    To update
    根据目标数据库,如果要To update已存在于数据库中的行,或者如果行尚不存在则insert行,则还可以--update-mode 使用allowinsert模式指定参数
  1. Sparkstreaming saves too many small files by batch hdfs?

    Use window functions to specify long enough windows to process data, so that the total amount of data is large enough (preferably around the size of a block), and write the data to HDFS using foreach RDD after completion.

  2. What are the responsible data types in hive?

    1. Numerical type








    2. Character




    3. Date type



    4. Complex type


      create table hive_array_test (name string, stu_id_list array<INT>)
      -'FIELDS TERMINATED BY': A separator between fields
      -'COLLECTION ITEMS TERMINATED BY': A separator for each item in a field
      [[email protected] ~]$ vi hive_array.txt 
      load data local inpath '/home/chen/hive_array.txt' into table hive_array_test;
      hive (default)> select * from hive_array_test;
      OK    hive_array_test.stu_id_list
      0601    [1,2,3,4]
      0602    [5,6]
      0603    [7,8,9,10]
      0604    [11,12]
      Time taken: 0.9 seconds, Fetched: 4 row(s)

      MAP<primitive_type, data_type>

      create table hive_map_test (id int, unit map<string, int>)
      'MAP KEYS TERMINATED BY': key value separator
      [[email protected] ~]$ vi hive_map.txt
      0       Chinese:100,English:80,math:59
      1       Chinese:80,English:90
      2       Chinese:100,English:100,math:60
      load data local inpath '/home/chen/hive_map.txt' into table hive_map_test;
      hive (default)> select * from hive_map_test;
      OK        hive_map_test.unit
      0       {"Chinese":100,"English":80,"math":59}
      1       {"Chinese":80,"English":90}
      2       {"Chinese":100,"English":100,"math":60}
      Time taken: 0.204 seconds, Fetched: 3 row(s)
      hive (default)> select id, unit['math'] from hive_map_test;
      id      _c1
      0       59
      1       NULL
      2       60
      Time taken: 0.554 seconds, Fetched: 3 row(s)

      STRUCT<col_name : data_type [COMMENT col_comment], …>

      create table hive_struct_test(id int, info struct<name:string, age:int, height:float>)
      [[email protected] ~]$ vi hive_struct.txt
      load data local inpath '/home/chen/hive_struct.txt' into table hive_struct_test;
      hive (default)> select * from hive_struct_test;
      0       {"name":"zhao","age":18,"height":178.0}
      1       {"name":"qian","age":30,"height":173.0}
      2       {"name":"sun","age":20,"height":180.0}
      3       {"name":"li","age":23,"height":183.0}
      Time taken: 0.153 seconds, Fetched: 4 row(s)
      hive (default)> select id, from hive_struct_test;
      id      name
      0       zhao
      1       qian
      2       sun
      3       li
      Time taken: 0.133 seconds, Fetched: 4 row(s)
  1. How to load the data import and export in hive? Where to save it?

    Import:Load data [local] inpath'path'overwrite into table table name

    Export:insert overwrite [local] directory '/home/hadoop/data' select * from emp_p;

    Local: Local is loaded locally and HDFS is loaded without local

  2. How to create RDD?

    1. Parallelization inDriver terminalExisting data sets, not parallelize executor-side data
    scala> var data = Array(1, 2, 3, 4, 5)
    data: Array[Int] = Array(1, 2, 3, 4, 5)
    scala> val rdd = sc.parallelize(data)
    rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26
    1. Referencing or reading data sets from an external storage system, such as HDFS, Hbase, or any subclass of Hadoop inputFormat
    scala> sc.textFile("student.log")
    res0: org.apache.spark.rdd.RDD[String] = student.log MapPartitionsRDD[1] at textFile at <console>:25
    1. From an existing RDD, call the transformation operator to generate a new sub-RDD
  3. How to deal with the memory overflow of the running driver?
  • An application is running. How do you modify its displacement calculation when it is down?
  • Hadoop compression format

    bin/hadoop checknative  -a
    [[email protected] hadoop-2.6.0-cdh5.14.2]$ bin/hadoop checknative  -a
    19/06/05 19:15:45 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
    19/06/05 19:15:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    Native library checking:
    hadoop:  true /opt/modules/hadoop-2.6.0-cdh5.14.2/lib/native/
    zlib:    true /lib64/
    snappy:  true /opt/modules/hadoop-2.6.0-cdh5.14.2/lib/native/
    lz4:     true revision:10301
    bzip2:   true /lib64/
    openssl: true /usr/lib64/
  • Sparksql processed data frame results to be stored in the database, what should be done?

        .jdbc("jdbc:mysql://centos01:3306/mydb", "emp", prop)
  • Optimization of shuffle in MapReduce process?

    Shuffle process: map side: ring buffer (up to 80%) - "overwrite (partition, sorting) - -" combiner - "compress-"
                    Reduce side: --"merge - --" sort - "group"
        Combiner can be turned on optionally, making a small reduction on the map side
        Compressor can choose open zone to compress the result and reduce IO
        Hash Partitioner is used in shuffle partitioning, the same key will enter the same reduce, the uneven distribution of key will lead to data skew, reference data skew optimization process
  • The problem of hive quadratic ranking?

    Order by: globally ordered, the final data goes into a reduce and is not recommended

    Sort by: Locally ordered, data bureaus in each reduce ordered

    distribute by

    Set up partitions by distribute by and sort by to set sort within partitions
    Distbuteby is often used with sort by

    Cluster by: distributeby and sort by are used when conditions are consistent

    Secondary sorting is followed by field names a, b, c…. Hive will be sorted by a first, if a is sorted by B the same, if B is sorted by C the same.

    select * from score order by score.s_id asc, score.s_score desc;
    score.s_id      score.c_id      score.s_score
    01      03      99
    01      02      90
    01      01      80
    02      03      80
    02      01      70
    02      02      60
    03      03      80
    03      02      80
    03      01      80
    04      01      50
    04      02      30
    04      03      20
    05      02      87
    05      01      76
    06      03      34
    06      01      31
    07      03      98
    07      02      89
    Time taken: 96.333 seconds, Fetched: 18 row(s)

    You can see why order by is not sorted.

  • How can rowkey be designed and designed?

    General Policy: Avoid data hotspots caused by HBase pre-matching mechanism that all data enter a region server or several regions servers


    Must be unique

    It is not recommended to use random numbers as rowkeys, but to design rowkeys according to actual business requirements.

    Can’t set too large to make storage space and index bigger

  • Sparkstreaming window function how to open a window, how long to open

    Interview Questions for Big Data Engineers (1)

    Opening window functions requires two parameters:

    Window Length - The duration of the window (3 in the figure).
    Sliding interval - The interval at which window operation is performed (2 in the figure).
    // Sliding window: There is overlap between two windows, the sliding time is less than the window time.
        val wc = res.reduceByKeyAndWindow(
    // Aggregation of different batches of data with the same key in the same window
        (a:Int, b:Int) => a + b,
    // Window width, representing the amount of data that a window calculates
    // Sliding time interval, representing how long it takes to compute a window's data
  • How are the three parameters of sparkstreaming window function designed?

    General window length is larger than sliding interval

    Increase the window width, and you can run batches with sparkstreaming for data manipulation in a window