Geohash precision and principle


Geohash precision and principle

Transferred from:

The basic principle of geohash is to understand the earth as a two-dimensional plane, recursively decompose the plane into smaller sub blocks, and each sub block has the same coding in a certain latitude and longitude range. This method is simple and rough, and can meet the latitude and longitude retrieval of small-scale data

1. General knowledge of longitude and latitude

  • Longitude is vertical, longitude is horizontal, used to represent different longitude, latitude is horizontal, latitude is vertical, used to represent different latitude, as shown in the following figure
  • Geohash precision and principle   Geohash precision and principle
  • Latitude: the horizontal line on the globe, lat. the equator is the largest latitude line. From the equator, it is divided into north latitude and south latitude, both of which are 0-90 degrees. The latitude line is an angular value, not a meter;
  • Longitude: the vertical line on the globe, LNG, meridian is 0 degrees, divided into west longitude and east longitude, both are 0-180 degrees, longitude is also the angle value;
  • Conversion of longitude and latitude lines and meters: longitude or latitude 0.00001 degrees is about 1 meter. When GPS measures the distance, we can realize that as long as GPS is accurate to five decimal places, it is within 10 meters
  • The position of longitude 0 is the prime meridian, and the position of latitude 0 is the equator
  • For easy understanding, the earth is regarded as a coordinate system based on longitude and latitude lines. The latitude is the circumference of those planes parallel to the equatorial plane, and the longitude is the semicircle of the great circle connecting the north and south poles. Latitude is divided into north latitude (positive) and south latitude (negative), and the latitude value of the equator is 0. Longitude is divided into east longitude (positive) and west longitude (negative) according to the boundary of primary meridian (primary meridian longitude is 0). So latitude range can be expressed as [- 90o, 0o), (0o, 90o], longitude range can be expressed as [- 180o, 0o), (0o, 180o]
  • Geohash precision and principle
  • Consider the longitude and latitude as two latitudes in the two-dimensional coordinate system. The horizontal axis represents the longitude and the vertical axis represents the latitude, as shown in the figure above
2. Get to know geohash

  • Geohash converts two-dimensional longitude and latitude into strings. For example, the following figure shows the geohash strings of nine regions in Beijing, namely wx4er, wx4g2, wx4g3, etc. each string represents a rectangular region. In other words, all the points (longitude and latitude coordinates) in the rectangular area share the same geohash string, which can protect privacy (only represent the approximate location of the area, not the specific points), and is easier to cache.
  • Geohash precision and principle
  • Different encoding lengths represent different ranges. The longer the string is, the more accurate the range is
  • String similarity means that the distance is close (special case will be explained later), so we can use the prefix matching of string to query the nearby POI information. As shown in the following two figures, one is in the urban area, and the other is in the suburb. The geohash strings in the urban area are similar, and the strings in the suburb are also similar, while the geohash strings in the urban area and the suburb are less similar
  • Geohash precision and principle
  • Conclusion: geohash is a method to convert longitude and latitude into string, and in most cases, the more string prefix matches, the closer the distance
3. Geohash algorithm

The algorithm is illustrated by longitude and latitude: (116.389550, 39.928167), and the latitude 39.928167 is approximated and coded (the latitude interval of the earth is [- 90,90])

  1. The interval [- 90,90] is divided into [- 90,0], [0,90], which is called left and right interval. It can be determined that 39.928167 belongs to the right interval [0,90], and is marked as 1
  2. Then divide the interval [0,90] into [0,45], [45,90], and 39.928167 belongs to the left interval [0,45], which is marked as 0
  3. The recursive process 39.928167 always belongs to an interval [a, b]. With each iteration, the interval [a, b] is always shrinking and getting closer to 39.928167
  4. If the given latitude x (39.928167) belongs to the left interval, record 0. If it belongs to the right interval, record 1. The length of the sequence is related to the number of times of partition of the given interval, as shown in the following figure
  5. Geohash precision and principle
  • Similarly, the longitude range of the earth is [- 180180], and the longitude 116.389550 can be encoded
  • Through the above calculation, the code generated by latitude is 11010010100010, and the code generated by longitude is 11011100001100011
  • Merge: put the even number of bits into longitude and the odd number into latitude, and combine the two codes to generate a new string, as shown in the figure below:
  • Geohash precision and principle
  • First, convert 11100, 11101, 00100, 01111, 0000, 01101 to decimal system, corresponding to 28, 29, 4, 15, 0, 13 decimal system, and the corresponding base32 code is wx4g0e, as shown in the following figure
  • Geohash precision and principle
  • In the same way, the decoding algorithm that converts the code into latitude and longitude is the opposite
4. Geohash principle

  • Geohash is actually to divide the whole map or a segmented area once, because it adopts the base32 coding method, that is, every letter or number in geohash (such as w in wx4g0e) is composed of 5 bits (2 ^ 5)= These 5bits can have different combinations (0 ~ 31) in 32. In this way, we can divide the whole map area into 32 areas, and identify these 32 areas by 00000 ~ 11111. The situation after the first division of the map is shown in the figure below (the number in each area corresponds to the corresponding code of the area)
  • Geohash precision and principle
  • The sequence of 0 and 1 in geohash is arranged alternately by the numbers in the sequence of longitude 0 and 1 and the sequence of latitude 0 and 1. The sequence corresponding to even digits is longitude sequence, and the sequence corresponding to odd digits is latitude sequence. During the first partition, the first five bits (11100) in the sequence of geohash 0 and 1, then three bits in the five bits represent longitude and two bits represent latitude, so the first partition The longitude is divided into 8 sections (2 ^ 3 = 8) and the latitude into 4 sections (2 ^ 2 = 4), thus forming 32 regions. As shown in the figure below
  • Geohash precision and principle
  • Similarly, 32 regions can be divided again according to the first division

Comparison table

  • Geohash precision and principle
5. Geohash defect

        The calculation steps of geohash are described above, but only what is it, not why? Why code longitude and dimension separately? Why is it necessary to cross and combine the longitude and latitude codes into one? This section attempts to answer this question.

        As shown in the figure, we fill in the result of binary coding into the space. When the space is divided into four blocks, the coding order is 00 in the lower left corner, 01 in the upper left corner, 10 in the lower right foot, and 11 in the upper right corner, which is a curve similar to Z. when we recursively decompose each block into smaller sub blocks, the coding order is self similar (fractal), and each sub block also forms a Z curve This type of curve is called Peano space filling curve.

        The advantage of this type of space filling curve is to transform the two-dimensional space into one-dimensional curve (in fact, fractal dimension). For most of them, the distance of similar codes is similar, but the biggest disadvantage of Peano space filling curve is mutation. Some codes are adjacent, but the distance is very different. For example, 0111 and 1000 codes are adjacent, but the distance is very different.


        In addition to Peano space filling curve, there are many space filling curves, as shown in the figure. Among them, Hilbert space filling curve is generally recognized as the best one. Compared with Peano curve, Hilbert curve has no big mutation. However, due to the Peano curve is more simple, when used with a certain solution, it can well meet most of the needs, so TD internal geohash algorithm uses Peano space filling curve.

6. Points for attention

    a. Because geohash divides the region into regular rectangles and encodes each rectangle, it will cause the following problems when querying the POI information nearby. For example, the red dot is our location, and the green dot is two restaurants nearby. However, when querying, we will find that the geohash code of the restaurant far away is the same as ours (because in the same geoh) The geohash code of the nearest restaurant is different from ours. The problem often arises at the border.


        The solution is very simple. When we query, we not only use geohash coding of the anchor point for matching, but also use geohash coding of the surrounding 8 regions, which can avoid this problem.

    b. We already know that the existing geohash algorithm uses Peano space filling curve, which will produce mutation, resulting in the problem that although the coding is similar, the distance may vary greatly. Therefore, when querying the nearby restaurants, we first filter the POI points with similar geohash coding, and then calculate the actual distance.

    c. Geohash base32 encoding length and accuracy. It can be seen that when the encoding length of geohash base32 is 8, the accuracy is about 19 meters, while when the encoding length is 9, the accuracy is about 2 meters, and the encoding length needs to be selected according to the data situation.


7. ~ ~ ~ calculates all geohashes in the fence

        After understanding the basic principle of geohash algorithm, this section will introduce a common scenario in practical application: calculating all geohash codes within the fence. The scene is encapsulated as a function, which can be expressed as follows: input the set of longitude and latitude coordinates of the points constituting the fence and the specified geohash length, and output a set of geohash codes.

        public static Set getHashByFence(List points, int length)

        The specific algorithm steps are as follows:

1. Enter the fence point coordinate set list points and the specified geohash length

2. Calculate the coordinates of the upper left corner and the lower right corner of the enclosure rectangle\_ min、lat\_ max、lng\_ min、lng\_ max

3. According to lat\_ min、lat\_ max、lng\_ min、lng\_ Max, calculate the distance D of the outer rectangle diagonal fixed point

4. Make a circle with the center of the outer rectangle as the center and D / 2 as the radius, and calculate the geohash within the circle coverage

    4.1 obtain the longitude and latitude of the fixed-point coordinates of the upper left corner and lower right corner of the outer rectangle of the circle, and store them in double \ [\] locs

    4.2 calculate the latitude and longitude interval (lata, lnga) of the geohash code according to the length of the geohash character

    4.3 according to lata and lnga, calculate the longitude and latitude of the upper left corner and the lower right corner of the rectangle composed of locs, and record the index (that is, the number) of the grid divided in geohash as lat respectively\_ min,lat\_ max,lng\_ min,lng\_ max

    4.4 calculation of Lat\_ min,lat\_ max,lng\_ min,lng\_ Max corresponds to the binary code of left and right geohash in the range, and then uncode the longitude and latitude binary code into the geohash character code, and save it as set sets

5. Eliminate the geohash in sets whose center point of the corresponding rectangle is not within the fence of points, and get the final geohash result set