[Elasticsearch] This Week in Elasticsearch and Apache Lucene - 2016-04-11
Elastic/Elasticsearch 2016. 4. 12. 09:59
I kept meaning to read this and am only now getting around to it.
Here is what caught my eye:
- The `match`, `match_phrase`, and `match_phrase_prefix` queries are now separate queries, not just types of the `match` query.
- The task manager response now tells you which tasks can be cancelled, and supports a `_cat/tasks` API.
- Elasticsearch will no longer accept unquoted field names in JSON.
- Now that we have removed the percolator API, we should also remove the percolator type and use percolator fields instead.
Java 8 is the minimum Java version required.
Dimensional points, replacing legacy numeric fields, provide fast and space-efficient support for both single- and multi-dimension range and shape filtering. This includes numeric (int, float, long, double), InetAddress, BigInteger and binary range filtering, as well as geo-spatial shape search over indexed 2D LatLonPoints. See this blog post for details. Dependent classes and modules (e.g., MemoryIndex, Spatial Strategies, Join module) have been refactored to use new point types.
Lucene classification module now works on Lucene Documents using a KNearestNeighborClassifier or SimpleNaiveBayesClassifier.
The spatial module no longer depends on third-party libraries. Previous spatial classes have been moved to a new spatial-extras module.
Spatial4j has been updated to a new 0.6 version hosted by locationtech.
TermsQuery performance boost by a more aggressive default query caching policy.
IndexSearcher's default Similarity is now changed to BM25Similarity.
Easier method of defining custom CharTokenizer instances.
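To make the BM25Similarity item above a little more concrete, here is a minimal Python sketch of the textbook BM25 score for a single term. The function names are made up for illustration, k1 = 1.2 and b = 0.75 are the commonly used defaults, and the sketch ignores implementation details such as how Lucene compresses document lengths, so read it as the formula and not as Lucene's code.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Inverse document frequency: rarer terms get a higher weight."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(freq: float, doc_len: float, avg_doc_len: float,
                    doc_count: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Textbook BM25 contribution of one term to one document's score.

    freq        -- how often the term occurs in the document
    doc_len     -- length of the document (in terms)
    avg_doc_len -- average document length in the index
    doc_count   -- number of documents in the index
    doc_freq    -- number of documents containing the term
    """
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: the fraction approaches (k1 + 1) as freq grows.
    tf_part = freq * (k1 + 1.0) / (freq + norm)
    return bm25_idf(doc_count, doc_freq) * tf_part


if __name__ == "__main__":
    for freq in (1, 2, 10, 100):
        print(freq, round(bm25_term_score(freq, 100, 100, 1000, 10), 4))
```

The point to notice is the saturation: going from 1 to 2 occurrences helps much more than going from 10 to 100, which is one of the main behavioral differences from classic TF-IDF scoring.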
(Original link)
Elasticsearch Core
Changes in 2.x:
- Extended Stats could return the wrong result when some indices are missing a field.
- Adding an object field with the same name as an existing field should fail.
- Shadow replicas should be considered as having size zero.
- CORS was broken for preflight requests.
- Windows users can configure the Windows service name, description, and user.
- Network addresses are now consistently displayed as the ip:port, instead of the hostname.
Changes in master:
- Network partitions will no longer cause loss of in flight documents, and we have the test to prove it.
- The `match`, `match_phrase`, and `match_phrase_prefix` queries are now separate queries, not just types of the `match` query.
- The task manager response now tells you which tasks can be cancelled, and supports a `_cat/tasks` API.
- Elasticsearch will no longer accept unquoted field names in JSON.
- Elasticsearch now uses mmapfs for Lucene directories instead of a hybrid of niofs/mmapfs.
- ParseField is now used to parse query names, which comes with deprecation logging for free.
- Geo-points support ignore_malformed correctly again.
- Moving averages threw an NPE when no window was specified.
- MappedFieldType should be responsible for knowing which formatter to apply, rather than the agg framework.
- The allocation-explain API now includes the configured allocation_delay and remaining_delays times.
- Hot threads now fail hard if the JVM doesn't support them.
- Queries now have a registry, and queries have gradually been migrated to use it.
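Two of the items above, the standalone `match_phrase` query and the rejection of unquoted field names, are easy to show side by side. The sketch below is in Python; the field name `message` and the search text are hypothetical, and Python's `json` module merely stands in for a strict JSON parser (it is not what Elasticsearch itself uses).

```python
import json

# Hypothetical field name and search text, for illustration only.
FIELD = "message"
TEXT = "quick brown fox"

# Older style: a phrase search expressed as a `type` of the `match` query.
legacy_query = {"query": {"match": {FIELD: {"query": TEXT, "type": "phrase"}}}}

# Newer style: `match_phrase` is a query in its own right.
phrase_query = {"query": {"match_phrase": {FIELD: TEXT}}}


def is_strict_json(text: str) -> bool:
    """Return True if `text` parses under a strict JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# json.dumps always double-quotes keys, so this body is strict JSON.
quoted = json.dumps(phrase_query)

# Hand-written body with unquoted field names: lenient parsers may accept
# this, strict ones do not.
unquoted = '{query: {match_phrase: {message: "quick brown fox"}}}'

if __name__ == "__main__":
    print(quoted)
    print("quoted body is strict JSON:  ", is_strict_json(quoted))
    print("unquoted body is strict JSON:", is_strict_json(unquoted))
```

Building request bodies from dictionaries and serializing them, instead of writing JSON by hand, sidesteps the unquoted-name problem entirely.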
Ongoing changes:
- Bulk request sizes will be subject to a circuit breaker.
- Deleted index tombstones are complicated.
- ObjectParser should allow constructor args.
- Should we enable http compression by default?
- Numeric and date fields in 5.0 should use the new Lucene points API.
- Now that we have removed the percolator API, we should also remove the percolator type and use percolator fields instead.
- Improvements to how we score the _all field based on per-field boosts.
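The bulk-request circuit breaker was still being worked on at this point, so its actual design is not described here. As a generic illustration of the byte-accounting idea behind a circuit breaker, here is a toy Python sketch; the class, method, and parameter names are all hypothetical and do not mirror Elasticsearch's implementation.

```python
class CircuitBreakingError(Exception):
    """Raised when accepting more bytes would exceed the configured limit."""


class ByteCircuitBreaker:
    """Toy breaker: tracks bytes in flight and rejects work past a limit."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def reserve(self, n_bytes: int) -> None:
        """Account for an incoming request, or refuse it up front."""
        if self.used_bytes + n_bytes > self.limit_bytes:
            raise CircuitBreakingError(
                f"would use {self.used_bytes + n_bytes} bytes, "
                f"limit is {self.limit_bytes}"
            )
        self.used_bytes += n_bytes

    def release(self, n_bytes: int) -> None:
        """Give bytes back once the request has been fully processed."""
        self.used_bytes = max(0, self.used_bytes - n_bytes)


if __name__ == "__main__":
    breaker = ByteCircuitBreaker(limit_bytes=100)
    breaker.reserve(60)          # first bulk request fits
    try:
        breaker.reserve(50)      # second one would push usage to 110
    except CircuitBreakingError as err:
        print("rejected:", err)
    breaker.release(60)          # first request finished
    breaker.reserve(50)          # now there is room again
    print("bytes in flight:", breaker.used_bytes)
```

The design choice worth noting is that the request is rejected before any memory is committed, which is what lets a node fail one oversized request without running out of memory.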
Apache Lucene
- The 6.0.0 release vote has passed and the bits were set free a few hours ago! Thank you Nick Knize for taking on the challenging role of release manager!
- Many `geo3d` improvements this week:
  - Polygon queries now accept `Polygon...` inputs, including random nested test polygons, matching our geo2d implementations and respecting the order of polygon vertices
  - `Geo3d` seems to sometimes incorrectly think a polygon is concave when it's really convex
  - Adjacent polygon points can now be coplanar
  - The unique `GeoPath` support, which matches all points within X distance of a specified path (think road trip, looking for sushi nearby), now has a simple factory API as well
  - Tests were not adequately testing the new simple factory methods for common shapes
  - `Geo3d` now uses a similar encode/decode quantization approach as `LatLonPoint`
  - After lively discussions, `geo3d` APIs no longer publicly expose classes and methods that could safely be private. APIs should start life private until proven worthy of being public!
- Many geo2d improvements as well:
  - `LatLonPoint` Polygon queries are faster using a cool pixelating grid approach, and we can do the same for `GeoPointField`
  - We must improve debuggability of our geo test failures with nice 3D earth models like this example
  - Here's a lively discussion about the pros and cons of having our geo tests quantize data only once
  - Quantization issues are tricky, and geo2d queries were quantizing the edges of box queries incorrectly, resulting in false positive hits
  - We have improved the geo2d tests to never allow "tolerance" on the returned results
  - We have moved common geo encoding APIs to core so they can be shared across implementations
  - Better random latitude/longitude generation for tests has exposed a tie-break bug in distance sorting, edge case bugs in box query, test bugs and polygon bugs
  - `Rectangle` and `Polygon` classes have graduated into Lucene's core, to enable sharing across our numerous geo implementations
  - A new encoding for `GeoPointField` will be consistent with `LatLonPoint`, and use all 64 available bits to minimize quantization error
  - `GeoPointField` gets an efficient distance sort
  - Randomized tests tried to create a too-big `GeoPointDistanceQuery`
  - We will move `BaseGeoPointTestCase` from the spatial module to `test-framework`, allowing us to remove the dependency of the sandbox module on spatial
  - `SloppyMath.haversin` can now move to `GeoUtils`
- The classification module now computes the f1-measure
- A previously commented out test assertion comes half way back to life
- Our "getting started with Lucene" docs were a bit buggy, but now fixed thanks to a user asking about it
- We've upgraded our randomizedtesting dependency to 2.3.4, so we get better messages when there is a static leak in our tests
- Points were missing from the `codecs` package documentation
- The `DataSplitter` in Lucene's classification module should pay attention to classes when splitting
- 800+ new top-level-domains have been created since we last fixed `StandardTokenizer` to detect them, but we may need to wait for a JFlex release
- Highlighting fails to find terms inside the child query of a `BlockJoinQuery`
- Lucene doesn't have direct support for boolean subset matching, but a number of possible workarounds may work
- `Math.toRadians` is changing its results slightly between Java 1.8 and 1.9
- `NRTCachingDirectory.listAll` sometimes throws `IllegalStateException`
- A scary random test failure is hopefully caused by bad hardware or buggy JVM
- `TestCoreParser` gets some small improvements
- A possibly new JVM bug causes JVM crash when decoding postings
- `JapaneseTokenizer` should do a better job validating custom user-provided dictionaries
- Another iteration for codec level encryption; this patch uses a new initialization vector for each data block, and seems not to impact search performance
- Our release scripts still struggle with the switch from Subversion to git
- Sometimes, `BooleanQuery`'s explain method can lie about its score
- Another user falls into the unfortunately common trap of thinking Lucene's stored fields store all information about a field
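Quantization and haversine distance come up repeatedly in the geo items above, so a small Python sketch may help. It encodes latitude and longitude into 32-bit integers (64 bits per point), decodes them again, and computes a haversine distance. The constants, function names, and earth radius are assumptions chosen for illustration; this is written in the spirit of the `LatLonPoint` encoding and `SloppyMath.haversin`, not copied from Lucene.

```python
import math

BITS = 32
INT_MIN = -(1 << (BITS - 1))
INT_MAX = (1 << (BITS - 1)) - 1
LAT_SCALE = (1 << BITS) / 180.0   # integer steps per degree of latitude
LON_SCALE = (1 << BITS) / 360.0   # integer steps per degree of longitude
EARTH_RADIUS_M = 6_371_008.8      # assumed mean earth radius in meters


def _quantize(value: float, scale: float) -> int:
    # Round down to the nearest step, then clamp into signed 32-bit range.
    return max(INT_MIN, min(INT_MAX, math.floor(value * scale)))


def encode_lat(lat: float) -> int:
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"invalid latitude: {lat}")
    return _quantize(lat, LAT_SCALE)


def encode_lon(lon: float) -> int:
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"invalid longitude: {lon}")
    return _quantize(lon, LON_SCALE)


def decode_lat(encoded: int) -> float:
    return encoded / LAT_SCALE


def decode_lon(encoded: int) -> float:
    return encoded / LON_SCALE


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Guard against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


if __name__ == "__main__":
    lat = 40.7128
    print("round-trip error (degrees):", lat - decode_lat(encode_lat(lat)))
    # A point just below a box edge can share the edge's quantized value.
    edge, just_outside = 10.0, 10.0 - 1e-9
    print("same cell:", encode_lat(edge) == encode_lat(just_outside))
```

The last two lines show why quantization is tricky: a point that is strictly outside a box in double precision can land in the same integer cell as the box edge, so comparisons done purely on encoded values may disagree with comparisons on the original coordinates.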