[Elasticsearch] Tool for verifying inconsistent data
Elastic/Elasticsearch 2013. 4. 17. 14:41
Original: https://github.com/Aconex/scrutineer
Analyses a secondary stream of information against a known point-of-truth and reports inconsistencies.

The Why

When you have a Lucene-based index of substantial size, say many hundreds of millions of records, what you want is confidence that your index is correct. In many cases, people use Solr/ElasticSearch/Compass to index their central database, mongodb, hbase etc, so the index is a secondary storage of data.

How do you know if your index is accurate? Can you just reindex 500 million documents anytime you like? (That's the Aliens: "Nuke the site from Orbit... It's the only way to be sure" approach). No. If there ARE inconsistencies in your index, you want to find exactly those items and fix them. Scrutineer has been designed with this in mind: it can find any inconsistencies in your index fast.

How does this work?

Scrutineer relies on your data having 2 core properties: an ID and a Version.

The Version property is commonly used in an Optimistic Locking pattern. If you store the ID & Version information in your secondary store (say, Solr/ElasticSearch) then you can always check, for any given item, whether the version in the secondary store is up to date.

Scrutineer takes a stream from your primary store and a stream from your secondary store, presumes they are sorted identically (more on that later) and walks the streams doing a merge comparison. It detects 4 states:

1. Both items are identical.
2. An ID is missing from the secondary stream (reported as NOTINSECONDARY).
3. An ID is in the secondary stream but not in the primary stream (reported as NOTINPRIMARY).
4. The ID exists in both streams but the Version values differ (reported as MISMATCH).

Example

Here's an example: 2 streams in sorted order, one from the Database (your point-of-truth) and one from ElasticSearch (the one you're checking), with the ID:Version for each side:

    Database    ElasticSearch
    1:12345     1:12345
    2:23455
    3:84757     3:84757
    4:98765     4:98765
                5:38475
    6:34666     6:34556

Scrutineer picks up that:

- ID 2 is missing from ElasticSearch
- ID 5 is in ElasticSearch but not in the Database
- ID 6 is in both, but ElasticSearch has a different version (34556 instead of 34666)

Running Scrutineer

The very first thing you'll need to do is get your JDBC Driver jar and place it in the 'lib' directory of the unpacked package. We already have a JTDS driver in there if you're using SQL Server (that's just what we use).

    bin/scrutineer \
        --jdbcURL=jdbc:jtds:sqlserver://mydbhost/mydb \
        --jdbcDriverClass=net.sourceforge.jtds.jdbc.Driver \
        --jdbcUser=itasecret \
        --jdbcPassword=itsasecret \
        --sql="select id,version from myobjecttype order by cast(id as varchar(100))" \
        --clusterName=mycluster \
        --indexName=myindex \
        --query="_type:myobjecttype" \
        --numeric

Note: if you're weirded out about that '...cast(...)' then don't worry, we'll explain that shortly.

Output

Scrutineer writes any inconsistencies direct to Standard Error, in a well-defined, tab-separated format for easy parsing to feed into a system to reindex/cleanup. If we use the Example scenario above, this is what Scrutineer would print out:

    NOTINSECONDARY 2 23455
    MISMATCH 6 34666 secondaryVersion=34556
    NOTINPRIMARY 5 38475

The general format is:

    FailureType\tID\tVERSION\tOptional:Additional Info

NOTINSECONDARY

This means you are missing this item in your secondary and you should reindex/re-add it to your secondary stream.

MISMATCH

This means the version of the object stored in the secondary is not the same as the version in the primary, and you should reindex.

NOTINPRIMARY

The object was removed from the Primary store, but the secondary still has it. You should remove this item from your secondary.

Scrutineer does not report when items match, we'll presume you're just fine with that...

Memory

By default, Scrutineer allocates 256m to the Java Heap, which is used for sorting and for ElasticSearch result buffers. This should be more than enough for the majority of cases, but if you get an OutOfMemoryError you can override the JAVA_OPTS environment variable to provide more heap, e.g.

    export JAVA_OPTS=-Xmx1048m

Sorting

VERY IMPORTANT: Scrutineer relies on both streams being sorted using an identical mechanism. It requires input streams to be in lexicographical (default) or numerical (indicate using --numeric) sort order.

ElasticSearch

Since Aconex uses ElasticSearch, Scrutineer supports ES out of the box, but it would not be difficult for others to integrate a Solr stream and wire something up. Happy to take Pull Requests!

What are the 'best practices' for using Scrutineer?

The authors of Scrutineer, Aconex, index content from a JDBC data source and index using ElasticSearch. We do the following:

Assumptions

JDBC Drivers

Scrutineer ships with the SQL Server JTDS driver by default (it's what we use). All you should need to do is drop your own JDBC driver in the 'repo' sub-directory of the Scrutineer distribution (where all the other jars are). We use the Maven AppAssembler plugin, which is configured to automatically load all JARs in this path onto the classpath.

Building

Scrutineer is a Maven project, which really should just build right out of the box if you have Maven installed. Just type:

    mvn package

And you should have a Tarball in the 'target' sub-directory.

Submitting Pull Requests

First, please add unit tests!

Second, please add integration tests!

Third, we have tightened up the quality rule set for CheckStyle, PMD etc pretty hard. Before you issue a pull request, please run:

    mvn verify

which will run all quality checks. Sorry to be super-anal, but we just like Clean Code.

Roadmap

- Scrutineer currently only runs in a single thread based on a single stream. It would be good to provide a 'manifest' to Scrutineer to outline a set of stream verifications to perform, perhaps one for each type you have, so that your multi-core system can perform multiple stream comparisons in parallel.
- Incremental checking: right now Scrutineer checks the whole thing, BUT if you are using timestamp-based versions, there's no reason it couldn't check only the objects that were changed after the last known full verification. This would require one to keep track of deletes on the primary stream (perhaps an OnDelete Trigger in your SQL database) so that IDs that were deleted in the primary stream after the last full check could be detected correctly.
- Obviously we'd love to have a Solr implementation here, we hope the community can help.
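
For readers who want to see the merge comparison from "How does this work?" in miniature, here is a rough Python sketch. It is not Scrutineer's actual code (Scrutineer is a Java tool); the function name `scrutinise`, the `(id, version)` tuple representation and the `numeric` parameter are assumptions made for illustration. It only mirrors the idea described above: walk two identically sorted streams and emit tab-separated lines for the three failure types.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical record shape for this sketch: (id, version).
Record = Tuple[str, int]


def scrutinise(primary: Iterable[Record],
               secondary: Iterable[Record],
               numeric: bool = False) -> Iterator[str]:
    """Walk two identically sorted streams and yield one line per inconsistency.

    IDs are compared as strings (lexicographical order) unless numeric=True,
    in which case they are compared as integers.
    """
    key = int if numeric else str
    p_iter, s_iter = iter(primary), iter(secondary)
    p, s = next(p_iter, None), next(s_iter, None)

    while p is not None or s is not None:
        if s is None or (p is not None and key(p[0]) < key(s[0])):
            # Primary has an ID that the secondary has not reached: missed add.
            yield f"NOTINSECONDARY\t{p[0]}\t{p[1]}"
            p = next(p_iter, None)
        elif p is None or key(s[0]) < key(p[0]):
            # Secondary has an ID that the primary does not: missed delete.
            yield f"NOTINPRIMARY\t{s[0]}\t{s[1]}"
            s = next(s_iter, None)
        else:
            # Same ID on both sides: only report if the versions differ.
            if p[1] != s[1]:
                yield f"MISMATCH\t{p[0]}\t{p[1]}\tsecondaryVersion={s[1]}"
            p, s = next(p_iter, None), next(s_iter, None)


if __name__ == "__main__":
    # The two streams from the Example section above.
    database = [("1", 12345), ("2", 23455), ("3", 84757),
                ("4", 98765), ("6", 34666)]
    elasticsearch = [("1", 12345), ("3", 84757), ("4", 98765),
                     ("5", 38475), ("6", 34556)]
    for line in scrutinise(database, elasticsearch, numeric=True):
        print(line)
```

Run against the Example streams, the sketch reports IDs 2, 5 and 6, in the order the walk encounters them. Each stream is read once and only the current record from each side is held in memory, which is why both sides must be sorted with the same rule: if one side were sorted numerically and the other lexicographically, the walk would report spurious differences.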