Sunday, July 14, 2019

An Example Of An Autosuggest Request






Suggester - a flexible "autocomplete" component

A common need in search applications is suggesting query terms or phrases based on incomplete user input. These completions may come from a dictionary that is based on the main index or on any other arbitrary dictionary. It's often useful to provide only the top-N suggestions, ranked either alphabetically or according to their usefulness for an average user (e.g. popularity, or the number of returned results).

Solr 3.1 includes a component called Suggester that provides this functionality. Available Lookup implementations include:

FSTLookup - automaton-based representation; slower to build, but consumes far less memory at runtime (see performance notes below).
WFSTLookup - weighted automaton representation; an alternative to FSTLookup for more fine-grained ranking.

For practical purposes all of these implementations will most likely run at similar speed when requests are made via the HTTP stack (which will become the bottleneck).
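To make the top-N idea concrete, here is a minimal sketch in Python of prefix completion over a small in-memory dictionary, with results ranked either alphabetically or by weight. It is not how Suggester is implemented; the function name `suggest` and the sample terms and weights are made up for illustration.

```python
from bisect import bisect_left

def suggest(entries, prefix, n, by_weight=True):
    """Return up to n (term, weight) pairs whose term starts with prefix.

    entries is a list of (term, weight) tuples. This is an illustrative
    sorted-list lookup, not one of Solr's Lookup implementations.
    """
    terms = sorted(entries)                  # alphabetical order by term
    start = bisect_left(terms, (prefix,))    # first term >= prefix
    matches = []
    for term, weight in terms[start:]:
        if not term.startswith(prefix):
            break                            # sorted, so no later term matches
        matches.append((term, weight))
    if by_weight:
        # Highest weight first; ties fall back to alphabetical order.
        matches.sort(key=lambda tw: (-tw[1], tw[0]))
    return matches[:n]

# Hypothetical sample data.
SAMPLE = [("accidentally", 2.0), ("accommodate", 3.0),
          ("acquire", 27.0), ("basic", 1.0)]
```

Calling `suggest(SAMPLE, "ac", 2)` ranks by weight, while `suggest(SAMPLE, "ac", 2, by_weight=False)` keeps alphabetical order.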





Direct benchmarks of these classes indicate that (W)FSTLookup provides better performance compared to the other two methods, at a much lower memory cost. JaspellLookup can provide "fuzzy" suggestions, though this functionality is not currently exposed (it's a one-line change in JaspellLookup). Support for infix suggestions is planned for FSTLookup (which would be the only structure to support these).

An example of an autosuggest request: http://localhost:8983/solr/suggest?

A few common configuration parameters:

lookupImpl - Lookup implementation, e.g. JaspellLookupFactory, a more complex lookup based on a ternary trie from the JaSpell project.
buildOnCommit - if set to true then the Lookup data structure will be rebuilt after commit. NOTE: currently implemented Lookup-s keep their data in memory, so unlike spellchecker data this data is discarded on core reload and not available until you invoke the build command, either explicitly or implicitly via commit.
sourceLocation - location of the dictionary file. If not empty then this is a path to a dictionary file (see below). If this value is empty then the main index will be used as a source of terms and weights.
field - if sourceLocation is empty then terms from this field in the index will be used when building the trie.
storeDir - where to store the index data on the disk (else use in-memory).

When a file-based dictionary is used (non-empty sourceLocation parameter above) it is expected to be a plain text file in UTF-8 encoding. Each line holds either a string without a literal TAB character, or a string and a TAB-separated floating-point weight. The format of the file is not limited to single terms but can also contain phrases, which is an improvement over the TermsComponent that you could also use for a simple version of autocomplete functionality.

FSTLookup has a built-in mechanism to discretize weights into a fixed set of buckets (to speed up suggestions). The number of buckets is configurable.
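The dictionary file format described above can be illustrated with a small parser. This is only a sketch in Python, not Solr's own reader: the function name `parse_dictionary`, the default weight of 1.0 for lines without a weight, and the skipping of blank lines are all assumptions made for the example.

```python
def parse_dictionary(text, default_weight=1.0):
    """Parse dictionary text into a list of (term, weight) pairs.

    Each non-blank line is either a term (or phrase) on its own, or a term
    followed by a TAB and a floating-point weight. default_weight is an
    assumed value for lines that carry no weight.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue                          # assumption: blank lines are skipped
        term, sep, weight = line.partition("\t")
        entries.append((term, float(weight) if sep else default_weight))
    return entries

# Hypothetical file content: a bare term, a weighted term, a weighted phrase.
SAMPLE_TEXT = "acquire\naccidentally\t2.0\nnew york city\t3.5\n"
```

Note that the phrase "new york city" is kept as a single entry, since only the TAB separates a term from its weight.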

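The request URL quoted earlier stops at the query string. As a rough sketch of how such a request might be assembled in Python, the snippet below assumes the Suggester is queried through spellcheck-style parameters (`spellcheck`, `spellcheck.q`, `spellcheck.count`). Those parameter names and the helper `build_suggest_url` are assumptions for illustration and are not taken from the text above; check them against your own request handler configuration.

```python
from urllib.parse import urlencode

def build_suggest_url(prefix, count=5,
                      base="http://localhost:8983/solr/suggest"):
    """Build a suggest request URL for the given partial user input.

    The parameter names below are assumed, not confirmed by the text:
    they follow the spellcheck-style request parameters.
    """
    params = {
        "spellcheck": "true",       # assumed: enables the component
        "spellcheck.q": prefix,     # assumed: the incomplete user input
        "spellcheck.count": count,  # assumed: number of suggestions (top-N)
    }
    return base + "?" + urlencode(params)
```

`urlencode` takes care of escaping, so a phrase prefix such as "new y" is sent as `new+y`.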




WFSTLookup does not use buckets, but instead a shortest-path algorithm. Note that it expects weights to be whole numbers.

As mentioned above, if the sourceLocation parameter is empty then the terms from the field indicated by the field parameter are used. It's often the case that, due to imperfect source data, there are many uncommon or invalid terms that occur only once in the whole corpus (e.g. OCR errors, typos, etc). According to Zipf's law these actually form the majority of terms, which means that a dictionary built indiscriminately from a real-life index would consist mostly of uncommon terms, and its size would be enormous.

To avoid this and to reduce the size of in-memory structures it's best to set the threshold parameter to a value slightly above zero (for example 0.5%). This already vastly reduces the size of the dictionary by skipping "hapax legomena" while still preserving most of the common terms. This parameter has no effect when using a file-based dictionary, since it's assumed that only useful terms are found there.
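The effect of the threshold can be sketched in a few lines of Python. This assumes the threshold is interpreted as the minimum fraction of documents a term must appear in; that interpretation, the function name `build_dictionary`, and the toy corpus are assumptions for illustration, not a description of Solr's internals.

```python
from collections import Counter

def build_dictionary(docs, threshold):
    """Keep only terms found in at least `threshold` (a fraction) of docs.

    docs is an iterable of term lists, one per document. Returns a dict
    mapping each surviving term to its document frequency.
    """
    docs = list(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))       # count each term once per document
    min_docs = threshold * len(docs)
    return {t: df for t, df in doc_freq.items() if df >= min_docs}

# Toy corpus: "serach" is a typo that occurs in a single document.
CORPUS = [["solr", "search"], ["solr", "suggest"],
          ["solr", "serach"], ["search", "solr"]]
```

With only four documents the example needs a large threshold such as 0.5 to drop the one-off terms; on a real index a far smaller fraction has the same effect, because rare terms make up most of the vocabulary.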
