org.apache.solr.update.processor
Class TextProfileSignature
java.lang.Object
org.apache.solr.update.processor.Signature
org.apache.solr.update.processor.MD5Signature
org.apache.solr.update.processor.TextProfileSignature
public class TextProfileSignature
- extends MD5Signature
This implementation is copied from Apache Nutch.
An implementation of a page signature. It calculates an MD5 hash
of a plain text "profile" of a page.
The algorithm to calculate a page "profile" takes the plain text version of
a page and performs the following steps:
- remove all characters except letters and digits, and bring all characters
to lower case,
- split the text into tokens (all consecutive non-whitespace characters),
- discard tokens whose length is less than or equal to MIN_TOKEN_LEN (default 2 characters),
- sort the list of tokens by decreasing frequency,
- round down the counts of tokens to the nearest multiple of QUANT
(QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default and
maxFreq is the maximum token frequency). If maxFreq is higher than 1, then
QUANT is always at least 2 (which means that tokens with frequency 1 are
always discarded),
- discard tokens whose frequency after quantization falls below QUANT,
- create a list of tokens and their quantized frequency, separated by spaces,
in the order of decreasing frequency.
This list is then submitted to an MD5 hash calculation.
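The steps above can be sketched in standalone Java. This is an illustrative approximation, not the code of this class: the class name TextProfileSketch and its methods are hypothetical, and the sketch assumes that any character other than a letter or digit ends the current token, that ties in frequency are broken alphabetically, and that profile entries are joined with newlines.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TextProfileSketch {
    static final int MIN_TOKEN_LEN = 2;     // default stated in the description
    static final float QUANT_RATE = 0.01f;  // default stated in the description

    /** Builds the plain-text profile: one "token frequency" pair per line. */
    static String profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        int maxFreq = 0;
        StringBuilder cur = new StringBuilder();
        String padded = text + " "; // trailing separator flushes the last token
        for (int i = 0; i < padded.length(); i++) {
            char c = padded.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                cur.append(Character.toLowerCase(c));
                continue;
            }
            // Assumption: any other character ends the current token.
            if (cur.length() > MIN_TOKEN_LEN) {
                int n = counts.merge(cur.toString(), 1, Integer::sum);
                maxFreq = Math.max(maxFreq, n);
            }
            cur.setLength(0);
        }

        // QUANT = QUANT_RATE * maxFreq, raised to 2 whenever maxFreq > 1.
        int quant = Math.round(maxFreq * QUANT_RATE);
        if (quant < 2) {
            quant = (maxFreq > 1) ? 2 : 1;
        }

        List<Map.Entry<String, Integer>> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int q = (e.getValue() / quant) * quant; // round down to a multiple of QUANT
            if (q >= quant) {                       // drop counts that fall below QUANT
                kept.add(new AbstractMap.SimpleEntry<>(e.getKey(), q));
            }
        }
        // Decreasing frequency; alphabetical tie-break is an assumption for determinism.
        kept.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });

        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : kept) {
            if (out.length() > 0) out.append('\n');
            out.append(e.getKey()).append(' ').append(e.getValue());
        }
        return out.toString();
    }

    /** MD5 of the profile, rendered as a lowercase hex string. */
    static String signatureHex(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(profile(text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String a = "The quick brown fox. The quick brown fox jumps!";
        String b = "the QUICK brown fox, the quick brown fox leaps";
        System.out.println(profile(a));
        System.out.println(signatureHex(a).equals(signatureHex(b)));
    }
}
```

In this sketch the two sample strings differ only in case, punctuation, and one word that occurs once. Since maxFreq is 2, QUANT becomes 2 and the single-occurrence words are discarded, so both strings reduce to the same profile and the same MD5 value. This shows why the signature tolerates small differences between near-duplicate pages.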
Fields inherited from class org.apache.solr.update.processor.MD5Signature
log
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TextProfileSignature
public TextProfileSignature()
init
public void init(SolrParams params)
- Overrides:
init in class Signature
getSignature
public byte[] getSignature()
- Overrides:
getSignature in class MD5Signature
add
public void add(String content)
- Overrides:
add in class MD5Signature
Copyright © 2000-2012 Apache Software Foundation. All Rights Reserved.