public class ClassicSimilarity extends TFIDFSimilarity
Encodes norm values as a single byte before they are stored. At search time,
the norm byte value is read from the index directory and decoded back to a
float norm value.
This encoding/decoding, while reducing index size, comes with the price of
precision loss: it is not guaranteed that decode(encode(x)) = x. For
instance, decode(encode(0.89)) = 0.875.
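The round trip above can be reproduced with a small stand-alone sketch. It does not call Lucene; the class and method names are hypothetical, and the bit-packing constants are an assumption modeled on the single-byte format this page attributes to encodeNormValue (zero-exponent point at 15).

```java
// Hypothetical stand-alone sketch; not Lucene's own source.
class NormRoundTripSketch {

    // Assumed offset for the zero-exponent point at 15: (63 - 15) << 3.
    private static final int ZERO_POINT = (63 - 15) << 3;

    /** Packs a float into one byte by discarding its low 21 bits. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ZERO_POINT) {
            // Zero and negatives map to 0; tiny positives map to 1.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // too large: clamp to the largest code (0xFF)
        }
        return (byte) (small - ZERO_POINT);
    }

    /** Expands a byte produced by encode back into a float. */
    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xFF) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        float original = 0.89f;
        float restored = decode(encode(original));
        System.out.println(original + " -> " + restored); // prints "0.89 -> 0.875"
    }
}
```

Values that are already exactly representable, such as 1.0 or 0.5, survive the round trip unchanged; everything else is truncated to the nearest lower representable value.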
Compression of norm values to a single byte saves memory at search time, because once a field is referenced at search time, its norms (for all documents) are maintained in memory.
The rationale for such lossy compression of norm values is that, given how
difficult (and imprecise) it is for users to express their true information
need in a query, only big differences matter.
Last, note that search time is too late to modify this norm part of
scoring, e.g. by using a different Similarity for search.
Nested classes/interfaces inherited from class Similarity: Similarity.SimScorer, Similarity.SimWeight
Modifier and Type  Field and Description 

protected boolean 
discountOverlaps
True if overlap tokens (tokens with a position increment of zero) are
discounted from the document's length.

Constructor and Description 

ClassicSimilarity()
Sole constructor: parameter-free

Modifier and Type  Method and Description

float
coord(int overlap, int maxOverlap)
Implemented as overlap / maxOverlap.

float
decodeNormValue(long norm)
Decodes the norm value, assuming it is a single byte.

long
encodeNormValue(float f)
Encodes a normalization factor for storage in an index.

boolean
getDiscountOverlaps()
Returns true if overlap tokens are discounted from the document's length.

float
idf(long docFreq, long docCount)
Implemented as log((docCount+1)/(docFreq+1)) + 1.

Explanation
idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.

float
lengthNorm(FieldInvertState state)
Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap().

float
queryNorm(float sumOfSquaredWeights)
Implemented as 1/sqrt(sumOfSquaredWeights).

float
scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1.

void
setDiscountOverlaps(boolean v)
Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm.

float
sloppyFreq(int distance)
Implemented as 1 / (distance + 1).

float
tf(float freq)
Implemented as sqrt(freq).

String
toString()
Methods inherited from class TFIDFSimilarity: computeNorm, computeWeight, idfExplain, simScorer
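The closed-form factors in the method summary (coord, idf, queryNorm, sloppyFreq, tf) can be tried out in isolation with the sketch below. It is a hypothetical stand-alone class, not Lucene's source; it assumes that "log" means the natural logarithm and that the divisions are floating-point rather than integer divisions.

```java
// Hypothetical stand-alone sketch of the documented formulas; not Lucene's own source.
class ClassicFactorsSketch {

    /** overlap / maxOverlap, computed in floating point. */
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    /** log((docCount+1)/(docFreq+1)) + 1, assuming the natural logarithm. */
    static float idf(long docFreq, long docCount) {
        return (float) (Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0);
    }

    /** 1/sqrt(sumOfSquaredWeights). */
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    /** 1 / (distance + 1). */
    static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    /** sqrt(freq). */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        System.out.println(coord(1, 2));     // 0.5
        System.out.println(idf(9, 9));       // 1.0: a term present in every document
        System.out.println(queryNorm(4f));   // 0.5
        System.out.println(sloppyFreq(1));   // 0.5
        System.out.println(tf(4f));          // 2.0
    }
}
```

Under these formulas a rarer term gets a larger idf, and a term that occurs in every document still contributes a factor of 1 rather than 0.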
protected boolean discountOverlaps
public float coord(int overlap, int maxOverlap)
Implemented as overlap / maxOverlap.
Specified by: coord in class TFIDFSimilarity
Parameters:
overlap - the number of query terms matched in the document
maxOverlap - the total number of terms in the query
public float queryNorm(float sumOfSquaredWeights)
Implemented as 1/sqrt(sumOfSquaredWeights).
Specified by: queryNorm in class TFIDFSimilarity
Parameters:
sumOfSquaredWeights - the sum of the squares of query term weights
public final long encodeNormValue(float f)
The encoding uses a three-bit mantissa, a five-bit exponent, and the zero-exponent point at 15, thus representing values from around 7x10^9 to 2x10^-9 with about one significant decimal digit of accuracy. Zero is also represented. Negative numbers are rounded up to zero. Values too large to represent are rounded down to the largest representable value. Positive values too small to represent are rounded up to the smallest positive representable value.
Specified by: encodeNormValue in class TFIDFSimilarity
See Also: Field.setBoost(float), SmallFloat
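The rounding rules for out-of-range inputs can be checked with a short sketch. As before, this is a hypothetical stand-alone class that re-implements the packing by hand under the assumption that it matches the format described above; it does not call SmallFloat.

```java
// Hypothetical stand-alone sketch; not Lucene's own source.
class NormEdgeCaseSketch {

    // Assumed offset for the zero-exponent point at 15: (63 - 15) << 3.
    private static final int ZERO_POINT = (63 - 15) << 3;

    /** Packs a float into one byte, clamping values outside the representable range. */
    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> (24 - 3);
        if (small <= ZERO_POINT) {
            // Negative inputs have the sign bit set, so bits <= 0 covers them and zero.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // 0xFF, the largest code
        }
        return (byte) (small - ZERO_POINT);
    }

    public static void main(String[] args) {
        System.out.println(encode(0.0f));            // 0: zero is represented
        System.out.println(encode(-3.5f));           // 0: negatives are rounded up to zero
        System.out.println(encode(1e-20f));          // 1: smallest positive code
        System.out.println(encode(Float.MAX_VALUE)); // -1 (0xFF): largest code
    }
}
```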
public final float decodeNormValue(long norm)
Decodes the norm value, assuming it is a single byte.
Specified by: decodeNormValue in class TFIDFSimilarity
See Also: encodeNormValue(float)
public float lengthNorm(FieldInvertState state)
Implemented as state.getBoost()*lengthNorm(numTerms), where numTerms is FieldInvertState.getLength() if setDiscountOverlaps(boolean) is false, else it's FieldInvertState.getLength() - FieldInvertState.getNumOverlap().
Specified by: lengthNorm in class TFIDFSimilarity
Parameters:
state - statistics of the current field (such as length, boost, etc)
public float tf(float freq)
Implemented as sqrt(freq).
Specified by: tf in class TFIDFSimilarity
Parameters:
freq - the frequency of a term within a document
public float sloppyFreq(int distance)
Implemented as 1 / (distance + 1).
Specified by: sloppyFreq in class TFIDFSimilarity
Parameters:
distance - the edit distance of this sloppy phrase match
See Also: PhraseQuery.getSlop()
public float scorePayload(int doc, int start, int end, BytesRef payload)
The default implementation returns 1.
Specified by: scorePayload in class TFIDFSimilarity
Parameters:
doc - The docId currently being scored.
start - The start position of the payload
end - The end position of the payload
payload - The payload byte array to be scored
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Description copied from class: TFIDFSimilarity
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, docCount);
Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
Overrides: idfExplain in class TFIDFSimilarity
Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
public float idf(long docFreq, long docCount)
Implemented as log((docCount+1)/(docFreq+1)) + 1.
Specified by: idf in class TFIDFSimilarity
Parameters:
docFreq - the number of documents which contain the term
docCount - the total number of documents in the collection
public void setDiscountOverlaps(boolean v)
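Determines whether overlap tokens (tokens with 0 position increment) are ignored when computing norm.
See Also: TFIDFSimilarity.computeNorm(org.apache.lucene.index.FieldInvertState)
public boolean getDiscountOverlaps()
Returns true if overlap tokens are discounted from the document's length.
See Also: setDiscountOverlaps(boolean)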
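The effect of the discountOverlaps flag on the length norm can be sketched as follows. This is a hypothetical stand-alone class: the parameters boost, length and numOverlap stand in for FieldInvertState.getBoost(), getLength() and getNumOverlap(), and the inner lengthNorm(numTerms) is assumed to be 1/sqrt(numTerms), which this page does not spell out.

```java
// Hypothetical stand-alone sketch; not Lucene's own source.
class LengthNormSketch {

    /**
     * boost * lengthNorm(numTerms), where overlap tokens are subtracted
     * from the field length only when discountOverlaps is true.
     * Assumption: lengthNorm(numTerms) = 1 / sqrt(numTerms).
     */
    static float lengthNorm(float boost, int length, int numOverlap, boolean discountOverlaps) {
        int numTerms = discountOverlaps ? length - numOverlap : length;
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // A field with 20 tokens, 4 of which are overlaps (position increment 0).
        System.out.println(lengthNorm(1f, 20, 4, true));  // 0.25: counted as 16 terms
        System.out.println(lengthNorm(1f, 20, 4, false)); // smaller: counted as 20 terms
    }
}
```

With overlaps discounted, a field that gains synonym tokens at existing positions keeps the same norm as the field without them.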
Copyright © 2000-2017 Apache Software Foundation. All Rights Reserved.