Monday, November 18, 2013

Get TF and IDF of all the terms of an index [Lucene 4.3]

    Let's assume that you have indexed a number of documents with Lucene 4.3. The database created by Lucene is a "flat" database that holds a number of fields for every document. Each field contains the terms of a document and their respective frequencies in a termVector*. For those migrating from older versions of Lucene, extracting statistics like TF and IDF in Lucene 4.3 can seem a bit trickier. Newer versions of Lucene are indeed less intuitive, but on the other hand they are more flexible.
    First, we need a reader to access the index, a TFIDFSimilarity instance that will help us calculate the frequencies (tf, idf), a HashMap that will hold the scores (tf*idf), and a second HashMap for the term frequencies.

         IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
         TFIDFSimilarity tfidfSIM = new DefaultSimilarity();
         Map<String, Float> tf_Idf_Weights = new HashMap<>();
         Map<String, Float> termFrequencies = new HashMap<>();

    Second, to get the terms of every document we iterate over the documents of the index and, for each one, over the enumeration of its terms (TermsEnum). Two things to keep in mind:

 * Pay attention! The termVectors must be stored during indexing.
 * The terms are stored in the index as bytes (BytesRef), so they have to be converted back to strings.
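
Since terms come back as raw bytes, they have to be decoded before they can be used as map keys. Below is a minimal sketch in plain Java (no Lucene dependency), assuming the bytes are UTF-8, which is what BytesRef.utf8ToString() expects. The class and method names are made up for illustration; the three arguments mirror the bytes, offset and length fields of a BytesRef.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: decodes a slice of a byte array
// the same way a term's bytes are turned back into text.
final class TermBytesSketch {
    static String decode(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The term occupies only part of a larger shared buffer.
        byte[] buffer = "xxlucenexx".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(buffer, 2, 6));
    }
}
```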

Calculating the Inverse Document Frequencies: 

 First, we create a Map to hold the idf values:

 Map<String, Float> docFrequencies = new HashMap<>();
The function below is field-specific; the value is calculated while looping through the TermsEnum:

    /*** GET ALL THE IDFs ***/
    Map<String, Float> getIdfs(IndexReader reader, String field) throws IOException {
        TFIDFSimilarity tfidfSIM = new DefaultSimilarity();
        // To cover every field, call this once per name in MultiFields.getFields(reader)
        Terms terms = MultiFields.getTerms(reader, field);
        if (terms == null) {
            return docFrequencies; // the field does not exist or has no terms
        }
        TermsEnum termEnum = terms.iterator(null);
        BytesRef bytesRef;
        while ((bytesRef = termEnum.next()) != null) {
            String term = bytesRef.utf8ToString();
            // idf from the number of documents containing the term and the index size
            float idf = tfidfSIM.idf(termEnum.docFreq(), reader.numDocs());
            docFrequencies.put(term, idf);
        }
        return docFrequencies;
    }

In particular, the Lucene function that we use to get the inverse document frequency is:

   tfidfSIM.idf(termEnum.docFreq(), reader.numDocs())

In practice, it computes a score factor based on a term's document frequency (the number of documents which contain the term). This value is multiplied by the tf(float) factor for each term in the query.
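
To make the idf value concrete, here is a small sketch in plain Java (no Lucene dependency) of the formula that DefaultSimilarity applies in Lucene 4.x, namely 1 + ln(numDocs / (docFreq + 1)). The class name IdfSketch is made up for illustration, and other Similarity implementations may use a different formula.

```java
// Hypothetical stand-alone re-implementation, for illustration only.
final class IdfSketch {
    // Mirrors DefaultSimilarity.idf(long docFreq, long numDocs) in Lucene 4.x.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // A term found in 1 of 10 documents scores higher than one found in 9 of 10.
        System.out.println(idf(1, 10));
        System.out.println(idf(9, 10));
    }
}
```

With this formula, a term that appears in 9 of 10 documents gets an idf of exactly 1.0, because ln(10 / 10) is 0.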

Calculating the Term Frequencies:

    // The method name is illustrative; CONTENT is the name of the field of interest
    Map<String, Float> getTfIdfWeights(IndexReader reader, String field) throws IOException {
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            TermsEnum termsEnum = MultiFields.getTerms(reader, field).iterator(null);
            DocsEnum docsEnum = null;
            try {
                Terms vector = reader.getTermVector(docId, CONTENT);
                termsEnum = vector.iterator(termsEnum);
            } catch (NullPointerException e) {
                continue; // no term vector stored for this document
            }
            BytesRef bytesRef = null;
            while ((bytesRef = termsEnum.next()) != null) {
                String term = bytesRef.utf8ToString();
                float tf = 0;
                docsEnum = termsEnum.docs(null, null, DocsEnum.FLAG_FREQS);
                while (docsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                    tf = tfidfSIM.tf(docsEnum.freq());
                    termFrequencies.put(term, tf);
                }
                float idf = docFrequencies.get(term);
                float w = tf * idf;
                tf_Idf_Weights.put(term, w);
            }
        }
        return tf_Idf_Weights;
    }

    Lucene uses an inverted index, which means that finding the term frequencies of a given document is not as direct as one might think. The inverted index maps each term of a field to the list of documents that contain it, together with the term's frequency in each of those documents. This means that we can easily retrieve the number of matching documents for a certain term, but to get the tf we iterate through the DocsEnum and call
tf = tfidfSIM.tf(docsEnum.freq());
for every term.

  * The function freq() returns the term frequency in the current document *
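
The relationship between the inverted index, docFreq() and freq() can be illustrated with a toy model in plain Java. This is only a sketch of the idea, not how Lucene stores its postings: the class ToyInvertedIndex and its methods are invented for this example, and tf() copies the square-root formula that DefaultSimilarity uses in Lucene 4.x.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model: term -> (docId -> occurrences of the term in that document).
final class ToyInvertedIndex {
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    void add(int docId, String... terms) {
        for (String term : terms) {
            Map<Integer, Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new HashMap<>();
                postings.put(term, docs);
            }
            Integer old = docs.get(docId);
            docs.put(docId, old == null ? 1 : old + 1);
        }
    }

    // Number of documents containing the term: cheap, it is the size of the postings list.
    int docFreq(String term) {
        Map<Integer, Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }

    // Raw frequency of the term in one document (what freq() reports while iterating).
    int freq(String term, int docId) {
        Map<Integer, Integer> docs = postings.get(term);
        Integer f = docs == null ? null : docs.get(docId);
        return f == null ? 0 : f;
    }

    // Same formula as DefaultSimilarity.tf(float freq) in Lucene 4.x.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add(0, "lucene", "index", "lucene", "term", "lucene", "lucene");
        index.add(1, "index", "term");
        System.out.println(index.docFreq("lucene"));     // documents containing "lucene"
        System.out.println(tf(index.freq("lucene", 0))); // sqrt of its frequency in doc 0
    }
}
```

Under this formula, a term that occurs only once in a document gets tf = 1.0.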

    After calculating the tf and idf of every term, we can compute the weight (w = tf * idf) and store it. This way we can create a vector for each document that contains the respective weights of its terms, and from those vectors we can calculate the distance between documents.
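
As one way to compare two such vectors, the following sketch computes the cosine similarity between two term-weight maps in plain Java. The choice of cosine similarity and the names WeightVectorSketch and cosine are assumptions made for this example; the post does not say which distance measure is intended.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for comparing two tf*idf weight maps.
final class WeightVectorSketch {
    // Cosine similarity: dot(a, b) / (|a| * |b|); terms absent from a map count as weight 0.
    static double cosine(Map<String, Float> a, Map<String, Float> b) {
        double dot = 0.0;
        for (Map.Entry<String, Float> e : a.entrySet()) {
            Float other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Float> v) {
        double sum = 0.0;
        for (float w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Float> doc1 = new HashMap<>();
        doc1.put("lucene", 2.0f);
        doc1.put("index", 1.0f);
        Map<String, Float> doc2 = new HashMap<>();
        doc2.put("lucene", 2.0f);
        doc2.put("index", 1.0f);
        Map<String, Float> doc3 = new HashMap<>();
        doc3.put("cooking", 3.0f);
        System.out.println(cosine(doc1, doc2)); // identical vectors: close to 1
        System.out.println(cosine(doc1, doc3)); // no shared terms: 0
    }
}
```

A value near 1 means the two documents weight the same terms similarly, while 0 means they share no terms.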

If you want to take a look at a complete Java class implementation of this functionality, check out this post.


  1. In your code there is a variable "CONTENT" (in Calculating the Term Frequencies). Is it a parameter? Could you write out your complete method?
    Thank you for sharing.

    1. CONTENT is the name of the field that we are interested in.
      Note that MultiFields.getTerms(reader, field).iterator(null) is used to enable us to iterate the terms for all fields.
      So don't be troubled by this line:
      Terms vector = reader.getTermVector(docId, CONTENT);
      This was originally inside another loop in order to get the vectors from all fields:

      Terms vector = reader.getTermVector(docId, field);

  2. Thanks for the post. I am getting a null pointer exception at line:

    TermsEnum termsEnum = MultiFields.getTerms(reader, field).iterator(null);

    How can I resolve this? Thanks in advance.

  3. Hi, thanks for this post. I used your code, but I get 1.0 for all TF and IDF values, even though I have checked the index and there are terms that appear more than once.

    A similar problem seems to be asked here, with no answer:

    Can you guess what might be the problem? Thanks

  4. I noticed your try-catch declaration, and I want to know why the call reader.getTermVectors(docID) would return a null value, even for appropriate docID values.
    The version I am running your code on is 5.3.0.