Use ESAlib when you need to find out how semantically similar two pieces of text are (whether they are just two words or two whole articles).
ESA is currently the state-of-the-art method for comparing semantic similarity of texts.
Download the repo with the pre-built jar and database
$ git clone https://github.com/ticcky/esalib.git
$ cd esalib
Create a symbolic link to the sample database
$ ln -s example/esa_en.db esa_db.db
Get relatedness estimate of two texts:
$ ./run_analyzer "computer" "apple"
The tool is verified to yield good results (meaning correlation with human judgement as reported in the original ESA paper) with the provided prebuilt English Wikipedia ESA background from 2005. I have not had success building the ESA background from the recent dumps of Wikipedia. Please let me know if you manage.
Just email me for help ;)
Occasionally I get questions about ESAlib by email, and in this section I publish the ones that I think other ESAlib users might find useful.
Is it possible to get the ESA vectors of each text as output, and not only the relatedness estimate between them?
Yes, it is. Look into src/clldsystem/esa/IConceptVector.java, which is the interface for the object that represents an ESA vector. An instance of it is returned by the getVector method in src/clldsystem/esa/ESAAnalyzer.java, which gets the ESA vector for the given piece of text. getVector is also used internally by the similarity calculator. With the default ESA background that I provided, the dimensions correspond to page IDs on Wikipedia (e.g. dimension #11400 -> http://en.wikipedia.org/wiki/?curid=11400).
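To make the idea of an ESA vector more concrete, here is a minimal, self-contained sketch that does not call ESAlib at all. It assumes a vector can be viewed as a sparse map from Wikipedia page ID to weight, and it compares two such maps with cosine similarity, which is a common choice for ESA; whether ESAlib's own similarity calculator computes exactly this is an assumption. The class and method names (ConceptVectorSketch, conceptUrl, cosine) and the weights are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: not part of ESAlib and not its API.
public class ConceptVectorSketch {

    // Builds the Wikipedia URL for a dimension, assuming the dimension is a page ID.
    public static String conceptUrl(int pageId) {
        return "http://en.wikipedia.org/wiki/?curid=" + pageId;
    }

    // Cosine similarity of two sparse vectors stored as (page ID -> weight) maps.
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared dimensions contribute
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty vector is treated as unrelated to everything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up weights; only the page ID 11400 comes from the text above.
        Map<Integer, Double> first = new HashMap<>();
        first.put(11400, 0.8);
        first.put(1, 0.2);

        Map<Integer, Double> second = new HashMap<>();
        second.put(11400, 0.5);
        second.put(2, 0.5);

        System.out.println("Shared concept: " + conceptUrl(11400));
        System.out.println("Cosine similarity: " + cosine(first, second));
    }
}
```

A sparse map is used because an ESA vector has one dimension per Wikipedia article, and most of them are zero for any given text.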
Assertion failed: (jBlob), function Java_org_sqlite_NativeDB_column_1blob, file ../src/main/java/org/sqlite/NativeDB.c, line 513.
./run_analyzer: line 7: 4119 Abort trap: 6 java -cp lib/*:esalib.jar clldsystem.esa.ESAAnalyzer "$1" "$2"
Check that the background database (.db) resides in a directory where you have write permissions.
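If you want to verify this programmatically, the following sketch may help. It uses only the standard java.nio.file API and is not part of ESAlib; the class name DbDirCheck and the default file name esa_db.db (the symbolic link created in the setup steps) are assumptions. When the path is a symbolic link, the check is done on the directory of the link's target.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch only: checks whether the directory holding the database is writable.
public class DbDirCheck {

    public static boolean directoryIsWritable(Path dbFile) {
        Path resolved;
        try {
            // toRealPath() follows symbolic links, so a link such as esa_db.db
            // is checked against the directory of the file it points to.
            resolved = Files.exists(dbFile) ? dbFile.toRealPath() : dbFile.toAbsolutePath();
        } catch (IOException e) {
            return false; // the path could not be resolved
        }
        Path dir = resolved.getParent();
        return dir != null && Files.isWritable(dir);
    }

    public static void main(String[] args) {
        Path db = Paths.get(args.length > 0 ? args[0] : "esa_db.db");
        if (directoryIsWritable(db)) {
            System.out.println("Directory of " + db + " is writable.");
        } else {
            System.out.println("Directory of " + db + " is NOT writable.");
        }
    }
}
```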
Source code is partially based on the original implementation of Evgeniy Gabrilovich.
I was working under supervision of Petr Knoth.