e-LiSe – an online tool for finding needles in the “(Medline) haystack”
Arek Gladki1, Pawel Siedlecki1, Szymon Kaczanowski1, Piotr Zielenkiewicz
Bioinformatics Department, Institute of Biochemistry and Biophysics, Polish Academy of Sciences, ul.
Pawinskiego 5a, 02-106 Warszawa, Poland
Plant Molecular Biology Department, Warsaw University, Warszawa, Poland
*To whom correspondence should be addressed; email: [email protected]
Summary: Using literature databases one can find not only known
and true relations between processes but also less studied, non-obvious
associations. The main problem with discovering this type
of relevant biological information is “selection”. The ability to
distinguish a true correlation (e.g. between different types
of biological processes) from one that appears statistically
significant only by chance is crucial for any bio-medical research,
literature mining being no exception. This problem is especially
visible when searching for information which has not been studied
and described in many publications. Therefore a novel bio-linguistic
statistical method is required, capable of “selecting” true
correlations, even when they are low-frequency associations.
In this paper we present such a statistical approach, based on the Z-score
and implemented in a web-based application, “e-LiSe”.
Contact: [email protected]
One of the most useful ways to extract information is based on the
analysis of co-occurrence of keywords. A fine implementation of
this methodology is used by EBIMed [1]. EBIMed can retrieve
abstracts matching a defined query, list over-represented words and
additionally search for associations with other online sources of
information. An extended approach is utilized by iHOP [2], which
uses special dictionaries of gene/protein synonyms. The results
page provides options to use gene/protein names for gene
description and also for extracting interaction information. A
successful modification of the frequency co-occurrence methodology,
based on fuzzy binary relation formalism, is used by XplorMed
[3]. The software analyses queries to detect dependency relations
between words in a set of abstracts. The output contains
information about the “power” of association between words.
MEDIE [4] is a search engine that uses natural language
processing (NLP) for semantic and keyword searches. It shows
sentences containing query keywords, and in the case of semantic
search also an appropriate verb. However the user is provided only
with words which co-occur in the same sentence as the query,
without statistical information about their significance.
© Oxford University Press 2007
Although the services described above are fine examples of
information extraction based on frequency co-occurrence, they all
have shortcomings. Some of them tend to correlate words
describing broad functions with user queries (e.g. mitosis can be
correlated with alcohol, secretion or pregnancy). Numerous broad
keywords can overshadow more specific associations, which might be
more revealing for the researcher. On the other hand, systems
which provide a large set of very specific words are hard to
navigate and somewhat overwhelming, which hinders a good overview.
The information extraction field is also hampered by the problem
of detecting associations between a user query and words which
have a very low frequency in Medline abstracts. There is a lack of
software able to show the statistical over-representation
of an English word when it does not occur in many
abstracts from a given set.
To deal with these problems, we designed e-LiSe (e-Literature
Search). e-LiSe is a web-based application which can discover
biomedically meaningful, over-represented and specific words
using a Z-score statistical approach. It is especially useful for
detecting low-frequency associations between a user query and
keywords such as drugs and genes or less obvious correlations
The e-LiSe software uses data from the <author>, <title> and
<abstract> fields, extracted from XML files released by Medline
(NLM FTP). The data is preprocessed, filtered and loaded into
a MySQL database. There is a small difference, on average 2%, between
the e-LiSe and PubMed databases due to the filtering step (which
excludes duplicates, comments, retractions, etc.). Also, the data
available by FTP is older than the data presented in PubMed online.
To decrease the differences we automatically update the e-LiSe
databases weekly.
When the user provides query word(s), the system finds all
abstracts which contain them and extracts additional words which
are over-represented in this set. The key parameter for statistical
analysis, the background probability of a given word's occurrence, is
derived from the number of abstracts containing this word in the Medline
database. Using this parameter, the statistical significance of
an extracted word's over-representation is measured as a Z-score. As
Z-score calculations do not require extensive computation, the e-LiSe
interface is fast and responsive, improving user experience.
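The text does not spell out the Z-score formula, so the following is only a minimal sketch of how such a score could be computed. It assumes a binomial null model in which the background probability is the fraction of all Medline abstracts containing the word; the function name `z_score` and all counts are hypothetical.

```python
import math

def z_score(hits_in_query_set, query_set_size, abstracts_with_word, total_abstracts):
    """Z-score of a word's over-representation in the query set.

    Assumption (not stated in the paper): a binomial null model whose
    success probability is the word's background frequency in Medline.
    """
    p = abstracts_with_word / total_abstracts           # background probability
    expected = query_set_size * p                       # expected hits by chance
    std_dev = math.sqrt(query_set_size * p * (1.0 - p)) # binomial standard deviation
    if std_dev == 0.0:
        return 0.0                                      # word absent or present everywhere
    return (hits_in_query_set - expected) / std_dev

# Hypothetical counts: a rare word seen 12 times versus a broad word seen 900 times
# in a query set of 10,000 abstracts drawn from 15,000,000 abstracts in total.
rare = z_score(12, 10_000, 600, 15_000_000)
broad = z_score(900, 10_000, 1_200_000, 15_000_000)
print(rare > broad)
```

Under these assumed numbers the rare word scores higher than the broad one even though it occurs far less often, which is the behaviour the method relies on for low-frequency associations.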
For words which were found fewer than 5 times in a given query set,
the Z-score is less informative. Therefore such words (if any) are
presented in a separate table as “suspected to be correlated”.
Additionally, a table with the most frequently observed words in
the given set of abstracts is created. These data seem to be the
least meaningful source of information.
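The three tables described above can be sketched as a simple partition of the scored words. Only the threshold of 5 hits comes from the text; the function name `split_tables`, the input dictionaries and the example values are hypothetical.

```python
from collections import Counter

MIN_HITS = 5  # threshold stated in the text

def split_tables(hit_counts, z_scores, top_n=10):
    """Partition words into the three result tables.

    hit_counts: word -> number of query-set abstracts containing the word
    z_scores:   word -> Z-score of the word (assumed to be precomputed)
    """
    def by_z(word):
        return z_scores[word]

    # Words with enough hits for the Z-score to be informative.
    correlated = sorted((w for w, k in hit_counts.items() if k >= MIN_HITS),
                        key=by_z, reverse=True)
    # Words found fewer than MIN_HITS times: "suspected to be correlated".
    suspected = sorted((w for w, k in hit_counts.items() if k < MIN_HITS),
                       key=by_z, reverse=True)
    # Raw frequency ranking, regardless of significance.
    most_frequent = [w for w, _ in Counter(hit_counts).most_common(top_n)]
    return correlated, suspected, most_frequent

# Hypothetical example values.
hits = {"the": 950, "nucleosome": 40, "pcaf": 7, "rpd3": 3}
scores = {"the": 0.2, "nucleosome": 35.0, "pcaf": 21.5, "rpd3": 12.0}
correlated, suspected, most_frequent = split_tables(hits, scores, top_n=2)
```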
e-LiSe is also capable of finding researchers' names correlated with
the information searched by the user. It uses the list of co-authors
of each publication a researcher has written to display connections
between scientists. It can function as a name reference engine,
answering questions like “who is working on a specified subject?” or
“who are the coworkers/collaborators of a certain person?”.
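One way to derive such connections from per-publication author lists is a co-authorship count, sketched below. This is an illustrative assumption rather than the e-LiSe implementation; `coauthor_map`, `collaborators` and the author names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def coauthor_map(publications):
    """Count shared publications for every pair of authors.

    publications: iterable of author-name lists, one list per paper.
    """
    links = defaultdict(lambda: defaultdict(int))
    for authors in publications:
        # set() guards against a name listed twice on the same paper.
        for a, b in combinations(sorted(set(authors)), 2):
            links[a][b] += 1
            links[b][a] += 1
    return links

def collaborators(links, name):
    """Collaborators of `name`, most frequent first, ties broken alphabetically."""
    return sorted(links.get(name, {}).items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical author lists.
papers = [["Author A", "Author B", "Author C"],
          ["Author A", "Author B"],
          ["Author C", "Author D"]]
links = coauthor_map(papers)
print(collaborators(links, "Author A"))
```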
The web user interface of the e-LiSe software is composed of CGI
scripts connected with dynamically generated XHTML pages. The user
provides the query as a word or words separated by commas. The
system also allows multi-word queries such as “growth factor
receptor” to be submitted. Additionally, a separate search field
for author name queries is available.
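The comma-separated query format can be illustrated with a small parser. The normalization steps shown here (lower-casing, whitespace collapsing, dropping repeated terms) are assumptions made for the sketch, and `parse_query` is a hypothetical name.

```python
def parse_query(raw):
    """Split a comma-separated query into terms, keeping multi-word phrases intact."""
    terms = []
    for part in raw.split(","):
        # Collapse internal whitespace so "growth  factor receptor" stays one phrase.
        term = " ".join(part.lower().split())
        if term and term not in terms:   # skip empty fields and repeats
            terms.append(term)
    return terms

terms = parse_query("Histone,  growth  factor receptor ,,histone")
```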
The results page is generated in the form of sortable tables. The
“correlated words” table is populated with words correlated with the
user query. Each word in this table is directly linked to PubMed,
to provide easy access to the publications from which it was extracted.
Keywords which are less broad (i.e. are present in fewer than 10,000
abstracts) are underlined in black. The “strength” of the correlation
is shown in two separate columns: as the Z-score value, and as the
number of abstracts in which the query words and the correlated word
appeared together. Additionally, the results page can display a table with a
list of authors' names correlated with the user query.
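A row of the “correlated words” table, as described above, could be modelled as follows. Only the two columns and the 10,000-abstract limit come from the text; the `Row` class, its field names, `sort_rows` and the example values are hypothetical.

```python
from dataclasses import dataclass

BROAD_LIMIT = 10_000  # words in fewer abstracts than this are flagged as less broad

@dataclass
class Row:
    word: str
    z_score: float
    shared_abstracts: int    # abstracts containing both the query and the word
    medline_abstracts: int   # abstracts containing the word in the whole database

    @property
    def is_specific(self):
        """True for the less broad keywords that the interface underlines."""
        return self.medline_abstracts < BROAD_LIMIT

def sort_rows(rows, column="z_score"):
    """Sort the table by one of its numeric columns, largest value first."""
    return sorted(rows, key=lambda r: getattr(r, column), reverse=True)

# Hypothetical example values.
rows = [Row("spindle", 40.0, 800, 60_000), Row("monastrol", 55.0, 12, 300)]
by_z = sort_rows(rows)
by_count = sort_rows(rows, column="shared_abstracts")
```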
insight and highlight “regions of knowledge” that ought to be
explored first to gain more in-depth understanding.
We tested the e-LiSe service with queries referring to various levels of
biomedical knowledge. For a simple, “narrow” query we chose
the molecule name “histone”. The output contains highly
specific information regarding histone types (e.g. h1, h3, h4),
biological structures (e.g. chromatin, nucleosome), modifications
(e.g. acetylation, deacetylation, trimethylation), associated enzymes
(e.g. rpd3, pcaf) and genes (e.g. gcn5),
transcription, epigenetic) and known inhibitors of histone
deacetylase (trichostatin and suberoylanilide). Most of these
keywords appear at the top of the keyword list, allowing quick
access to information. To assess the software's ability to provide
descriptions for a “broader” query we used the word “mitosis”.
e-LiSe correctly identified the main phases of mitosis (e.g. anaphase),
structures involved in mitosis (e.g. spindle, centromeres), the proteins
involved (e.g. cdc25, cdc20) and specific, direct inhibitors (e.g.
taxol, monastrol). Finally, to search for a connection between a
process and disease we used the query “schizophrenia”. We
discovered significant correlations with the names of popular drugs
(e.g. clozapine, risperidone) and the names of candidate genes for
susceptibility to the disease (e.g. DISC1, dtnbp1). Again, these results
were mostly found at the top of the associated word list, which
further demonstrates proper sorting of correlated words.
One of the main advantages of our methodology is the ability to
detect significant associations for words occurring in a small fraction
of the analyzed abstracts. In the examples mentioned above, words like
“DISC1” and “monastrol” were present in only a 0.001 fraction of the analyzed
abstracts, yet were placed at the top of the correlated words list,
as they are highly informative and can direct the researcher's
interest to novel areas of knowledge.
To assess the quantitative efficiency of e-LiSe we used well-curated
descriptions from the OMIM database. The first 18 syndromes with
fewer than 4 title words were fetched. These syndromes were
annotated with e-LiSe and compared with the original
descriptions. On average e-LiSe found 55.66 ± 16 words (out of the
default 100) that were present in OMIM descriptions. To assess
statistical significance, 100 permutations without replacement were
used for description randomization. With randomized descriptions
e-LiSe found 15.78 ± 2.29 words per disease. It is also important to
note that e-LiSe found words not present in the OMIM descriptions
which are biologically meaningful (e.g. cryoglobulinemia for
Autoimmune Lymphoproliferative syndrome or CIITA <MHC
Class II Transactivator> for Bare Lymphocyte syndrome).
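The comparison above can be sketched as a word-overlap count with a permutation baseline. The sketch assumes that “description randomization” means reassigning the existing descriptions among the syndromes, and it uses a naive whitespace tokenizer; neither detail is given in the text. `overlap`, `permutation_baseline` and the toy data are hypothetical.

```python
import random
import statistics

def overlap(predicted_words, description):
    """Number of predicted words that also occur in the description text.

    Tokenization is a plain whitespace split, so punctuation is not stripped.
    """
    description_words = set(description.lower().split())
    return sum(1 for w in predicted_words if w.lower() in description_words)

def permutation_baseline(predictions, descriptions, rounds=100, seed=0):
    """Mean overlap per syndrome after shuffling descriptions among syndromes.

    predictions:  syndrome -> list of predicted words
    descriptions: syndrome -> description text
    """
    rng = random.Random(seed)
    names = list(predictions)
    texts = [descriptions[name] for name in names]
    per_round = []
    for _ in range(rounds):
        shuffled = texts[:]
        rng.shuffle(shuffled)  # each description is used exactly once per round
        scores = [overlap(predictions[n], t) for n, t in zip(names, shuffled)]
        per_round.append(statistics.mean(scores))
    return statistics.mean(per_round)

# Toy data, not taken from OMIM.
predictions = {"s1": ["alpha", "beta"], "s2": ["gamma", "delta"]}
descriptions = {"s1": "alpha beta text", "s2": "gamma delta text"}
observed = statistics.mean(overlap(predictions[n], descriptions[n]) for n in predictions)
baseline = permutation_baseline(predictions, descriptions)
```

An observed mean well above the shuffled baseline indicates that the extracted words are specific to their own syndrome rather than generic vocabulary.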
In conclusion, our system automatically analyses user-provided
strings, allowing both single and multi-word queries. It applies the
Z-score to detect statistically significant keywords linked to the
query string. The software includes an interactive web interface
that allows easy access to source publications and a sorting
mechanism for a fast overview of results. Additionally, our software
can search for researchers' names connected with a user
query, or search for keywords connected with certain author(s).
between diseases. It also aims at providing information directly
and specifically linked with user provided context.
SYSTEM AND METHODS
RESULTS AND DISCUSSION
The e-LiSe service is designed to provide brief answers to user
queries in the form of a list of biologically meaningful keywords. These
keywords together should form a sufficient description for a quick
1. Rebholz-Schuhmann D, Kirsch H, Arregui M, Gaudan S, Riethoven M, Stoehr P. EBIMed--text crunching to gather facts for proteins from Medline. Bioinformatics. 2007 Jan 15;23(2):e237-44.
2. Hoffmann R, Valencia A. A gene network for navigating the literature. Nat Genet. 2004;36:664.
3. Perez-Iratxeta C, Bork P, Andrade MA. XplorMed: a tool for exploring MEDLINE abstracts. Trends Biochem Sci. 2001 Sep;26(9):573-5.
4. MEDIE, www-tsujii.is.s.u-tokyo.ac.jp/medie/index.html