ChemEx: A Tool for Chemical Information Extraction

What is ChemEx?

ChemEx is a software tool for chemical knowledge discovery from large collections of publications. The system extracts compound, organism, and assay information using a flexible framework, so it can be customized to the domain of interest. Results from ChemEx can be viewed in a graphical user interface and exported to an XML file.

[ChemEx screenshot]


Command-line options:

Unix: ./ChemEx [-p path] [-mm 2048m] [-v]

-p   Path to a collection folder containing documents (PDF files) and their bibliography in BibTeX format (BIB files)
-mm  Maximum memory to use (for example, 2048m)
-v   Open the ChemEx information viewer

For Windows, use ChemEx.bat.

Examples:

ChemEx.exe -p data -mm 2048m -v
Runs ChemEx on Windows to process the "data" folder using up to 2048 MB of RAM, then opens the viewer once processing is done.

./ChemEx -v
Opens only the viewer on Unix.
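
For scripted runs, the options listed above can be assembled programmatically. The following is a minimal Python sketch, not part of ChemEx itself: the function name build_chemex_command and its parameter names are hypothetical, and only the -p, -mm, and -v options documented above are used.

```python
from typing import List, Optional


def build_chemex_command(
    executable: str = "./ChemEx",
    collection_path: Optional[str] = None,
    max_memory: Optional[str] = None,
    open_viewer: bool = False,
) -> List[str]:
    """Assemble a ChemEx argument list from the documented options.

    This helper is illustrative only; it does not launch ChemEx.
    """
    command = [executable]
    if collection_path is not None:
        # -p: path to the collection folder (PDF and BIB files)
        command += ["-p", collection_path]
    if max_memory is not None:
        # -mm: maximum memory, written as in the examples (e.g. "2048m")
        command += ["-mm", max_memory]
    if open_viewer:
        # -v: open the ChemEx information viewer
        command.append("-v")
    return command


if __name__ == "__main__":
    # Mirrors the Unix form of the first example above.
    print(" ".join(build_chemex_command(
        collection_path="data", max_memory="2048m", open_viewer=True)))
```

The returned list could be passed to subprocess.run from the Python standard library; on Windows, the executable argument would be replaced by the launcher named above.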

An example collection folder is included in the distribution (the "data" folder).

Change Log

ChemEx 1.1.0
- Add structure information service
- Fix bugs on Windows
- Update to OSRA 1.3.9
- Improve image segmentation
- Improve patterns in label recognition
- Improve performance in structure recognition
- Improve patterns in Coordination Resolution
- Make BibTeX optional

ChemEx 1.0.0
- Original public release


ChemEx is built on top of the following software:

- Poppler, PDF rendering library based on the xpdf-3.0 code base.
- UIMA, Unstructured Information Management Applications.
- ChemicalTagger, a phrase-based semantic NLP tool for parsing the language of chemical experiments.
- OSRA, Optical Structure Recognition Application.
- JChemPaint, an editor and viewer for 2D chemical structures.
- Boost, free peer-reviewed portable C++ source libraries.
- Japura, Java Swing framework and collection of components.

All of this software is included in the distribution and/or installed automatically by the installer.


ChemEx provides three dictionaries, as follows:

- Integrated Taxonomic Information System (ITIS; 545,485 records, accessed on 9th December 2011)
- List of Prokaryotic names with Standing in Nomenclature (LPSN; 14,390 records, accessed on 8th December 2011)
- Catalogue of Life (only fungi domain, 55,022 records, accessed on 5th December 2011)

The Catalogue of Life dictionary contains scientific names with Phylum, Family, Genus, and Species information and is included in the distribution. The other two contain only scientific names and are available for download as add-ons. To add the downloaded dictionaries:

- Extract downloaded files into the ChemEx folder.
- Edit the file desc\ConceptMapper\ConceptMapper.xml as in the example below:
    <delegateAnalysisEngine key="ITISMapper">
      <import location="ITISMapper.xml"/>
    </delegateAnalysisEngine>
  For LPSN, change "ITIS" to "Bacterio".
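
The entry to add for each dictionary follows the same pattern, so it can be generated rather than typed by hand. The sketch below is a hypothetical helper (delegate_entry is not part of ChemEx). It assumes that the "ITIS" to "Bacterio" substitution applies to both the key and the imported file name, and it writes the closing tag that a UIMA delegateAnalysisEngine element requires.

```python
def delegate_entry(dictionary_prefix: str) -> str:
    """Return the XML fragment for one add-on dictionary mapper.

    dictionary_prefix is "ITIS" for ITIS or "Bacterio" for LPSN,
    following the naming shown in the instructions above.
    """
    mapper_name = f"{dictionary_prefix}Mapper"
    return (
        f'<delegateAnalysisEngine key="{mapper_name}">\n'
        f'  <import location="{mapper_name}.xml"/>\n'
        f'</delegateAnalysisEngine>'
    )


if __name__ == "__main__":
    # Print the fragments for both downloadable dictionaries.
    for prefix in ("ITIS", "Bacterio"):
        print(delegate_entry(prefix))
```

The printed fragments still have to be pasted into ConceptMapper.xml by hand at the appropriate place; the sketch does not modify the file.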


  • ChemEx is freely available for two operating systems, Ubuntu Linux and Windows, under the terms of the GNU General Public License.
  • If you use ChemEx for your research, please cite:
    Atima Tharatipyakul, Somrak Numnark, Duangdao Wichadakul and Supawadee Ingsriswang (2012), "ChemEx: information extraction system for chemical data curation", BMC Bioinformatics, 13(Suppl 17):S9.

Copyright 2011-2012. Information Systems Laboratory (ISL), Bioresources Technology Unit (BTU), National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand