Genes2WordCloud


This page has been translated into Serbo-Croatian by Anja Skrba from Webhostinggeeks.com.

What is Genes2WordCloud?

Genes2WordCloud is a web-based server application and Java Applet that enables users to create WordClouds from biologically relevant content.

A WordCloud is a visual display of a set of words where the font, size, color or angle can represent some underlying information. A WordCloud is an effective way to visually summarize information about a specific topic of interest. The WordCloud is optimized to maximize the display of the most important terms about a specific topic in the minimum amount of space.


As researchers are faced with a daunting and growing amount of data and text, methods to quickly summarize knowledge about a specific topic from large bodies of text or data are critical. WordClouds are emerging as a method of choice on the web to accomplish this task.

Genes2WordCloud generates WordClouds from the following sources:

A single gene, or a list of genes. The gene(s) can be matched to any of the following resources:
- their GeneRIF annotations;
- their Gene Ontology annotations;
- abstracts of PubMed articles linked to the gene(s) through GeneRIFs;
- their mammalian phenotype annotations from MGI.
Free text, or text extracted from the URL of a website.
An author's name. WordClouds can be created from the PubMed articles returned for a specific author.
A general PubMed search. A WordCloud can be generated from the abstracts returned by any PubMed search.
BMC Bioinformatics most viewed articles. A WordCloud is created from the most viewed BMC Bioinformatics articles for different time periods.

How does it work?

There are two tasks for creating WordClouds: first, generating the keywords to display; and secondly, displaying the keywords.

Generating the keywords

The keywords are generated in several ways depending on the source chosen. In each case the process can be divided into two main tasks: obtain the text related to the user input, and text-mine the text.

Diagram 1 - Main task 1: obtain text from the user input

Diagram 2 - Main task 2: text-mining

Diagram 3 - Text-mining task details

The Porter stemming algorithm is a common stemming algorithm that works well for English. It reduces words such as "stem", "stems" and "stemming" to a single root, here "stem". The root is not always a real English word. Therefore, to obtain a more readable WordCloud, once all the words have been stemmed, each stem is replaced by the shortest word of its family.
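The replacement of each stem by the shortest word of its family can be sketched in Python as follows. This is a minimal illustration and not the Genes2WordCloud implementation: `toy_stem` is a hypothetical, deliberately crude suffix stripper standing in for the real Porter algorithm, and all function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Crude stand-in for the Porter stemmer (hypothetical, for illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "stemm" -> "stem".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


def readable_keywords(words):
    """Count words per stem, then label each stem with the shortest word of its family."""
    counts = Counter()
    families = defaultdict(set)
    for word in words:
        word = word.lower()
        stem = toy_stem(word)
        counts[stem] += 1
        families[stem].add(word)
    # Shortest family member is the label; ties are broken alphabetically.
    return {
        min(families[stem], key=lambda w: (len(w), w)): count
        for stem, count in counts.items()
    }


print(readable_keywords(["stem", "stems", "stemming", "signaling", "signals"]))
# {'stem': 3, 'signals': 2}
```

In the second family the label is "signals" rather than the bare stem, because only words that actually occur in the text are candidates for display.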

Note that some words are removed from the text before the keywords are identified. First, common English words such as "the", "is", "or" and "are" are removed. The list of these words can be found in the following file. Then common biological terms such as "experiments", "abstracts" and "contributes" are removed. These terms are available here. They were chosen by hand curation after experimenting with many WordClouds. Terms that are common in GeneRIF and Gene Ontology annotations are likewise removed when those sources are text-mined. Finally, other terms such as the input gene names, the name of the author, or the keywords of the PubMed search are also removed to avoid self-referencing.
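The filtering described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the word lists are tiny hand-picked subsets rather than the real files, the tokenizer is a simple regular expression, and `remove_stopwords` is a hypothetical helper name.

```python
import re


def remove_stopwords(text, *stopword_lists, extra_terms=()):
    """Return the lowercased tokens of `text` that appear in none of the lists."""
    blocked = set()
    for words in stopword_lists:
        blocked.update(w.lower() for w in words)
    # Query-specific terms (gene names, author name, search keywords).
    blocked.update(t.lower() for t in extra_terms)
    tokens = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    return [t for t in tokens if t not in blocked]


# Illustrative subsets only; the real lists are much longer.
english = ["the", "is", "or", "are", "in", "and"]
biology = ["experiments", "abstracts", "contributes", "cell", "cells"]

text = "NANOG and SOX2 are expressed in the pluripotent stem cells"
print(remove_stopwords(text, english, biology, extra_terms=["NANOG", "SOX2"]))
# ['expressed', 'pluripotent', 'stem']
```

Passing the input gene names as `extra_terms` mirrors the self-referencing rule: the query itself never shows up in its own WordCloud.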

The source files used to create the database for processing lists of genes to create WordClouds were taken from:

NCBI for generating a reference of Entrez gene names. Only mouse, rat and human genes were used (file1, file2, file3)
NCBI file for linking PMIDs to genes. (file4)
NCBI's GeneRIF annotations. (file5)
Gene Ontology annotations. Only mouse, rat and human genes were used. (file6, file7, file8, file9)

Obtaining the text from the user input and text-mining it consume a lot of CPU time and memory. Each query therefore uses at most 150 abstracts or 500 annotations; when a query returns more than these limits, the items to process are picked at random.
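The capping rule can be sketched in Python as follows. The limits of 150 and 500 come from the text above; the function name `cap_results` and the use of uniform sampling without replacement are assumptions, since the exact sampling method is not described.

```python
import random

MAX_ABSTRACTS = 150    # limit stated in the text
MAX_ANNOTATIONS = 500  # limit stated in the text


def cap_results(items, limit, rng=random):
    """Return all items if there are at most `limit`, else a random subset of that size."""
    items = list(items)
    if len(items) <= limit:
        return items
    # Assumption: uniform sampling without replacement.
    return rng.sample(items, limit)


abstracts = [f"abstract {i}" for i in range(1000)]
print(len(cap_results(abstracts, MAX_ABSTRACTS)))      # 150
print(len(cap_results(abstracts[:20], MAX_ABSTRACTS))) # 20
```

One consequence of random sampling is that two runs of the same large query may produce slightly different WordClouds.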

Displaying the WordCloud

There are currently two main web-based applications to create WordClouds from weighted lists of keywords: Wordle, developed by Jonathan Feinberg and indirectly IBM, and WordCram, developed by Dan Bernier. Wordle cannot be used outside of the web application since its source code is protected, whereas WordCram is an open-source Java library using the Java libraries of Processing. Processing is a scripting language that uses Java Applets for creating web-based applications enriched with graphics. Genes2WordCloud is a WordCloud viewer that is based on WordCram.

A web-based user-interface was added to Genes2WordCloud where several parameters such as the font or the background-color can be changed.

Examples

In this section we provide some examples of using Genes2WordCloud.

A GeneRIF-based WordCloud for NANOG and SOX2


NANOG and SOX2 are both genes encoding transcription factors involved in embryonic stem cell self-renewal and pluripotency maintenance. The WordCloud automatically obtained relevant terms such as stem (the word cell was automatically removed as it is considered a common biological term), differentiate, pluripotent and self-renewal. Oct4, a gene that is often associated with NANOG and SOX2, was also recovered by Genes2WordCloud.

A WordCloud based on our laboratory web page


The Ma'ayan Laboratory is a computational systems biology laboratory and the program correctly extracted the most relevant terms that describe the function of the lab, for example: network, mammalian, software, database, compute, web-based tool.

A WordCloud for the p38 pathway based on a PubMed search


This WordCloud was obtained with the PubMed search: p38 pathway. The algorithm recovered terms such as: kinase, signal, MAPK, phosphorylate, apoptosis which are relevant to the p38 pathway, a signaling pathway involved in cell differentiation and apoptosis.

Troubleshooting

What to do if you don't see the WordCloud?

There are three possible explanations:

Java is not working on your computer or within your browser. In this case, to verify and solve the problem, go here.
No terms were found with your input. Normally you should receive a warning message. In some cases it helps to remove punctuation, symbols, or other similar characters, or to verify that you entered correct gene names.
Check that the color of the words is different from the background-color. White words on a white background won't be visible.

If it still doesn't work, you can try to identify the error by opening the Java console on your computer. To do this click here.
Send us the content of the Java console, along with the type of WordCloud you tried to display and the input you used. We will try to debug the error and get back to you.

Using the WordCloud as an applet on your own website

You can use the applet with your own keywords on your own website. To do this:

Download the scripts from here.
Unzip the compressed file in the directory of your website.
To create a WordCloud, write the following code in the body of your HTML web-page:

<?php
include ('path/embed_applet/WordCloud.php');
create_wordcloud('name_of_wordcloud', 'path', 'path_of_textfile_to_textmine.txt', 'path_of_english_forbidden_words_file', 'path_of_biology_forbidden_words_file', 'path_of_other_forbidden_words_file', 'cutoff');
?>

where
- name_of_wordcloud is the name you want to give to your WordCloud. Make sure to use only letters, digits and underscore characters.
- path is the path to the folder that you unzipped. The default path should be "embed_applet/".
- path_of_textfile_to_textmine is the path to the file containing the text to mine. The default file is embed_applet/data/text_to_textmine.txt.
- path_of_english_forbidden_words_file is the path to the file containing the common English words to remove. The words need to be separated by spaces, tabs or returns. The default file is embed_applet/data/stopwords.txt.
- path_of_biology_forbidden_words_file is the path to the file containing the common biological terms you may want to remove. The words need to be separated by spaces, tabs or returns. The default file is embed_applet/data/bio-stopwords.txt.
- path_of_other_forbidden_words_file is the path to a file containing the other words you may want to remove. The words need to be separated by spaces, tabs or returns. The default file is embed_applet/data/other-stopwords.txt.
- cutoff is an integer representing the threshold for keywords. A word that appears fewer times than this value is discarded. The default value is 0.
You need to have PHP installed on your server to be able to use the WordCloud generator. Therefore, the web page in which you embed the WordCloud needs to have a .php extension.
No CSS style is provided with the WordCloud, so if you want to add some CSS properties, we advise you to use Firebug to obtain the names of the HTML elements you want to style.
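The effect of the cutoff parameter from the list above can be sketched in Python as follows. This assumes the cutoff is compared against raw occurrence counts, which the description suggests but does not state explicitly; `apply_cutoff` is a hypothetical name and is not part of the downloadable scripts.

```python
from collections import Counter


def apply_cutoff(tokens, cutoff=0):
    """Keep only the words that occur at least `cutoff` times."""
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n >= cutoff}


tokens = ["kinase", "kinase", "mapk", "signal", "kinase", "signal"]
print(apply_cutoff(tokens))            # {'kinase': 3, 'mapk': 1, 'signal': 2}
print(apply_cutoff(tokens, cutoff=2))  # {'kinase': 3, 'signal': 2}
```

With the default value of 0 every word is kept, so raising the cutoff is mainly useful for long input texts that produce many rare terms.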

If you already have keywords and their weights, you can use them directly as input to the WordCloud. To do so, add the following code to your web page:

<?php
include ('WordCloud.php');
create_wordcloud_weights('name_of_wordcloud', 'path', 'keyword1 weight1 keyword2 weight2 ... keywordn weightn');
?>

where keyword1 weight1 keyword2 weight2 ... keywordn weightn are the keywords and the weights associated with them. These need to be separated by the space character.
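If the keywords and weights are produced by a separate script, the space-separated argument can be assembled programmatically. The following Python sketch shows one way to do so; `format_weights` is a hypothetical helper, and the rejection of keywords containing whitespace is an assumption that follows from the space-separated format rather than a documented rule.

```python
def format_weights(weights):
    """Build the 'keyword1 weight1 keyword2 weight2 ...' string from a dict."""
    parts = []
    for keyword, weight in weights.items():
        # A keyword with whitespace would be ambiguous in a space-separated list.
        if not keyword or any(ch.isspace() for ch in keyword):
            raise ValueError(f"keyword must not contain whitespace: {keyword!r}")
        parts.append(f"{keyword} {weight}")
    return " ".join(parts)


print(format_weights({"kinase": 12, "MAPK": 9, "apoptosis": 5}))
# kinase 12 MAPK 9 apoptosis 5
```

The resulting string can then be pasted as the third argument of create_wordcloud_weights.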

An example of how to embed a WordCloud in a web-page is available in /embed_applet/example.php.

Frequently Asked Questions

What happens to the terms suggested to be removed for all the WordClouds?

These terms are stored in our database. If we agree that they should indeed be removed, we will add them to the common English words list or the common biological terms list.

Contact Information

Genes2WordCloud was developed by the Ma'ayan Laboratory at Mount Sinai School of Medicine as part of the activities of the Systems Biology Center New York (SBCNY).

If you have any issues, questions, remarks or suggestions, please contact us.

References

Visual Presentation as a Welcome Alternative to Textual Presentation of Gene Annotation Information, Jairav Desai, Jared M. Flatow, Jie Song, Lihua J. Zhu, Pan Du, Chiang-Ching Huang, Hui Lu, Simon M. Lin, and Warren A. Kibbe, Advances in Computational Biology, 2010, pages 709-715, Springer.
Wordle, Jonathan Feinberg, 2009
Wikipedia, article on Tag Clouds
Processing libraries
WordCram, Dan Bernier
PubMed E-utilities
Comparison of Tag Cloud Layouts: Task-Related Performance and Visual Exploration, Lohmann, S., Ziegler, J., Tetzlaff, L., T. Gross et al. (Eds.): INTERACT 2009, Part I, LNCS 5726, 2009, pages 392–404.
How to extract keywords from a web page, Dr. David R. Nadeau