This page has been translated into Serbo-Croatian by Anja Skrba from Webhostinggeeks.com
What is Genes2WordCloud?
Genes2WordCloud is a web-based server application and Java Applet that enables users to create WordClouds from biologically relevant content.
A WordCloud is a visual display of a set of words where the font, size, color or angle can represent some underlying information. A WordCloud is an effective way to visually summarize information about a specific topic of interest. The WordCloud is optimized to maximize the display of the most important terms about a specific topic in the minimum amount of space.
As researchers are faced with a daunting and growing amount of data and text, methods to quickly summarize knowledge about a specific topic from large bodies of text or data are critical. WordClouds are emerging as a method of choice on the web to accomplish this task.
Genes2WordCloud generates WordClouds from the following sources:
- A single gene, or a list of genes. Four different resources are used. The gene(s) are matched to one of:
  - their GeneRIF annotations;
  - their Gene Ontology annotations;
  - abstracts of PubMed articles linked to the gene(s) through GeneRIFs;
  - their mammalian phenotype annotations from MGI.
- Free text, or text extracted from the URL of a website, is used to generate a WordCloud.
- An author's name. WordClouds can be created from the PubMed articles returned for a specific author.
- General PubMed search. A WordCloud can be generated from the abstracts returned by any PubMed search.
- BMC Bioinformatics most viewed articles. Displays a WordCloud created from the most viewed BMC Bioinformatics articles for different time periods.
How does it work?
There are two tasks for creating WordClouds: first, generating the keywords to display; and secondly, displaying the keywords.
Generating the keywords
The keywords are generated in several ways depending on the source chosen. In each case the process can be divided into two main tasks: obtain the text related to the user input, and text-mine the text.
Diagram 1 - Main task 1: obtain text from the user input
Diagram 2 - Main task 2: text-mining
Diagram 3 - Text-mining task details
The Porter stemming algorithm is a common stemming algorithm which works well for English.
It reduces words such as "stem", "stems", "stemming" to a single root, e.g., "stem".
The root is not always a real English word. Therefore, to obtain a more readable WordCloud, once all the words have been stemmed, each stemmed word is replaced by the shortest word of its family.
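The replacement step can be illustrated with a short Python sketch. This is not the code used by Genes2WordCloud: the function names are hypothetical, and `crude_stem` is a deliberately simplified suffix stripper standing in for the full Porter algorithm. It only shows how words can be grouped by root and then displayed using the shortest member of each family.

```python
from collections import defaultdict


def crude_stem(word):
    """Simplified suffix stripper (an assumption, NOT the Porter algorithm)."""
    stem = word.lower()
    # Strip one common inflectional suffix, keeping at least 3 characters.
    for suffix in ("ing", "ed", "es", "s"):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= 3:
            stem = stem[: -len(suffix)]
            break
    # Drop a trailing "e" so that "differentiate" and "differentiated" agree.
    if len(stem) > 3 and stem.endswith("e"):
        stem = stem[:-1]
    # Collapse a doubled final consonant: "stemm" -> "stem".
    if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouylsz":
        stem = stem[:-1]
    return stem


def readable_keywords(words):
    """Replace every word by the shortest word sharing its root."""
    families = defaultdict(set)
    for word in words:
        families[crude_stem(word)].add(word.lower())
    # Ties on length are broken alphabetically to keep the result stable.
    shortest = {
        stem: min(members, key=lambda m: (len(m), m))
        for stem, members in families.items()
    }
    return [shortest[crude_stem(word)] for word in words]


print(readable_keywords(["stemming", "stems", "stem"]))
print(readable_keywords(["differentiated", "differentiate", "differentiating"]))
```

In the second call the shared root is "differentiat", which is not an English word, so the family is displayed as "differentiate", its shortest member.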
It should be noted that some words are removed from the text before finding the keywords.
First, all common English words, such as "the", "is" or "are", are removed. The list of these words can be found in the following file.
Then common biological terms, such as "experiments", "abstracts" and "contributes", are removed. These terms are available here. They were chosen by hand curation after experimenting with many WordClouds. Common terms are likewise removed when text-mining GeneRIF and Gene Ontology annotations.
Finally, other terms such as the input gene names, the name of the author, or the keywords of the Pubmed search are also removed to avoid self-referencing.
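The three removal steps above can be sketched in Python as follows. The word lists here are tiny illustrative subsets and not the actual files used by the server; the tokenizing pattern and the function name `count_keywords` are likewise assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative subsets only; the real lists are much longer.
ENGLISH_STOPWORDS = {"the", "is", "are", "a", "of", "in", "and"}
BIOLOGY_STOPWORDS = {"experiments", "abstracts", "contributes", "cell"}


def count_keywords(text, query_terms):
    """Count the words of `text` that survive the three removal steps."""
    # Assumed tokenization: lowercase runs of letters, digits and hyphens.
    tokens = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    # The query terms (gene names, author name, search keywords) are removed
    # as well, to avoid self-referencing.
    removed = (
        ENGLISH_STOPWORDS
        | BIOLOGY_STOPWORDS
        | {term.lower() for term in query_terms}
    )
    return Counter(token for token in tokens if token not in removed)


counts = count_keywords(
    "NANOG and SOX2 are required in the stem cell; stem cell experiments",
    query_terms=["NANOG", "SOX2"],
)
print(counts.most_common())
```

Only "stem" and "required" survive in this example; the counts can then serve as keyword weights for the display step.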
The source files used to create the database for processing lists of genes to create WordClouds were taken from:
- NCBI for generating a reference of Entrez gene names. Only mouse, rat and human genes were used (file1, file2, file3)
- NCBI file for linking PMIDs to genes. (file4)
- NCBI's GeneRifs annotations. (file5)
- Gene Ontology annotations. Only mouse, rat and human genes were used. (file6, file7, file8, file9)
Obtaining text from the user input and text-mining it consume a lot of CPU time and memory. When a query returns more than 150 abstracts or 500 annotations, we therefore use only a random sample of 150 abstracts or 500 annotations.
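A minimal Python sketch of such a cap is shown below. The constant and function names are hypothetical, and the server may select its random subset differently; only the two limits come from the text above.

```python
import random

MAX_ABSTRACTS = 150    # limit stated above
MAX_ANNOTATIONS = 500  # limit stated above


def cap_randomly(items, limit, rng=random):
    """Return all items if there are at most `limit`, else a random sample."""
    items = list(items)
    if len(items) <= limit:
        return items
    # Sampling without replacement: no item is picked twice.
    return rng.sample(items, limit)


abstracts = [f"abstract {i}" for i in range(1000)]
print(len(cap_randomly(abstracts, MAX_ABSTRACTS)))
```

Passing `rng=random.Random(0)` makes the selection reproducible, which can help when comparing WordClouds for the same query.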
Displaying the WordCloud
There are currently two main web-based applications to create WordClouds from weighted lists of keywords: Wordle, developed by Jonathan Feinberg (and, indirectly, IBM), and WordCram, developed by Dan Bernier. Wordle cannot be used outside of the web application since its source code is protected, whereas WordCram is an open-source Java library using the Java libraries of Processing. Processing is a scripting language that uses Java Applets for creating web-based applications enriched with graphics. Genes2WordCloud is a WordCloud viewer that is based on WordCram.
A web-based user-interface was added to Genes2WordCloud where several parameters such as the font or the background-color can be changed.
In this section we provide some examples of using Genes2WordCloud.
A GeneRIF-based WordCloud for NANOG and SOX2
NANOG and SOX2 both encode transcription factors involved in embryonic stem cell self-renewal and pluripotency maintenance. The WordCloud automatically recovered relevant terms such as stem (the word cell was automatically removed because it is considered a common biological term), differentiate, pluripotent and self-renewal. Oct4, a gene that is often associated with NANOG and SOX2, was also recovered by Genes2WordCloud.
A WordCloud that is based on our laboratory web-page was also created as an example
The Ma'ayan Laboratory is a computational systems biology laboratory and the program correctly extracted the most relevant terms that describe the function of the lab, for example: network, mammalian, software, database, compute, web-based tool.
A WordCloud for the p38 pathway based on a PubMed search
This WordCloud was obtained with the PubMed search: p38 pathway. The algorithm recovered terms such as: kinase, signal, MAPK, phosphorylate, apoptosis which are relevant to the p38 pathway, a signaling pathway involved in cell differentiation and apoptosis.
What to do if you don't see the WordCloud?
There are three possible explanations:
- Java is not working on your computer or within your browser. In this case, to verify and solve the problem, go here.
- No terms were found with your input. Normally you should receive a warning message. In some cases, try removing punctuation, symbols, or other similar characters, or verify that you entered correct gene names.
- Check that the color of the words is different from the background-color. White words on a white background won't be visible.
If it still doesn't work, you can try to identify the error by opening the Java console on your computer. To do this, click here. Please send us the content of the Java console, along with the type of WordCloud you tried to display and the input you used. We will try to debug the error and get back to you.
Using the WordCloud as an applet on your own website?
You can use the applet with your own keywords on your own website. To do this:
- Download the scripts from here.
- Unzip the compressed file into the directory of your website.
- To create a WordCloud, write the following code in the body of your HTML web-page:
create_wordcloud('name_of_wordcloud', 'path', 'path_of_textfile_to_textmine', 'path_of_english_forbidden_words_file', 'path_of_biology_forbidden_words_file', 'path_of_other_forbidden_words_file', 'cutoff');
- name_of_wordcloud is the name you want to give to your WordCloud. Make sure to use only letters, digits and underscore characters.
- path is the path to the folder that you unzipped. The default path should be "embed_applet/".
- path_of_textfile_to_textmine is the path to the file containing the text to mine. The default file is in embed_applet/data/text_to_textmine.txt
- path_of_english_forbidden_words_file is the path to the file containing the common English words to remove. The words need to be separated by spaces, tabs or returns. The default file is in embed_applet/data/stopwords.txt.
- path_of_biology_forbidden_words_file is the path to the file containing the common biological terms you may want to remove. The words need to be separated by spaces, tabs or returns. The default file is in embed_applet/data/bio-stopwords.txt.
- path_of_other_forbidden_words_file is the path to a file containing the other words you may want to remove. The words need to be separated by spaces, tabs or returns. The default file is in embed_applet/data/other-stopwords.txt.
- cutoff is an integer representing the threshold for keywords. A word appearing fewer times than this value won't be kept. The default value is 0.
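To make the file format and the cutoff concrete, here is a small Python sketch. It is an illustration only and not part of the downloadable scripts: the function names are hypothetical, and lowercasing the words is an assumption. It reads a forbidden-words file whose words are separated by spaces, tabs or returns, and drops keywords that appear fewer times than the cutoff.

```python
from pathlib import Path


def load_stopwords(path):
    """Read a forbidden-words file separated by spaces, tabs or returns."""
    # str.split() with no argument splits on any run of whitespace.
    # Lowercasing is an assumption made for this sketch.
    return set(Path(path).read_text(encoding="utf-8").lower().split())


def apply_cutoff(counts, cutoff=0):
    """Keep only the keywords that appear at least `cutoff` times."""
    return {word: n for word, n in counts.items() if n >= cutoff}


print(apply_cutoff({"stem": 3, "kinase": 1}, cutoff=2))
```

With the default cutoff of 0 every keyword is kept, which matches the default behaviour described above.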
You need to have PHP installed on your server to be able to use the WordCloud generator. Therefore, the web page in which you embed the WordCloud needs to have a .php extension.
No CSS style is applied to the WordCloud, so if you want to add some CSS properties, we advise you to use Firebug to obtain the names of the HTML elements you want to style.
If you already have keywords and weights for the keywords, you can directly use them as an input to the WordCloud. For this you need to write in your own HTML code as follows:
create_wordcloud_weights('name_of_wordcloud', 'path', 'keyword1 weight1 keyword2 weight2 ... keywordn weightn');
where keyword1 weight1 keyword2 weight2 ... keywordn weightn are the keywords and the weights associated with them. These need to be separated by the space character.
An example of how to embed a WordCloud in a web-page is available in /embed_applet/example.php.
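If your keywords and weights come from another program, the space-separated string expected by create_wordcloud_weights can be built and checked with a few lines of code. The Python sketch below is only an illustration: the helper names are hypothetical, and treating weights as numbers that may be non-integers is an assumption.

```python
def format_weighted_keywords(pairs):
    """Build the 'keyword1 weight1 keyword2 weight2 ...' string."""
    for keyword, _ in pairs:
        # A keyword containing whitespace would break the alternation.
        if not keyword or any(ch.isspace() for ch in keyword):
            raise ValueError(f"keyword must not be empty or contain spaces: {keyword!r}")
    return " ".join(f"{keyword} {weight:g}" for keyword, weight in pairs)


def parse_weighted_keywords(spec):
    """Split the string back into (keyword, weight) pairs."""
    tokens = spec.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating keyword and weight tokens")
    return [(tokens[i], float(tokens[i + 1])) for i in range(0, len(tokens), 2)]


spec = format_weighted_keywords([("kinase", 12), ("MAPK", 7.5)])
print(spec)
print(parse_weighted_keywords(spec))
```

Because the space character is the only separator, multi-word keywords cannot be passed in this format; joining them with a hyphen or underscore is one possible workaround.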
Frequently Asked Questions
What happens to the terms suggested to be removed for all the WordClouds?
These terms are stored in our database. If we agree that they should indeed be removed, we will add them to the common English words list or the common biological terms list.
- Visual Presentation as a Welcome Alternative to Textual Presentation of Gene Annotation Information, Jairav Desai, Jared M. Flatow, Jie Song, Lihua J. Zhu, Pan Du, Chiang-Ching Huang, Hui Lu, Simon M. Lin, and Warren A. Kibbe, Advances in Computational Biology, 2010, pages 709-715, Springer.
- Wordle, Jonathan Feinberg, 2009
- Wikipedia, article on Tag Clouds
- Processing libraries
- WordCram, Dan Bernier
- Pubmed e-utilities
- Comparison of Tag Cloud Layouts: Task-Related Performance and Visual Exploration, S. Lohmann, J. Ziegler, and L. Tetzlaff, in T. Gross et al. (Eds.): INTERACT 2009, Part I, LNCS 5726, 2009, pages 392–404.
- How to extract keywords from a web page, Dr. David R. Nadeau