Summary: With an exponentially growing number of articles being published every year, scientists can use some help in determining which journal is most appropriate for publishing their results, and which other scientists can be called upon to review their work.
Jane (Journal/Author Name Estimator) is a freely available web-based application that, on the basis of a sample text (e.g. the title and abstract of a manuscript), can suggest journals and experts who have published similar articles.
PubMed (Wheeler et al., 2007) is growing exponentially. In 1996, 520 148 articles were published versus 793 919 in 2006. Interestingly, the number of different journals in which these articles were published did not show a similar growth: 5006 in 1996 versus 5100 in 2006. There is a steady turnover: according to the PubMed Journals database, 1707 journals were started between 1996 and 2006. The number of authors publishing one or more papers every year does increase rapidly: 543 974 in 1996 versus 867 919 in 2006.
For all these authors, finding the appropriate journal to publish their work becomes increasingly difficult: many journals deal with a wide diversity of topics, and many articles are multi-disciplinary, leading for instance to computer scientists publishing in biomedical journals. At the same time, finding reviewers among the growing number of peers also becomes more of a problem. We developed Jane (Journal/Author Name Estimator) to help with both tasks.
2 USING JANE
2.1 Finding journals and authors
The user starts by entering a piece of text as a query (Fig. 1). Typically, this will be the title and abstract of the article for which the user wants to find a suitable journal or reviewer. The application returns an ordered list of results, with a confidence score for each item. Furthermore, it is possible to show the articles on which the score of a specific journal or author was based, as well as other similar articles. This can help a user evaluate whether the journal really is a suitable medium for publishing his or her findings, or whether the selected author really is knowledgeable about the topic of the article used as input.
2.2 Extra features
Users can refine their search by selecting specific languages and types of publications. The search algorithm will then compare the input text only to those articles that meet these specifications. For instance, by selecting ‘Japanese’ and the publication type ‘review’, the system will return those journals containing the most similar Japanese review articles.
Some authors may be hesitant to send an abstract of their latest research to an unknown server. We have therefore included an option to scramble the input before submission. Scrambling simply puts the words of the text in alphabetical order, which makes it next to impossible to reconstruct the original text, but has no effect on the search.
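Because the search treats the text as a bag of words, alphabetical scrambling is lossless for retrieval. A minimal sketch of the idea (an illustration, not Jane's actual code):

```python
def scramble(text: str) -> str:
    """Sort the words alphabetically: sentence structure (and hence
    the original meaning) is destroyed, but the word counts -- all
    that a bag-of-words search sees -- are preserved exactly."""
    return " ".join(sorted(text.split()))

print(scramble("suggest journals for this abstract"))
# abstract for journals suggest this
```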
3 HOW JANE WORKS
The open source search engine Lucene (Gospodnetic and Hatcher, 2005) is used to find articles that are similar to the input query. Texts are tokenized using the standard Lucene tokenizer, and are subsequently compared using the Lucene MoreLikeThis algorithm, a very efficient implementation of the traditional TF*IDF vector space model.
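The vector space comparison can be sketched as follows. This is a toy TF*IDF cosine similarity for illustration only; Lucene's MoreLikeThis adds term-selection heuristics and uses its own weighting scheme:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF*IDF vectors for a small corpus."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # term frequency weighted by inverse document frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["gene expression in cancer", "gene expression in yeast",
        "stock market price prediction"]
vecs = tfidf_vectors(docs)
# the two biology texts score higher together than either does with the finance one
```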
After retrieving the ordered list of most similar records, a weighted k-nearest neighbor approach is used to determine the journal or author list. For each item (i.e. a journal or author), we add the Lucene similarity scores for the articles belonging to this item in the k top-ranking records. To produce confidence scores, these sums are then normalized so that the scores add up to 100%. Results are ordered by confidence score. A leave-one-out evaluation showed that the best performance was achieved using k = 50.
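The aggregation step can be sketched as below; the journal names and similarity scores are made up for illustration:

```python
from collections import defaultdict

def rank_items(top_records, k=50):
    """Weighted k-NN: sum the similarity scores of the k top-ranking
    records per item (journal or author), then normalize the sums so
    they add up to 100% and sort by that confidence score.
    `top_records` is a list of (item, similarity) pairs, best first."""
    sums = defaultdict(float)
    for item, score in top_records[:k]:
        sums[item] += score
    total = sum(sums.values())
    return sorted(((item, 100.0 * s / total) for item, s in sums.items()),
                  key=lambda pair: pair[1], reverse=True)

hits = [("Bioinformatics", 0.9), ("BMC Bioinformatics", 0.8),
        ("Bioinformatics", 0.7), ("Nucleic Acids Res", 0.4)]
print(rank_items(hits))
# Bioinformatics tops the list with 1.6/2.8, i.e. roughly 57% confidence
```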
We indexed all 4 171 368 articles from 4513 journals in Medline that:
- contained an abstract,
- were published in the last 10 years,
- did not belong to one of these categories: comment, editorial, news, historical article, congresses, biography, newspaper article, practice guideline, interview, bibliography, legal cases, lectures, consensus development conference, addresses, clinical conference, patient education handout, directory, technical report, festschrift, retraction of publication, retracted publication, duplicate publication, scientific integrity review, published erratum, periodical index, dictionary, legislation or government publication, and
- belonged to a journal with at least 25 publications in the last 10 years, and at least one publication in the last 12 months.
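The selection criteria above can be expressed as a record filter. A sketch under assumed data structures (the field names and the abridged exclusion set are hypothetical; the full category list is given above):

```python
from datetime import date, timedelta

# Abridged; the full exclusion list is spelled out in the text above.
EXCLUDED_TYPES = {"comment", "editorial", "news", "historical article",
                  "biography", "retraction of publication"}

def eligible(article, journal_10y, journal_12m, today):
    """Hypothetical filter mirroring the four indexing criteria.
    `article`: dict with 'abstract', 'pub_date', 'types', 'journal';
    `journal_10y` / `journal_12m`: journal -> publication counts in
    the last 10 years and the last 12 months, respectively."""
    if not article.get("abstract"):
        return False
    if article["pub_date"] < today - timedelta(days=3652):  # ~10 years
        return False
    if any(t.lower() in EXCLUDED_TYPES for t in article["types"]):
        return False
    j = article["journal"]
    return journal_10y.get(j, 0) >= 25 and journal_12m.get(j, 0) >= 1
```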
4 COMPARISON WITH OTHER TOOLS
PubMed itself offers the possibility to search for ‘similar articles’, but only existing Medline records can be used as queries. There are many other systems that offer some means of finding authors and/or journals, but they all use a boolean keyword-based query, for instance GoPubMed (Doms and Schroeder, 2005), and HubMed (Eaton, 2006).
One system, called eTBLAST (Errami et al., 2007), does accept full abstracts to search for journals and authors. It retrieves the 400 most similar articles using a vector-space approach, and for these articles a text-alignment score is calculated and aggregated per journal or author. We compared the performance of Jane to eTBLAST using a random set of 1000 citations entered into PubMed in the 3 days before the test, which were consequently not in the training sets of either Jane or eTBLAST at that time. For each citation, we tested how well the systems predicted the authors of the paper, and the journal in which the paper was published.
Figure 2 shows that Jane outperforms eTBLAST (P < 0.001 and P = 0.010 for journals and authors, respectively, using a sign test to compare ranks). Furthermore, even though eTBLAST runs on a 20 CPU Linux cluster and Jane was tested on a dual CPU system, eTBLAST searches were much slower than Jane searches: the average search times were 114.0 and 0.6 seconds, respectively. Because eTBLAST currently has more users than Jane, we simulated an extra average load of 100 000 queries per day on our server whilst determining our search time.
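The rank comparison uses a sign test. An exact two-sided version can be sketched as follows (an illustration of the statistic, not necessarily the implementation used for Figure 2):

```python
from math import comb

def sign_test(ranks_a, ranks_b):
    """Exact two-sided sign test on paired ranks (ties dropped).
    Lower rank = better; returns the p-value for the null hypothesis
    that neither system tends to rank the correct answer higher."""
    wins_a = sum(a < b for a, b in zip(ranks_a, ranks_b))
    wins_b = sum(b < a for a, b in zip(ranks_a, ranks_b))
    n = wins_a + wins_b
    # one-sided tail probability of the less frequent outcome,
    # doubled for a two-sided test and capped at 1
    k = min(wins_a, wins_b)
    p = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * p)
```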
5 DISCUSSION
Compared with other such tools, Jane is a simple, fast and accurate tool for finding journals and authors.
We tested how well Jane predicts the journal in which a paper was published, assuming that this journal was the most appropriate one. Obviously, this may not always be the case, since many journals overlap considerably and journal choice may be influenced by many factors. A qualitative analysis of a small sample of the abstracts for which the correct journal did not appear in the top 10 suggested that these abstracts would also have been appropriate for many of the top-ranking journals returned by Jane. The same holds true for authors: although we can assume that an author is knowledgeable about the paper (s)he wrote, other, more experienced authors might qualify as better experts.
Jane is freely available. The underlying database of indexed abstracts will regularly be updated.
This study was supported by the Biorange project sp 4.1.1. of the Netherlands Bioinformatics Centre.
Conflict of Interest: none declared.