Abstract

Summary: With an exponentially growing number of articles being published every year, scientists can use some help in determining which journal is most appropriate for publishing their results, and which other scientists can be called upon to review their work.

Jane (Journal/Author Name Estimator) is a freely available web-based application that, on the basis of a sample text (e.g. the title and abstract of a manuscript), can suggest journals and experts who have published similar articles.

Availability: http://biosemantics.org/jane

Contact: m.schuemie@erasmusmc.nl

1 INTRODUCTION

PubMed (Wheeler et al., 2007) is growing exponentially. In 1996, 520 148 articles were published versus 793 919 in 2006. Interestingly, the number of different journals in which these articles were published did not show a similar growth: 5006 in 1996 versus 5100 in 2006. There is a steady turnover: according to the PubMed Journals database, 1707 journals were started between 1996 and 2006. The number of authors publishing one or more papers every year does increase rapidly: 543 974 in 1996 versus 867 919 in 2006.

For all these authors, finding the appropriate journal to publish their work becomes increasingly difficult: many journals deal with a wide diversity of topics, and many articles are multi-disciplinary, leading for instance to computer scientists publishing in biomedical journals. At the same time, finding reviewers among the growing number of peers also becomes more of a problem. We developed Jane (Journal/Author Name Estimator) to help with both tasks.

2 USING JANE

2.1 Finding journals and authors

The user starts by entering a piece of text as a query (Fig. 1). Typically, this will be the title and abstract of the article for which the user wants to find a suitable journal or reviewer. The application returns an ordered list of results with a confidence score for each item. Furthermore, it is possible to show the articles on which the score of a specific journal or author was based, as well as other similar articles. This can help the user evaluate whether the journal really is a suitable venue for publishing his or her findings, or whether the selected author really is knowledgeable about the topic of the input article.

Fig. 1.

Screenshots of Jane. From left to right: (1) Starting screen: you can enter the text of your title and abstract, select additional options, and choose whether you want to find journals or authors; (2) Results screen: the application returns an ordered list of journals or authors. For each item, a confidence score is given, and an option to show the articles on which the score is based; (3) Results screen showing the articles for a journal: the user can choose to view these and other similar articles in PubMed.

2.2 Extra features

Users can refine their search by selecting specific languages and types of publications. The search algorithm will then compare the input text only to those articles that meet these specifications. For instance, by selecting ‘Japanese’ and the publication type ‘review’, the system will return those journals containing the most similar Japanese review articles.

Some authors may be hesitant to send the abstract of their latest research to an unknown server. We have therefore included an option to scramble the input before submission. Scrambling simply sorts the words of the text into alphabetical order, which makes it next to impossible to reconstruct the original text but has no effect on the search, because the search ignores word order.
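Because bag-of-words retrieval is insensitive to word order, the scrambling step can be as simple as the following sketch (the exact tokenization Jane applies before scrambling is not specified in the text; a whitespace/word-character split is assumed here):

```python
import re

def scramble(text: str) -> str:
    """Sort the words of the input alphabetically.

    Bag-of-words retrieval ignores word order, so this leaves
    the search result unchanged while making it next to
    impossible to reconstruct the original sentences.
    """
    words = re.findall(r"\w+", text.lower())
    return " ".join(sorted(words))

print(scramble("Jane suggests journals from a sample abstract"))
```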

3 IMPLEMENTATION

The open source search engine Lucene (Gospodnetic and Hatcher, 2005) is used to find articles that are similar to the input query. Texts are tokenized using the standard Lucene tokenizer, and are subsequently compared using the Lucene MoreLikeThis algorithm, a very efficient implementation of the traditional TF*IDF vector space model.
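Lucene's MoreLikeThis adds term selection heuristics and boosting on top of the basic model, but the underlying TF*IDF vector-space comparison can be illustrated with a minimal sketch (function names and the toy corpus are illustrative, not part of Jane):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF*IDF weight vectors for a corpus of tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))       # document frequency
    idf = {t: math.log(n / df[t]) for t in df}              # inverse document frequency
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["gene", "expression"], ["gene", "network"], ["protein", "folding"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # shared term "gene" gives a nonzero score
```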

After retrieving the ordered list of most similar records, a weighted k-nearest neighbor approach is used to determine the journal or author list. For each item (i.e. a journal or author), we add the Lucene similarity scores for the articles belonging to this item in the k top-ranking records. To produce confidence scores, these sums are then normalized so that the scores add up to 100%. Results are ordered by confidence score. A leave-one-out evaluation showed that the best performance was achieved using k = 50.
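The aggregation step described above can be sketched as follows, assuming the retrieval step yields (item, similarity) pairs in descending order of similarity (the journal names are illustrative):

```python
from collections import defaultdict

def rank_items(neighbors, k=50):
    """Weighted k-nearest-neighbor aggregation.

    Sum the similarity scores per item (journal or author) over the
    k top-ranking records, then normalize so the confidence scores
    add up to 100%, and order the result by confidence.
    """
    totals = defaultdict(float)
    for item, score in neighbors[:k]:
        totals[item] += score
    total = sum(totals.values())
    return sorted(
        ((item, 100.0 * s / total) for item, s in totals.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

hits = [("Bioinformatics", 0.9), ("Nucleic Acids Res", 0.7), ("Bioinformatics", 0.4)]
print(rank_items(hits))
```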

We indexed all 4 171 368 articles from 4513 journals in Medline that

  • contained an abstract,

  • were published in the last 10 years,

  • did not belong to one of these categories: comment, editorial, news, historical article, congresses, biography, newspaper article, practice guideline, interview, bibliography, legal cases, lectures, consensus development conference, addresses, clinical conference, patient education handout, directory, technical report, festschrift, retraction of publication, retracted publication, duplicate publication, scientific integrity review, published erratum, periodical index, dictionary, legislation or government publication, and

  • belonged to a journal with at least 25 publications in the last 10 years, and at least one publication in the last 12 months.
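The selection criteria above amount to a per-record predicate; a sketch follows, in which the record field names, the cutoff-year parameter, and the journal-statistics lookups are hypothetical (only a subset of the excluded publication types is shown):

```python
EXCLUDED_TYPES = {"comment", "editorial", "news", "historical article"}  # subset shown

def include_article(rec, journal_counts, journal_recent, cutoff_year):
    """Apply the Medline selection criteria (field names hypothetical):
    has an abstract, recent enough, no excluded publication type, and
    from a journal with >= 25 recent publications that is still active."""
    return (
        bool(rec.get("abstract"))
        and rec["year"] >= cutoff_year
        and not (set(rec.get("types", [])) & EXCLUDED_TYPES)
        and journal_counts.get(rec["journal"], 0) >= 25
        and journal_recent.get(rec["journal"], False)
    )

journal_counts = {"J. Example": 30}
journal_recent = {"J. Example": True}
rec_ok = {"abstract": "text", "year": 2005,
          "types": ["journal article"], "journal": "J. Example"}
```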

4 COMPARISON WITH OTHER TOOLS

PubMed itself offers the possibility to search for ‘similar articles’, but only existing Medline records can be used as queries. There are many other systems that offer some means of finding authors and/or journals, but they all use Boolean keyword-based queries, for instance GoPubMed (Doms and Schroeder, 2005) and HubMed (Eaton, 2006).

One system, eTBLAST (Errami et al., 2007), does accept full abstracts as queries for journals and authors. It retrieves the 400 most similar articles using a vector-space approach, calculates a text-alignment score for each of these articles, and aggregates the scores per journal or author. We compared the performance of Jane with that of eTBLAST using a random set of 1000 citations entered into PubMed in the 3 days before the test, which consequently were not yet in the training sets of Jane or eTBLAST at that time. For each citation, we tested how well the systems predicted the authors of the paper and the journal in which the paper was published.

Figure 2 shows that Jane outperforms eTBLAST (P < 0.001 and P = 0.010 for journals and authors, respectively, using a sign test to compare ranks). Furthermore, even though eTBLAST runs on a 20 CPU Linux cluster and Jane was tested on a dual CPU system, eTBLAST searches were much slower than Jane searches: the average search times were 114.0 and 0.6 seconds, respectively. Because eTBLAST currently has more users than Jane, we simulated an extra average load of 100 000 queries per day on our server whilst determining our search time.
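The sign test used above compares, for each citation pair, which system ranked the correct answer higher, discarding ties. A minimal two-sided version built on the binomial distribution (the paper does not specify its exact tie-handling, so this is an assumed implementation):

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test p-value with ties excluded:
    2 * P(X <= min(wins_a, wins_b)) for X ~ Binomial(n, 0.5)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

# e.g. system A ranks the correct journal higher on 8 of 10 untied queries
print(sign_test_p(8, 2))
```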

Fig. 2.

Cumulative histogram of the rank of the correct journal and the highest ranking correct author in the result lists of eTBLAST and Jane for a test set of 1000 abstracts (e.g. for Jane, the correct journal appeared at the top of the list for 23% of the abstracts, it appeared in the top 2 for 36% of the abstracts, etc.).

5 DISCUSSION

Jane is a simple, fast and accurate tool for finding journals and authors, as compared to other such tools.

We tested how well Jane predicts the journal in which a paper was published, assuming that this journal was the most appropriate one. Obviously, this may not always be the case, since many journals overlap considerably and journal choice may be influenced by many factors. A qualitative analysis of a small sample of the abstracts for which the correct journal did not appear in the top 10 suggested that these abstracts would also have been appropriate for many of the top-ranking journals returned by Jane. The same holds true for authors: although we can assume that an author is knowledgeable about the paper he or she wrote, other, more experienced authors might qualify as better experts.

Jane is freely available. The underlying database of indexed abstracts will be updated regularly.

ACKNOWLEDGEMENTS

This study was supported by the Biorange project sp 4.1.1. of the Netherlands Bioinformatics Centre.

Conflict of Interest: none declared.

REFERENCES

Doms,A. and Schroeder,M. (2005) GoPubMed: exploring PubMed with the gene ontology. Nucleic Acids Res., 33, W783–W786.

Eaton,A.D. (2006) HubMed: a web-based biomedical literature search interface. Nucleic Acids Res., 34, W745–W747.

Errami,M. et al. (2007) eTBLAST: a web server to identify expert reviewers, appropriate journals and similar publications. Nucleic Acids Res., 35, W12–W15.

Gospodnetic,O. and Hatcher,E. (2005) Lucene in Action. Manning Publications, Greenwich.

Wheeler,D.L. et al. (2007) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 35, D5–D12.

Author notes

Associate Editor: Jonathan Wren
