Summary: We present a fast and flexible program for clustering large protein databases at different sequence identity levels. The all-against-all sequence comparison and clustering of the non-redundant protein database of over 560 000 sequences takes less than 2 h on a high-end PC. The output database, which contains only the representative sequences, can be used for more efficient and sensitive database searches.
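
To make the idea of clustering at a chosen identity level and keeping only representatives concrete, the following is a minimal sketch, not the program's actual implementation. It assumes a greedy scheme in which sequences are processed longest first and each one either joins the first representative it matches or becomes a new representative. The summary does not describe the comparison method, so the ungapped, position-by-position identity measure here is a toy stand-in, and the names identity, greedy_cluster and representatives are hypothetical.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions over the shorter sequence.

    Assumption: ungapped, position-by-position comparison. A real tool
    would use an alignment-based identity instead.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs, threshold):
    """Greedy clustering sketch: returns a list of (representative, members).

    Sequences are visited longest first. Each sequence joins the first
    cluster whose representative it matches at or above `threshold`;
    otherwise it starts a new cluster as its representative.
    """
    clusters = []
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, s) >= threshold:
                members.append(s)
                break
        else:
            # No existing representative is similar enough.
            clusters.append((s, [s]))
    return clusters


def representatives(clusters):
    """Keep only the representative of each cluster (the reduced database)."""
    return [rep for rep, _ in clusters]


if __name__ == "__main__":
    db = ["MKTAYIAKQR", "MKTAYIAKQL", "GGGGSSSSPP", "MKTAYIAK"]
    for level in (0.9, 1.0):
        reps = representatives(greedy_cluster(db, level))
        print(level, reps)
```

Lowering the threshold merges more sequences into each cluster and so shrinks the representative set, which is the trade-off behind clustering the same database at several identity levels.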

Availability: The program is available from http://bioinformatics.burnham-inst.org/cd-hi

Contact: liwz@sdsc.edu or adam@burnham-inst.org
