The CAFA Challenge:
The problem: There are many proteins in the databases for which the sequence is known, but the function is not. The gap between what we know and what we do not know is growing. A major challenge in the field of bioinformatics is to predict the function of a protein from its sequence or structure. At the same time, how can we judge how well these function prediction algorithms are performing?
The solution: The Critical Assessment of protein Function Annotation algorithms (CAFA) is an experiment designed to provide a large-scale assessment of computational methods dedicated to predicting protein function, using a time challenge. Briefly, CAFA organizers provide a large number of protein sequences. The predictors then predict the function of these proteins by associating them with Gene Ontology terms or Human Phenotype Ontology terms (Blue “prediction” section of timeline). Following the prediction deadline, we wait for several months. During that time, some proteins whose function had not been determined experimentally receive experimental verification (Green “annotation growth” section of timeline). Those proteins constitute the benchmark against which the methods are tested (Orange “assessment” portion of timeline). You can read about CAFA 1 here and in the paper published in Nature Methods, and you can read about CAFA 2 here.
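The time-challenge idea can be sketched in a few lines of Python. This is a minimal illustration, not the assessors' actual pipeline: the function names, the tuple layout and the toy data are hypothetical, and the set of evidence codes treated as experimental is an assumption (the codes CAFA actually accepts are defined in the rules). A target enters the benchmark only if it had no experimentally supported annotation at the prediction deadline and has at least one by assessment time.

```python
from typing import Dict, Iterable, Set, Tuple

# GO evidence codes that denote experimental support.
# Assumption: the exact set used for CAFA scoring is defined in the rules.
EXPERIMENTAL_CODES = frozenset({"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"})

# Hypothetical record layout: (protein_id, term_id, evidence_code)
Annotation = Tuple[str, str, str]


def experimental_terms(annotations: Iterable[Annotation]) -> Dict[str, Set[str]]:
    """Map each protein to the terms it holds with experimental evidence."""
    result: Dict[str, Set[str]] = {}
    for protein, term, code in annotations:
        if code in EXPERIMENTAL_CODES:
            result.setdefault(protein, set()).add(term)
    return result


def select_benchmark(
    targets: Iterable[str],
    at_deadline: Iterable[Annotation],
    at_assessment: Iterable[Annotation],
) -> Dict[str, Set[str]]:
    """Keep targets that gained their first experimental annotation
    between the prediction deadline and the assessment."""
    before = experimental_terms(at_deadline)
    after = experimental_terms(at_assessment)
    return {p: after[p] for p in targets if p not in before and p in after}


# Toy data for illustration only.
targets = ["P1", "P2", "P3"]
at_deadline = [
    ("P1", "GO:0003674", "IEA"),  # electronic annotation, not experimental
    ("P2", "GO:0005515", "IPI"),  # already experimentally annotated
]
at_assessment = at_deadline + [
    ("P1", "GO:0004672", "IDA"),  # new experimental annotation
    ("P3", "GO:0005634", "ISS"),  # sequence similarity, not experimental
]

benchmark = select_benchmark(targets, at_deadline, at_assessment)
print(benchmark)
```

In this toy run only P1 qualifies: P2 already had experimental evidence at the deadline, and P3 gained only a non-experimental annotation.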
There is an opportunity for a postdoc / research scientist to run CAFA3. For position details see: https://careers.iscb.org
CAFA PI open for registration!
Target release date: December 1, 2017
Predictions deadline: April 20, 2018
Initial Evaluation: July 2018
CAFA 3 assessors will carry out evaluations following this date, as more experimentally annotated results accumulate. We expect the final evaluation to be done in or about October 2018.
To participate in CAFA PI:
- Read the rules
- Download helper files. Target proteins available starting September 30, 2016.
- User and team registration is currently open on the Synapse site. Each team must register with the competition to submit its predictions, and must have at least one certified registered user who will upload data.
- Submit predictions after CAFA PI start date, December 1, 2017, but before deadline, April 20, 2018.
- (Optional) Read An Introduction to Protein Function Prediction [PDF], written by Predrag Radivojac for computer scientists and software engineers.
Frequently Asked Questions:
Do I have to participate in the CAFA experiment to participate in the AFP meeting?
No. You just have to register for the AFP meeting.
Is there another way I can actively participate in the meeting? I do have work in the field of protein function prediction, but I do not wish to participate in CAFA.
Yes. You can submit your work for presentation in the meeting, as a poster or a talk. The CAFA experiment is only one AFP activity, and we wish to be as inclusive as possible.
Will the organizers provide training data for the predictor development?
No, we will only estimate the prediction accuracy. The accuracy of protein function prediction may depend critically on the ability of the group to extract quality data from a range of sources, integrate it and preprocess it. Functional annotations for training can be extracted from Swiss-Prot or the GO database. Molecular data is available from various additional sources, such as GEO, HPRD, PDB, PRIDE, BIND, DIP, etc.
Yeah yeah, I’ll read the rules. But can you tell me in brief how you are running the CAFA challenge?
Starting September 30, 2017, we will make about 100,000 protein sequences available. Those are proteins taken from several sources, chiefly Swiss-Prot. You are expected to annotate them using Gene Ontology terms and/or (in the case of human proteins) Human Phenotype Ontology terms. After the submission deadline, we will select some of the target proteins for scoring predictions. To participate in the CAFA challenge, you need to register your team on the Synapse site.
Which Gene Ontology terms?
All of them: Molecular Function, Biological Process, and Cellular Component.
What I don’t understand is how the true function of your proteins will be determined. Will these be made and then tested in some high throughput way? If not, how do you know that your answers are correct? Just because many different algorithms predict a particular function, it doesn’t mean that that function is the real one.
There is a natural growth of experimentally verified annotations in Swiss-Prot. So if, say, we hand out 3000 Drosophila genes, and over the lull period between the submission deadline and the evaluation date 10 (low estimate) get annotated experimentally, we already have quite a few to play with. We examined the annotation history of several genomes, and the more popular model organisms have a good growth curve with respect to experimental annotations. We will use several genomes. We estimate that a few hundred targets will come out of this pipeline.
So how many targets are we talking about?
We will release some 100,000 sequences. We expect that between the prediction submission deadline (February 2, 2017; originally January 22, 2017) and the time we begin the assessment (July 2017), a few hundred will become experimentally validated and non-trivial to predict. This is based on trends from previous years.
The “Critical Assessment” meme originated with the Critical Assessment of Structure Prediction (CASP) in the structural bioinformatics community. It was later taken up by various other life science communities that decided to run their own critical assessment challenges: CAPRI, BioCreative, CAMDA and, more recently, CAGI. The name CAFA was coined by Inbal Haleprin-Landsberg, then at Russ Altman’s lab, and the initiative was first discussed at the 2006 Automated Function Prediction meeting.