CAFA Rules


NOTE: the submission deadline has been extended to January 18, 2011 11:59pm time zone of your choice.

How to participate in CAFA?

  1. Read the rules
  2. Register for the experiment today
  3. Download target proteins and evaluation scripts and data (targets are available from September 15, 2010)
  4. Submit predictions before the deadline: January 18, 2011, 11:59pm, time zone of your choice (extended from January 15, 2011).
  5. NEW: Check your prediction file using the CAFA parser, a Perl script that verifies that your prediction file is syntactically correct. Download the file, rename it to cafa_format_checker2.pl, then run it on your prediction file(s) before you upload your predictions.
  6. Evaluation criteria for the predictions are online


Download the rules in PDF

Rules for the CAFA challenge

  1. The goal of the assessment is to evaluate automated protein function prediction algorithms in the task of predicting Gene Ontology (GO) terms for the proteins that are currently not annotated in the Swiss-Prot database (using EXP, TAS, or IC evidence codes). Evaluation will be carried out only for the Molecular Function (MFO) and Biological Process (BPO) Ontologies.

  2. Targets available: September 15, 2010. Submission deadline: January 18, 2011, 11:59pm, time zone of your choice (extended from January 15, 2011).

  3. A person becomes a group (team) manager for CAFA by registering at http://biofunctionprediction.org/. After logging in to the system, a group manager can register a team by adding the names and email addresses of the team members. The group manager does not have to be a lab's principal investigator. One person can be a member of only one group, except for the principal investigator. Detailed instructions on the group registration process are provided in the appendix.

  4. The set of targets is split into a eukaryotic track and a prokaryotic track. The prokaryotic track is exploratory only; that is, at this stage we cannot guarantee enough target proteins for evaluation from the prokaryotic track.

    AUTHOR TEAM_NAME

    MODEL 1

    KEYWORDS literature, ortholog.

    ACCURACY PR=0.75; RC=0.31

    T00001 GO:000123 0.73

    T00001 GO:000200 0.72

    T00203 GO:000234 0.91

    T00243 GO:000001 0.29

    .

    .

    END


    Figure 1. File format for CAFA predictions. Allowed delimiters are tab and space.

  5. One team may test up to 3 different prediction models (named MODEL 1, MODEL 2, and MODEL 3) during submission. Only MODEL 1 will be officially evaluated by the organizers. Thus, a team should use its best model with MODEL 1.
  6. Prediction output file format is provided in Figure 1. The file a team submits should be in text format (*.txt) or compressed (*.zip). Predictions can be uploaded and deleted any number of times. The one present in the system at the submission deadline will be used for evaluation.

    1. The AUTHOR line lists the team name that the team leader has used during registration.

    2. The MODEL line contains the number 1, 2, or 3 and corresponds to the prediction model used, as described in (5).

    3. The KEYWORDS line contains a list of keywords that describe the methodology used. It is a comma-separated list, ending with a full stop, of one or more of the following pre-specified keywords: sequence alignments, profile-profile alignments, sequence-profile alignments, phylogeny, phylogenomics, derived/predicted, sequence properties, protein interactions, gene expression, mass spectrometry, genetic interactions, protein structure, literature mining, genomic context, structure alignment, comparative model, predicted protein structure, de novo prediction, machine learning based method, genome environment, operon, ortholog, paralog, protein interaction, other functional information. If a method is not specifically covered, use “other functional information”.

    4. The ACCURACY line is optional. If present, it must contain the group's estimate of the accuracy of their method. The line contains estimated precision (PR) and recall (RC) for the highest scoring prediction ("top hit") for a randomly selected target protein. Both numbers must be in the interval [0.00, 1.00]. Two significant figures are required (e.g. 0.70 is valid but 0.7 is not). The ACCURACY line may be different in each submitted file (all targets are broken up into groups). If so, a weighted average will be used to estimate the final accuracy of the model. Weights are determined by the number of proteins from each target file that accumulate experimental terms between the submission deadline and evaluation time.

      1. How to estimate precision and recall? For each protein on which function prediction is evaluated, pick the highest scoring term among all non-zero predictions and propagate it to the root of the ontology (if two or more terms share the same top score, pick all such terms). The overlap between these predicted terms (nodes in MFO or BPO) and all experimental terms for protein i will be used to assign precision (pr_i) and recall (rc_i) for this prediction: pr_i = # correctly predicted nodes / # predicted nodes; rc_i = # true nodes correctly predicted / # true nodes. Repeat this calculation for all proteins. Calculate the precision PR and recall RC of the top hit as an average over all proteins for which pr_i and rc_i were calculated.

    5. The list of predictions contains pairs of protein targets and GO terms, each followed by the probabilistic estimate of the association (one association per line). The target name must correspond to the target ID listed in the target files (in the FASTA header for each sequence). The GO ID must correspond to a valid term in the GO release of May 15, 2010 (http://archive.geneontology.org/lite/2010-05-15/). MFO and BPO are to be combined in the prediction files, but they will be evaluated independently by the CAFA assessment team. The score must be in the interval (0.00, 1.00] and contain two significant figures. A score of 0.00 is not allowed; the team should simply not list such pairs. If the predictions are not propagated to the root of the ontology, the organizers will recursively propagate them by assigning each parent term a score that is the maximum score among its children's scores. Finally, to limit prediction file sizes, one target cannot be associated with more than 1000 terms (total, i.e. for MFO and BPO together). The organizers provide Perl code so that teams can check the validity of the format of their prediction files. Please submit only files that are verified for correctness. The organizers will not analyze submissions that are in an incorrect format. If a method does not output a score for each predicted term, but only a set of terms, the team can set the scores of all such predictions to the same value (e.g. 1.00). Such methods will be characterized by a single PR-RC point instead of a PR-RC curve.

    6. The prediction file must end with the keyword END in a line of its own.
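As an illustration of the format described in rule 6, the following Python sketch checks the overall structure of a prediction file. It is not the organizers' Perl checker and does not replace it. The function name `check_prediction_lines` and the regular expressions are assumptions made for this example: target IDs are not compared against the target files, GO IDs are only checked for the shape `GO:` plus digits (as in Figure 1), keywords are not compared against the pre-specified list, and "two significant figures" is read as exactly two decimal places.

```python
import re
from collections import Counter

# Assumed patterns for this sketch; the official Perl checker may differ.
SCORE_RE = re.compile(r"^(0\.\d{2}|1\.00)$")
ACCURACY_RE = re.compile(
    r"^ACCURACY\s+PR=(0\.\d{2}|1\.00);\s*RC=(0\.\d{2}|1\.00)$"
)
GO_RE = re.compile(r"^GO:\d+$")
MAX_TERMS_PER_TARGET = 1000


def check_prediction_lines(lines):
    """Return a list of error strings; an empty list means nothing was flagged."""
    errors = []
    rows = [ln.strip() for ln in lines if ln.strip()]  # ignore blank lines
    if len(rows) < 4:
        return ["file too short: need AUTHOR, MODEL, KEYWORDS and END"]
    if not re.match(r"^AUTHOR\s+\S+", rows[0]):
        errors.append("line 1 must be 'AUTHOR <team name>'")
    if not re.match(r"^MODEL\s+[123]$", rows[1]):
        errors.append("line 2 must be 'MODEL 1', 'MODEL 2' or 'MODEL 3'")
    if not (rows[2].startswith("KEYWORDS ") and rows[2].endswith(".")):
        errors.append("line 3 must be 'KEYWORDS ...' ending with a full stop")
    if rows[-1] != "END":
        errors.append("last line must be the keyword END")

    body = rows[3:-1]
    if body and body[0].startswith("ACCURACY"):  # the ACCURACY line is optional
        if not ACCURACY_RE.match(body[0]):
            errors.append("malformed ACCURACY line")
        body = body[1:]

    counts = Counter()
    for row in body:
        fields = row.split()  # accepts tab or space as delimiter
        if len(fields) != 3:
            errors.append(f"expected 3 fields: {row!r}")
            continue
        target, term, score = fields
        if not GO_RE.match(term):
            errors.append(f"bad GO ID: {row!r}")
        if not SCORE_RE.match(score) or score == "0.00":
            errors.append(f"score must be in (0.00, 1.00], two decimals: {row!r}")
        counts[target] += 1

    for target, n in counts.items():
        if n > MAX_TERMS_PER_TARGET:
            errors.append(f"{target} has {n} terms (limit {MAX_TERMS_PER_TARGET})")
    return errors
```

A file could be checked with `check_prediction_lines(open("predictions.txt"))`, where `predictions.txt` is a placeholder file name; an empty result only means that this sketch found no structural problem.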

  7. The CAFA assessment team will assess prediction models based on the top k hits (k ≤ 100). Details on how the top hit is calculated can be found above in rule 6, item 4. Similarly, the precision-recall point for the second best hit will correspond to the second highest scoring predicted term (allowed to be on the path of the top hit towards the root). This will create a precision-recall curve. Another metric will be the Jaccard coefficient (the size of the intersection divided by the size of the union between two sets of terms, predicted and experimental) for the top k hits. The assessment team will experiment with other assessment strategies as well. For example, the terms will be weighted by their information content. In addition, the prediction performance will be assessed for each level of the GO ontologies. The efficacy of the metrics will be included in the final report.
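The calculations named in the rules (propagation of scores towards the root, precision and recall of the top hit, and the Jaccard coefficient) can be sketched in Python as follows. This is a minimal illustration and not the assessment team's evaluation code. The toy ontology, the term names, and all function names are hypothetical. The sketch also assumes that a term keeps its own score when that score is higher than any propagated one, that experimental terms are propagated to the root in the same way as predictions, and that the root itself counts as a node.

```python
def ancestors(term, parents):
    """Return the term together with all of its ancestors up to the root."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def propagate_scores(scores, parents):
    """Assign each ancestor the maximum score found among its descendants."""
    out = {}
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            out[anc] = max(out.get(anc, 0.0), score)
    return out


def top_hit_precision_recall(scores, true_terms, parents):
    """Precision and recall of the highest scoring term(s) for one protein."""
    best = max(scores.values())
    predicted = set()
    for term, score in scores.items():
        if score == best:  # ties: keep every term that shares the top score
            predicted |= ancestors(term, parents)
    truth = set()
    for term in true_terms:
        truth |= ancestors(term, parents)
    correct = predicted & truth
    return len(correct) / len(predicted), len(correct) / len(truth)


def jaccard(predicted, truth):
    """Size of the intersection divided by the size of the union."""
    union = predicted | truth
    return len(predicted & truth) / len(union) if union else 0.0


# Hypothetical toy ontology: B is a child of A; A and C are children of root.
parents = {"A": {"root"}, "B": {"A"}, "C": {"root"}}
scores = {"B": 0.9, "C": 0.4}   # predictions for one protein
experimental = {"A"}            # experimental annotation for that protein

pr_i, rc_i = top_hit_precision_recall(scores, experimental, parents)
print(pr_i, rc_i)
```

In this toy case the top hit B propagates to {B, A, root} and the experimental term A propagates to {A, root}, so two of the three predicted nodes are correct and both true nodes are recovered. PR and RC for a model would then be the averages of pr_i and rc_i over all evaluated proteins.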

  8. As in any primary endeavor, we do not expect these rules to deal with all contingencies and issues that may arise during the CAFA challenge. Any problems that come up will be ruled upon by the CAFA assessment team and organizers. All such rulings are final.

  9. Appeals: a prediction group can appeal its prediction evaluation to the CAFA assessment team, and to the CAFA organizers. Appeals will be discussed and ruled upon by the assessment team and the organizers. All such rulings are final.




AFP 2011 Organizing Committee



Appendix: Instructions for managing CAFA accounts



Step 1 – Open an account

  • To participate in the CAFA challenge you must have an account at http://biofunctionprediction.org/. Anybody can create an account by going to the web site above and clicking on "Create new account" in the left bar.

  • Upon registering, you will receive an email with your password and a link to one-time access to your account (please check spam/junk folders if email not received immediately; otherwise please contact AFP organizers afpcafa2011@gmail.com).


Step 2 – Choosing your group's manager and members

  • The group manager should have all of the group's user names. Thus, each group member must register as well prior to being added to the group.


Step 3 – Creating the group

  • The group manager, once logged in to the system, should use the "Create a CAFA group" option to create a CAFA group and provide a name for it. The group manager will declare the principal investigator for the project.

  • Upon successfully creating a group, a link with your group name will appear in the left bar.

  • Clicking on the group name will present a menu to manage your group.


Step 4 – Manage your group

  • The group manager should add all his/her members to the group by choosing the "Add members" option after clicking on the link "Manage group" from the group menu.

  • Using the "Send invitation" link from the group menu is optional. Once their user names have been added, users have automatic access to the group when logged in, even if they did not receive an invitation.

  • Click on "Admin: Create" to give a group member permission to upload files. By default members can only view the uploaded files.


Step 5 – Upload prediction files

  • Use "Upload predictions for model 1" (or model 2 or 3) only once to create your models.



Restrictions:

  • A person can belong to only one CAFA group

  • A principal investigator may lead more than one group

  • A group can upload files for three models only
