C. elegans SAGE Data

SAGE Data from "Changes in Gene Expression Associated with Developmental Arrest and Longevity in Caenorhabditis elegans"

SAGE data have been derived for two populations of C. elegans: a dauer population and a population at mixed stages. For more information about the SAGE protocol, see www.sagenet.org.

Update: The current analysis differs slightly from the data presented in Jones et al., "Changes in Gene Expression Associated with Developmental Arrest and Longevity in Caenorhabditis elegans".

Specifically, the following dauer-specific genes listed in Table 3 have changed:
T21D12.11 now has only 1 dauer and 0 mixed tags (p=0.24), instead of 26 dauer and 0 mixed previously.
T26C11.2 now has no associated SAGE tags, as opposed to 53 dauer and 0 mixed previously.
W05F2.1 now has a total of 6 attributable tags (dauer 4, mixed 2).
Y2H9A.3 now has only a single tag (dauer 1, mixed 0), instead of 31 dauer and 0 mixed previously.
Thanks to Garth Patterson for helping with these corrections.

Data from this experiment are available here. The fields in this tab-delimited file are:

  1. The name of the gene
  2. An observed SAGE tag that matches the gene's conceptual cDNA sequence. Matches were determined by building a dataset of the cDNA sequences for all C. elegans genes. To obtain UTR sequences, I used EST data where possible; otherwise I added genomic sequence to the 5' and 3' ends. The amount added was calculated to cover 99% of all possible UTRs, determined empirically, without allowing known transcribed sequences to overlap. I then calculated, for each gene, all of its possible SAGE tags.
  3. Number of dauer observations
  4. Number of mixed observations
  5. Significance (p-value from the G-test).
  6. The gene to which the tag maps unambiguously. An unambiguous match is determined as follows. If only one transcript contains a particular tag, that tag can be assigned unambiguously to that gene. Many tags, however, are found in more than one transcript. If, for example, a tag matches 3 transcripts but is the most 3' tag in only one of them, the tag is assigned unambiguously to that gene. If a tag matches two genes and is the most 3' tag in both, a specific gene assignment is not possible, and the data for that tag are of no real use. For example:

    "B0047.3"       "CATGGGGCTGGAGT"        1       2       0.544712   "B0047.3"
    
    We see that this tag can be assigned unambiguously to B0047.3, so the dauer and mixed counts apply to B0047.3. In the following case,
    "B0205.6"       "CATGCCTCTGAAAT"        72      30      4.3e-05 "B0205.7" "aminotransferase"
    
    Although the transcript of B0205.6 does contain this tag sequence, the tag actually maps unambiguously to B0205.7, so the dauer and mixed counts in this case have nothing to do with B0205.6. In the following case,

    "B0205.1"       "CATGGCCTAGAAAC"        2       4       0.391668
    
    We have a tag that could belong to B0205.1 but cannot be mapped unambiguously, so there is little we can do with this observation at present.
  7. This field contains a brief identification of the gene in field 1, if available.
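
The tag enumeration in field 2 and the assignment rule in field 6 can be sketched roughly as follows. This is only an illustration, not the code used for the analysis. It assumes that a tag is the CATG site plus the 10 bases that follow it (inferred from the 14-base example tags above), that each gene has a single transcript, and that a tag which is not the most 3' tag in any of several transcripts is left unassigned (a case not covered above). The gene names and sequences are made up.

```python
ANCHOR = "CATG"   # assumed anchoring site, taken from the example tags
TAG_LEN = 14      # assumed tag length, taken from the example tags


def possible_tags(transcript):
    """Return every full-length tag in a transcript, ordered 5' to 3'."""
    seq = transcript.upper()
    tags = []
    pos = seq.find(ANCHOR)
    while pos != -1:
        tag = seq[pos:pos + TAG_LEN]
        if len(tag) == TAG_LEN:  # skip sites too close to the 3' end
            tags.append(tag)
        pos = seq.find(ANCHOR, pos + 1)
    return tags


def assign_tag(tag, transcripts):
    """Return the gene a tag maps to unambiguously, or None."""
    containing = []   # genes whose transcript contains the tag
    most_3prime = []  # genes in which the tag is the most 3' tag
    for gene, seq in transcripts.items():
        tags = possible_tags(seq)
        if tag in tags:
            containing.append(gene)
            if tags[-1] == tag:
                most_3prime.append(gene)
    if len(containing) == 1:      # tag occurs in one transcript only
        return containing[0]
    if len(most_3prime) == 1:     # most 3' tag in exactly one transcript
        return most_3prime[0]
    return None                   # no specific assignment possible


# Hypothetical toy transcripts, for illustration only.
T1 = "CATG" + "A" * 10
T2 = "CATG" + "C" * 10
T3 = "CATG" + "G" * 10
T4 = "CATG" + "T" * 10
transcripts = {
    "geneA": "TT" + T1 + "TT",        # T1 is the most 3' tag
    "geneB": "TT" + T1 + "TT" + T2,   # T1 is internal, T2 is most 3'
    "geneC": "TT" + T3 + "TT",        # T3 is the most 3' tag
    "geneD": T4 + "TT" + T3,          # T3 is also the most 3' tag here
}

for tag in (T1, T2, T3, T4):
    print(tag, assign_tag(tag, transcripts))
```

In this toy set, T1 goes to geneA because it is the most 3' tag only there, T2 and T4 each occur in a single transcript, and T3 cannot be assigned because it is the most 3' tag in both geneC and geneD.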
The paper discusses datasets of dauer-specific and mixed-stage-specific genes. The set of 358 dauer-specific genes is listed here. The set of 533 mixed-stage (dauer exclusive) genes is listed here.
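
As a rough illustration of the file layout described in the field list above, the following sketch reads rows and sorts them into the three cases shown in the examples. It assumes that fields are separated by tabs, that text fields are double-quoted, and that fields 6 and 7 are simply missing from a row when they do not apply, as the example rows suggest. The names `SageRow`, `parse_rows` and `status` are made up for this sketch.

```python
import csv
import io
from typing import NamedTuple, Optional


class SageRow(NamedTuple):
    gene: str                      # field 1
    tag: str                       # field 2
    dauer: int                     # field 3
    mixed: int                     # field 4
    significance: float            # field 5
    assigned_gene: Optional[str]   # field 6, None when ambiguous
    description: Optional[str]     # field 7, None when unavailable


def parse_rows(handle):
    """Parse tab-delimited rows, padding missing trailing fields."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quotechar='"'):
        if len(fields) < 5:
            continue  # skip blank or incomplete lines
        fields = fields + [""] * (7 - len(fields))
        rows.append(SageRow(
            fields[0], fields[1],
            int(fields[2]), int(fields[3]), float(fields[4]),
            fields[5] or None, fields[6] or None,
        ))
    return rows


def status(row):
    """Say whether the counts in a row apply to the gene in field 1."""
    if row.assigned_gene is None:
        return "ambiguous"
    if row.assigned_gene == row.gene:
        return "applies"
    return "assigned elsewhere"


# The three example rows from the text, rebuilt with tab separators.
SAMPLE = "\n".join([
    '"B0047.3"\t"CATGGGGCTGGAGT"\t1\t2\t0.544712\t"B0047.3"',
    '"B0205.6"\t"CATGCCTCTGAAAT"\t72\t30\t4.3e-05\t"B0205.7"\t"aminotransferase"',
    '"B0205.1"\t"CATGGCCTAGAAAC"\t2\t4\t0.391668',
])

rows = parse_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row.gene, row.tag, row.dauer, row.mixed, status(row))
```

For the real file, an open file handle can be passed to `parse_rows` in place of the `io.StringIO` wrapper. Rows whose status is "assigned elsewhere" or "ambiguous" should not be counted towards the gene in field 1.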