An exciting set of papers was published in Nature last week: the culmination of the ENCODE project results. This project has been in the works since 2007 with the goal of characterizing all of the important elements of the human genome (where each gene is, where certain molecules bind, how the genomic DNA is organized, etc.). Remember, the genome is a sequence of DNA that provides instructions for each of our cells. A person has the same genomic DNA in every cell, but it is differences in how parts of that DNA are organized and how it interacts with other molecules in the cell that enable us to have skin cells and lung cells and muscle cells that look and function very differently.
ENCODE stands for ENCyclopedia Of DNA Elements, and there are separate projects for other species like the fruit fly and the nematode worm. The human version of the ENCODE project has been ongoing for five years, and the consortium has now published its initial findings. I cannot express how much data was produced from these studies; it’s overwhelming! Thus, these publications (and there are many) are just the beginning, and many more studies will be done using this data. Thankfully, Nature published a paper that gives a broad summary of some of the findings from this massive collection of data and analyses. This is the paper I will be discussing today: “An integrated encyclopedia of DNA elements in the human genome” by The ENCODE Project Consortium. This paper and many other ENCODE publications are freely available to everyone at http://www.encodeproject.org. So if anything here sounds interesting to you and you want to learn more, you can look up all the details. The data from the project is freely available too! So if you’re really ambitious, or have a favorite gene that you want to check out, you can do so with just your computer and an internet connection.
Problem/Question: In the past decade we uncovered the full sequence of the human genome, but that in itself did not tell us the role that each letter or part of the DNA sequence has. We found that much more data on how the DNA is organized and what interacts with it in a cell is needed to better understand the complex workings of the genome. Thus, the ENCODE Project was born.
Results: First let me try to give you an impression of the sheer amount of data collected for this project. There were 1,640 datasets (these varied in size, but many are very large and detailed) and 147 different types of cells were examined. This is a lot of data, and while it can be analyzed in broad strokes, getting all of the important information out of it will also require examining it in finer detail. Many of the papers published last week take the broad view, while some of them, and many future papers, look at the finer details.
One thing about the ENCODE project that is beneficial for scientists (and whose results will benefit others) is the cohesiveness and compatibility of the datasets. Previous studies have covered topics that the ENCODE project addresses, but different labs have different interests, and while their data are by no means bad, it can be very difficult to compare data from two experiments done in different ways. The ENCODE Project set up guidelines to ensure that its data met strict criteria and could be compared freely, even though many, many labs were working on different aspects of the project. This produces a huge reserve of information that can be used in different ways and gives a very detailed description of exactly what is happening and where in the genome.
The major functional regions of the genome are genes: sequences of DNA that code for a protein. The protein that is produced performs some function in the cell, and a wide variety of proteins are needed to keep a cell alive and allow it to produce more cells. The ENCODE Project went through the whole genome and, using past knowledge and computer predictions, found 20,687 protein-coding genes. This is pretty close to our estimates from the past few years and so is no surprise. What is more interesting, to me at least, is that they identified an average of 6.3 alternatively spliced transcripts per gene. Think of a gene as a series of letters that make a word, such as carrot. From the letters C, A, R, R, O and T, we can spell carrot, car, cat, cot, art, and rot, all using those letters in that order. That’s similar to how transcripts from a gene work. The gene has exons (regions that code for a part of a protein) that are separated by introns (DNA sequence that does not code for a protein). The exons are like our letters in CARROT. A protein reads along the gene and makes a copy of it in RNA (a molecule similar to DNA). This RNA copy has pieces removed, including the introns and some exons, to form a transcript that codes for a specific protein. Many RNA copies can be made from a single gene, and each copy can be cut and pasted together differently to make different proteins, all from the same gene sequence. Stepping back a second, what this means is that while we have about 20,687 protein-coding genes in our genome, those genes can give rise to over 125,000 different transcripts (20,687 × 6.3 is roughly 130,000), and so to far more kinds of proteins than the gene count alone would suggest.
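The CARROT analogy and the transcript arithmetic can be sketched in a few lines of Python. This is only an illustration of the analogy, not a model of real splicing: the function name `is_ordered_subset` and the extra test word TAR are my own, and treating each letter as one exon is a simplifying assumption.

```python
def is_ordered_subset(word: str, gene: str) -> bool:
    """Return True if the letters of `word` appear in `gene` in the same order."""
    remaining = iter(gene)
    # `letter in remaining` consumes the iterator up to the first match,
    # so each later letter can only match further along the gene.
    return all(letter in remaining for letter in word)


GENE = "CARROT"  # each letter stands in for one exon (illustrative assumption)

# TAR is included as a counterexample: its letters are out of order.
candidates = ["CARROT", "CAR", "CAT", "COT", "ART", "ROT", "TAR"]
spellable = [w for w in candidates if is_ordered_subset(w, GENE)]
print(spellable)

# Rough transcript count from the two figures quoted in the text.
genes = 20_687
transcripts_per_gene = 6.3
total_transcripts = round(genes * transcripts_per_gene)
print(total_transcripts)
```

TAR is rejected because exons, like the letters here, can be skipped but not reordered.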
There’s a lot of information packed into our genome sequence, yet genes do not even make up half of it. Full-length genes, from tip to tail and including introns, cover 39.54% of the genome. Exons, the regions that code directly for proteins, only cover 2.94%. So what is the rest of the genome doing? Gene regulation seems to be the answer. Our genome has the information needed to make many, many proteins, but if all of those proteins were produced at the same time, our cells would be a mess! Instead, they have very fine-tuned means of turning genes on and off and controlling what transcripts they produce at any given time.
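To put those two percentages in perspective, here is a quick back-of-the-envelope calculation. The only inputs are the 39.54% and 2.94% figures quoted above; the variable names are mine, and the subtraction assumes that the exon figure lies entirely within the gene figure.

```python
GENE_COVERAGE_PCT = 39.54  # full-length genes, introns included
EXON_COVERAGE_PCT = 2.94   # protein-coding exons only

# Share of the genome that lies outside any full-length gene.
outside_genes_pct = 100 - GENE_COVERAGE_PCT

# Share inside genes but not in exons (assumes exons sit within the gene span).
non_exon_gene_pct = GENE_COVERAGE_PCT - EXON_COVERAGE_PCT

# Exons as a fraction of the gene span itself.
exon_share_of_genes_pct = 100 * EXON_COVERAGE_PCT / GENE_COVERAGE_PCT

print(f"Outside genes:            {outside_genes_pct:.2f}%")
print(f"Within genes, not exons:  {non_exon_gene_pct:.2f}%")
print(f"Exons as share of genes:  {exon_share_of_genes_pct:.2f}%")
```

In other words, even within the stretches we call genes, only a small slice directly codes for protein.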
There are many ways that genes are regulated and I will just touch on a few here. One way that genes are regulated is by the binding of special proteins called transcription factors to the DNA. These proteins often interact with the proteins that make RNA copies of the gene and control how fast or slow these copies can be made. The ENCODE project found many sites throughout the genome where different transcription factors bind and they also noticed that a transcription factor will frequently bind with one or more other transcription factors near a gene. By knowing which transcription factors bind where, we can better understand how and when they are regulating a specific gene.
Genes are also regulated by the organization of the DNA itself inside the cell. I will mention two ways that this is examined: through histone modifications and chromosome interacting regions (don’t worry, I’ll explain both below). Histones make up little balls of proteins that the DNA is wrapped around. Chemical changes to these proteins can change how and where the DNA is wrapped. DNA wrapped around histones can be compared to one of those accordion-style file folders with all of the different sections (genes) that can be expanded or collapsed. When certain histone modifications are present, it is like having an open section of the file folder. Any papers or documents in that section can be easily looked at and accessed; the DNA is wrapped loosely and the gene is accessible to protein interaction and is often active (being read and copied into RNA). When other histone modifications are present, the DNA is wrapped up very tightly and is inaccessible, like when the file folder is squished closed. This prevents many proteins from interacting with that region of DNA. There are other modifications to histones with different effects on the DNA that wraps around them, and the ENCODE project has marked where these modified histones are along the genome in different cell types (since they can vary greatly). Chromosome interacting regions are a little easier to understand, though we have only recently begun designing methods to find and study them. These are regions of the genome that can be close together or far apart, but where one DNA region interacts with another (usually through attached proteins that stick them together). We’ve recently found that these interactions, especially long-distance ones that are hard to identify, are very important for gene regulation, and the ENCODE project is a good step towards a better understanding of where these interactions are and what they may do.
The last bit of data I want to mention here is disease-related SNPs (single nucleotide polymorphisms, or changes to a single letter in a DNA sequence). Many, many studies have been done to try to identify changes in the DNA that cause or increase the risk of getting a disease. With the reduced price of genome sequencing, this is becoming more and more popular. One of the problems we’ve had in the past is that changes were found in a group of people with a certain disease that are not found in healthy persons, but those changes aren’t in any known genes. Thus, it has been hard to determine what effect those changes have on a person. With the in-depth characterization of the genome done by the ENCODE project, we can now try to piece together what some of these SNPs are doing. Maybe a SNP is in a transcription factor binding site (these are usually in front of genes) and while it won’t change the protein made by the gene, it may affect how or when the transcription factor regulating that gene can bind. Perhaps it prevents binding, and so without that transcription factor, the gene is never turned on and no protein is made. This can cause a major defect in a cell and could easily be related to a disease. The list of possibilities goes on, and as we understand the function and importance of each region of the genome better and better, we will get better at identifying how disease-related SNPs cause the disease.
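The basic idea of looking up what a SNP overlaps can be sketched as a simple interval check. Everything below is made up for illustration: the `Region` type, the coordinates, and the labels TF_A and gene_X are hypothetical and do not come from ENCODE data or any ENCODE tool.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    label: str


def regions_containing(chrom: str, pos: int, regions: List[Region]) -> List[Region]:
    """Return every annotated region that contains the given position."""
    return [r for r in regions if r.chrom == chrom and r.start <= pos < r.end]


# Hypothetical annotations: a binding site just in front of a gene.
annotations = [
    Region("chr1", 1000, 1020, "TF_A binding site"),
    Region("chr1", 5000, 9000, "gene_X"),
]

# Hypothetical SNP positions: one inside the binding site, one in neither region.
in_site = regions_containing("chr1", 1010, annotations)
in_nothing = regions_containing("chr1", 3000, annotations)

print([r.label for r in in_site])
print([r.label for r in in_nothing])
```

A SNP that lands in the first list is a candidate for affecting regulation rather than the protein itself; one that lands in neither list is where richer annotation helps most.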
Big Picture: The results I have mentioned here are just the very tip of the iceberg of ENCODE project results. I hope they at least give you a sense of what this project is doing and why it is important and exciting. This is also by no means the end of the project; more data will be gathered by other labs, and many people will reference these data for their favorite gene or region of interest and use them to learn more about the deep complexity of our genome. It is papers like this that put me in awe again at just how much information can be stored in a sequence of four letters and the beautiful complexity of how this information is used to make a living, breathing human being.