Mice and Men: The Mouse ENCODE Data Shows Just How Similar We Really Are

Background: ENCODE stands for the Encyclopedia of DNA Elements, and the project’s goal is to identify and characterize all of the functional parts and pieces of the human genome. This goal has also been extended to model organisms, including flies and worms (modENCODE), and now mice. Much of the initial progress on these projects has been the collection of a vast amount of data, from genome sequences and gene expression data (which genes are turned on or off in certain cells or at certain times) to chromatin structure (the 3D physical structure of genomic DNA in a cell) and protein binding sites (where along the genome a specific protein binds, and when). This bulk data provides a basis for large-scale observations, such as what I’ll be discussing today, and also for future analyses focused on more specific questions. The ENCODE data for humans and other organisms is by no means perfect, nor is it an end-all where we can say “Ok, we’ve got all the data we’ll ever need, now we just have to sit down and analyze it.” But it is a nice resource for scientists, and because the data itself and many of the journal articles describing it are open access (nematode worm, fruit fly, human, mouse), anyone can go looking for their favorite gene or see the results for themselves.

Today’s article by Cheng et al. looks at a portion of the mouse ENCODE data that addresses transcription factor binding. Transcription factors (TFs for short) are proteins that bind to the DNA and regulate when certain genes turn on and off. This controls not only what genes are on or off in a given cell, but by regulating how long each gene is on, it also determines the amount of gene product present in the cell. In animals such as mice and humans, there are many types of cells that function in different ways. Gene expression regulation is key to determining what a cell looks like and does, and keeping it alive and healthy. Thus, the regulation of most genes is rather complex and requires multiple TFs. This study looks at TF binding sites in the mouse genome and how similar they are to those in the human genome.

Problem/Question: We know that humans and mice share many of the same genes, but what about how those genes are turned on and off at the proper times?

Results: The authors examined 34 TFs in a number of mouse and human cell lines (cells taken from specific organs or tissues in an animal and grown separately in the lab), producing 120 data sets for analysis. They found that, broadly, the primary binding sites of these TFs are conserved between human and mouse, as are their positions relative to the genes they regulate. Some TFs work by binding close to the gene, while others can bind very far away yet still perform regulatory functions. Each TF’s preference for binding near or far is conserved, as is the chromatin state around each TF. Chromatin states are complex and not yet fully understood, but essentially they reflect how available a region is for binding by specific proteins. It’s kind of like a train with different types of cars along its length. Some cars are just flatbeds and so can hold almost anything that fits on them; others are designed for specific contents like liquids. Any liquid can fit in a liquid car, but something like cattle would not fit. Likewise, you wouldn’t want to try holding liquid in a cattle car, as it would leak out over time. Each chromatin state has specific proteins associated with it (like a specific train car) that determine how the DNA is structured and what other proteins can access it (what it can hold). The finding that the TFs examined bind within the same chromatin states in humans and mice suggests that these chromatin states are similar and likely have the same functions in both species.

While the DNA bases that each TF contacts when it binds are highly conserved, only 50% of the binding sites are in conserved locations (judged by the sequence 50 bases to each side of the binding site). The best-conserved sites were in promoter regions, those close to the gene that is being regulated. This is not surprising, since we know from previous studies that promoter regions are better conserved between species than regulatory regions further from the genes they regulate. These more distant regions are likely more amenable to change and thus may account for some of the differences between humans and mice. It is also possible that the same binding sites exist but have moved, so that their surrounding sequence is different. This is a common occurrence and would be expected for some distant TF binding sites.
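To make the idea of "conserved location" a little more concrete, here is a minimal Python sketch of comparing the sequence on each side of a binding site between two species. Everything in it is an assumption for illustration: the toy sequences, the site coordinates, the helper names `flanks` and `percent_identity`, and the use of simple position-by-position identity on sequences that are already lined up. The actual study works from whole-genome alignments, which this does not attempt to reproduce.

```python
def flanks(seq, start, end, width=50):
    """Return up to `width` bases on each side of the site seq[start:end]."""
    left = seq[max(0, start - width):start]
    right = seq[end:end + width]
    return left, right


def percent_identity(a, b):
    """Percent of matching positions between two equal-length, pre-aligned strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy, made-up sequences: the same 7-base site sits at positions 8..15 in both.
human = "ACGTACGT" + "TGACTCA" + "GGCCTTAA"
mouse = "ACGTACCT" + "TGACTCA" + "GGCCTTAA"
site_start, site_end = 8, 15

# A window of 8 is used here only because the toy sequences are short;
# the article describes 50 bases on each side.
h_left, h_right = flanks(human, site_start, site_end, width=8)
m_left, m_right = flanks(mouse, site_start, site_end, width=8)

left_identity = percent_identity(h_left, m_left)
right_identity = percent_identity(h_right, m_right)
print(f"left flank identity:  {left_identity:.1f}%")
print(f"right flank identity: {right_identity:.1f}%")
```

In this toy case the binding site itself is identical in both sequences, while one flank differs by a single base. A site that had moved to a different neighborhood would keep its core but show low identity in both flanks.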

Binding sites for TFs located far from the gene they are regulating are expected to be more strongly conserved if they are used to control expression in a number of cell types. Essentially, if you change something that only affects one cell type, it is less likely to be problematic than if you change something that affects 5 cell types. When the authors asked whether this holds true between humans and mice, they found that indeed, it does. TF binding sites in regions that are open for binding in many cell types tend to be better conserved than TF binding sites in regions that are only open in one cell type. To verify that this openness corresponds to the expression of genes controlled by the open region, they tested where the gene is turned on in mouse embryos. In most cases, these open regions caused the gene to be turned on (expressed) in multiple cell types, further confirming their finding. Similarly, the more TFs that bind to a region, the more conserved it is likely to be between human and mouse. However, when the authors looked at specific protein pairs that tend to bind together, they found that some bound the same places in human and mouse, and some bound different places. The latter group would be good candidates for studying differences between humans and mice.

Lastly, the authors looked at single nucleotide polymorphisms (SNPs) and how they relate to the TF binding sites. Many studies on humans and other animals try to identify SNPs that are correlated with certain diseases. In the simplest cases, they could find that a single DNA base change at a specific position causes a certain disease. Now, there are very, very few cases this simple; the vast majority of diseases result from a combination of multiple SNPs or other genomic changes, and possibly environmental variables (living conditions, diet, etc.). When the authors looked at SNPs associated with a phenotype (SNPs that can cause a physical difference from the norm), they found that more were associated with TF binding sites that are conserved between humans and mice than with those unique to either species. This indicates again that these conserved binding sites are very important and need to be strongly maintained to prevent their malfunction or destruction.

Big Picture: While earlier studies have shown us how similar we are to mice on a genomic scale (a large number of our genes are shared), this new data goes beyond that by showing that even the regulation of when those genes are on and off is conserved. The differences identified in this study will be great places to look when trying to identify what separates us and makes a mouse a mouse and a human a human. The association of TF binding sites with SNP data can also inform us with regard to disease.

While this study certainly produced a ton of data and some very interesting results, I would like to remind everyone that it is still a very limited data set. Their 34 TFs are more than have been examined to this extent before, but there are thousands of potential TFs in the human genome. Additionally, while they examined a variety of cell lines, these represent only a small portion of the cell types present in our bodies, and their growth and maintenance in a lab (and no longer in a person) may affect how they act. Believe me, I’m not condemning these studies; I think they are very important and informative. But we need to be careful about how much we generalize the findings from studies like these, and not automatically assume they represent the full organism. We are still just scratching the surface, and much more work lies ahead to fully grasp the big picture of how a human or mouse works.

 

The ENCODE Project

An exciting set of papers was published in Nature last week, the culmination of the ENCODE project results. This project has been in the works since 2007 with the goal of characterizing all of the important elements of the human genome (where each gene is, where certain molecules bind, how the genomic DNA is organized, etc.). Remember, the genome is a sequence of DNA that provides instructions for each of our cells. A person has the same genomic DNA in every cell, but it is differences in how parts of that DNA are organized and how it interacts with other molecules in the cell that enable us to have skin cells and lung cells and muscle cells that look and function very differently.

ENCODE stands for ENCyclopedia Of DNA Elements, and there are separate projects for other species like the fruit fly and the nematode worm. The human version of the ENCODE project has been ongoing for five years, and the consortium has now published its initial findings. I cannot express how much data was produced from these studies; it’s overwhelming! Thus, these publications (and there are many) are just the beginning, and many more studies will be done using this data. Thankfully, Nature published a paper that gave a broad summary of some of the findings from this massive collection of data and analyses. This is the paper I will be discussing today: “An integrated encyclopedia of DNA elements in the human genome” by The ENCODE Project Consortium. To everyone’s benefit, this paper and many other ENCODE publications are freely available at http://www.encodeproject.org. So if anything here sounds interesting to you and you want to learn more, you can look up all the details. The data from the project is freely available too! So if you’re really ambitious, or have a favorite gene that you want to check out, you can do so with just your computer and an internet connection.

Problem/Question: In the past decade we uncovered the full sequence of the human genome, but that in itself did not tell us the role that each letter or part of the DNA sequence has. We found that much more data on how the DNA is organized and what interacts with it in a cell is needed to better understand the complex workings of the genome. Thus, the ENCODE Project was born.

Results: First let me try to give you an impression of the sheer amount of data collected for this project. There were 1,640 datasets (these varied in size, but many are very large and detailed), and 147 different types of cells were examined. This is a lot of data, and while it can be analyzed in broad strokes, getting all of the important information out of it will also require examining it in finer detail. Many of the papers published last week take the broad view; some of them, and many future papers, look at the finer details.

One thing about the ENCODE project that is beneficial for scientists (and whose results will benefit others) is the cohesiveness and compatibility of the datasets. Studies have been done previously on topics that are covered by the ENCODE project, but different labs have different interests, and while their information is by no means bad, it can be very difficult to compare data from two different experiments done in different ways. The ENCODE Project established guidelines to ensure that its data met strict criteria and could be compared freely, even though many, many labs were working on different aspects of the project. This produces a huge reserve of information that can be used in different ways and gives a very detailed description of exactly what is happening and where in the genome.

The major functional regions of the genome are genes: sequences of DNA that code for a protein. The protein that is produced performs some function in the cell, and a wide variety of proteins are needed to keep a cell alive and allow it to produce more cells. The ENCODE Project went through the whole genome and, using past knowledge and computer predictions, found 20,687 protein-coding genes. This is pretty close to our estimates from the past few years and so is no surprise. What is more interesting, to me at least, is that they identified an average of 6.3 alternatively spliced transcripts per gene. Think of a gene as a series of letters that make a word, such as carrot. From the letters C, A, R, R, O and T, we can spell carrot, car, cat, cot, art, and rot, all from those letters in that order. That’s similar to how transcripts from a gene work. The gene has exons (regions that code for a part of a protein) that are separated by introns (DNA sequence that does not code for a protein). The exons are like our letters in CARROT. A protein reads along the gene and makes a copy of it in RNA (a molecule similar to DNA). This RNA copy has pieces removed, including the introns and some exons, to form a transcript that codes for a specific protein. Many RNA copies can be made from a single gene, and each of these copies can be cut and pasted together differently to make different proteins, all from the same gene sequence. Stepping back a second, what this means is that while we have 20,687 genes in our genome, from those genes we make over 125,000 different proteins.
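For readers who like to see the analogy run, here is a small Python sketch. It checks which words can be spelled from the letters of CARROT without changing their order (the way a transcript keeps its exons in order while skipping some), and then does the back-of-the-envelope multiplication behind the "over 125,000" figure. The function name `spelled_in_order` and the extra test word "tar" are my own additions for illustration, and multiplying genes by the average number of transcripts is only a rough estimate.

```python
def spelled_in_order(word, letters):
    """True if `word` can be spelled by picking letters from `letters` left to right."""
    remaining = iter(letters.lower())
    # `ch in remaining` advances the iterator, so each matched letter must
    # come after the previously matched one.
    return all(ch in remaining for ch in word.lower())


GENE = "CARROT"
candidates = ["carrot", "car", "cat", "cot", "art", "rot", "tar"]
results = {word: spelled_in_order(word, GENE) for word in candidates}

for word, ok in results.items():
    print(f"{word:7s} {'yes' if ok else 'no (letters out of order)'}")

# Rough estimate: genes times average transcripts per gene (numbers from the article).
genes = 20_687
avg_transcripts_per_gene = 6.3
estimated_transcripts = genes * avg_transcripts_per_gene
print(f"about {round(estimated_transcripts):,} transcripts")
```

"tar" fails because its letters would have to be read backwards, which is the point of the analogy: exons can be skipped, but not shuffled. The product comes out to roughly 130,000, consistent with "over 125,000".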

There’s a lot of information packed into our genome sequence, yet genes do not even make up half of the DNA sequence of our genome. Full-length genes, from tip to tail and including introns, cover 39.54% of the genome. Exons, the regions that code directly for proteins, only cover 2.94%. So what is the rest of the genome doing? Gene regulation seems to be the answer. Our genome has the information needed to make many many proteins, but if all of those proteins were produced at the same time, our cells would be a mess! Instead, they have very fine-tuned means of turning genes on and off and controlling what transcripts they produce at any given time.
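The percentages above are easier to appreciate with a little arithmetic. The following Python sketch uses only the two figures quoted in this post and assumes that the exon percentage is contained within the full-length gene percentage (which is how I read them); the variable names are mine.

```python
# Figures quoted in the article, as percentages of the whole genome.
gene_pct = 39.54   # full-length genes, introns included
exon_pct = 2.94    # exons only

# Assumption: exons lie inside the gene spans, so they can be subtracted.
outside_genes_pct = 100 - gene_pct
within_gene_non_exon_pct = gene_pct - exon_pct
exon_share_of_gene_span = 100 * exon_pct / gene_pct

print(f"outside genes entirely:       {outside_genes_pct:.2f}% of the genome")
print(f"inside genes but not in exons: {within_gene_non_exon_pct:.2f}% of the genome")
print(f"exons as a share of gene span: {exon_share_of_gene_span:.1f}%")
```

In other words, about 60% of the genome lies outside genes altogether, and even within genes only around 7% of the sequence is exon.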

There are many ways that genes are regulated and I will just touch on a few here. One way that genes are regulated is by the binding of special proteins called transcription factors to the DNA. These proteins often interact with the proteins that make RNA copies of the gene and control how fast or slow these copies can be made. The ENCODE project found many sites throughout the genome where different transcription factors bind and they also noticed that a transcription factor will frequently bind with one or more other transcription factors near a gene. By knowing which transcription factors bind where, we can better understand how and when they are regulating a specific gene.

Genes are also regulated by the organization of the DNA itself inside the cell. I will mention two ways that this is examined: through histone modifications (don’t worry, I’ll explain this below) and chromosome interacting regions. Histones make up little balls of proteins that the DNA is wrapped around. Chemical changes to these proteins can change how and where the DNA is wrapped. DNA wrapped around histones can be compared to one of those accordion-style file folders with all of the different sections (genes) that can be expanded or collapsed. When certain histone modifications are present, it is like having an open section of the file folder. Any papers or documents in that section can be easily looked at and accessed; the DNA is wrapped loosely and the gene is accessible to protein interaction and is often active (being read and copied into RNA). When other histone modifications are present, the DNA is wrapped up very tightly and is inaccessible, like when the file folder is squished closed. This prevents many proteins from interacting with that region of DNA. There are other modifications to histones with different effects on the DNA that wraps around them, and the ENCODE project has marked where these modified histones are along the genome in different cell types (since they can vary greatly). Chromosome interacting regions are a little easier to understand, though we have only recently begun designing methods to find and study them. These are regions of the genome that can be close together or far apart, but where one DNA region interacts with another (usually through attached proteins that stick them together). We’ve recently found that these interactions, especially long-distance ones that are hard to identify, are very important for gene regulation, and the ENCODE project is a good step towards a better understanding of where these interactions are and what they may do.

The last bit of data I want to mention here is disease-related SNPs (single nucleotide polymorphisms, or changes to a single letter in a DNA sequence). Many, many studies have been done to try to identify changes in the DNA that cause or increase the risk of getting a disease. With the reduced price of genome sequencing, this is becoming more and more popular. One of the problems we’ve had in the past is that changes were found in a group of people with a certain disease that are not found in healthy persons, but those changes aren’t in any known genes. Thus, it has been hard to determine what effect those changes have on a person. With the in-depth characterization of the genome done by the ENCODE project, we can now try to piece together what some of these SNPs are doing. Maybe a SNP is in a transcription factor binding site (these are usually in front of genes), and while it won’t change the protein made by the gene, it may affect how or when the transcription factor regulating that gene can bind. Perhaps it prevents binding, and so without that transcription factor, the gene is never turned on and no protein is made. This can cause a major defect in a cell and could easily be related to a disease. The list of possibilities goes on, and as we better understand the function and importance of each region of the genome, we will get better at identifying how disease-related SNPs cause the disease.
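The basic question here, "which annotated region does this SNP fall in?", is simple enough to sketch in a few lines of Python. The coordinates, feature names, and SNP names below are all made up for illustration, and real annotation work uses dedicated genomics tools and file formats rather than a hand-written list; treat this purely as a picture of the idea.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def features_containing(position, intervals):
    """Names of all annotated features whose span covers `position`."""
    return [iv.name for iv in intervals if iv.start <= position < iv.end]


# Hypothetical annotations on one chromosome: a binding site in front of a gene's exon.
features = [
    Interval("TF_binding_site", 100, 112),
    Interval("exon_1", 150, 300),
]

# Hypothetical SNP positions.
snps = {"snp_A": 105, "snp_B": 200, "snp_C": 500}

annotations = {snp: features_containing(pos, features) for snp, pos in snps.items()}
for snp, hits in annotations.items():
    print(snp, "->", hits if hits else "no annotated feature")
```

Here `snp_A` lands in the binding site without touching the exon, which is exactly the kind of SNP that used to be hard to interpret: it leaves the protein unchanged but could change whether the gene gets turned on.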

Big Picture: The results I have mentioned here are just the very tip of the iceberg of ENCODE project results. I hope they at least give you a sense of what this project is doing and why it is important and exciting. This is also by no means the end of the project: more data will be gathered by other labs, and many people will reference these data for their favorite gene or region of interest and use them to learn more about the deep complexity of our genome. It is papers like this that put me in awe again at just how much information can be stored in a sequence of four letters, and at the beautiful complexity of how this information is used to make a living, breathing human being.