Researchers from the United States conducted a comparative study of 1,773 human metagenomes and found about four thousand protein families, most of which had not been described before because their small size makes them hard to identify. About thirty percent of the proteins found are involved in intercellular interactions, according to the article published in the journal Cell.
Finding short proteins, those less than 50 amino acids long, is difficult. When a new genome is annotated (that is, when the positions of genes are marked on its sequence), short sequences are usually ignored, because the chance of mistaking a random stretch of the genome for a real gene is too high. Short proteins are poorly represented in databases, so searching for new sequences by their similarity to known ones does not work well. Moreover, even when similar short stretches are found, it is hard to prove that they are homologs (related sequences) rather than chance matches. Proteomic methods such as mass spectrometry also fail here, because these proteins are missing from the databases. Meanwhile, the small proteins that have been found often have intriguing properties; for example, they aid in intercellular communication.
A new study shows that the small proteins already known make up only a small fraction of the real number. Using a combination of open-access genetic data, Hila Sberro of Stanford and her colleagues discovered and described about four thousand protein families, most of which are new and have no relatives in existing databases. As source material they used 1,773 human metagenomes from the Human Microbiome Project. In these, every potential open reading frame was detected, and a series of filters was then applied in sequence to leave only the desired sequences. First, the reading frames were filtered by length so that the translated protein did not exceed 50 amino acids; they were then clustered into groups of similar sequences, and clusters with fewer than eight potential proteins were removed. This cut off a significant portion of the random sequences, but the researchers then also ran the remaining sequences through a program that can pick out coding sequences on the basis of evolutionary signatures, and at the same time checked each sequence for a ribosome binding site, which is required for protein translation. After this filtering, only 4,539 clusters remained, each corresponding to a distinct group of proteins.
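The first two filters described above (a length cutoff of 50 amino acids, then removal of clusters with fewer than eight members) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the article does not name the tools they used, and the function names, the forward-strand-only ORF scan, and the assumption that each sequence already carries a cluster label are all simplifications made for this example. The later steps (evolutionary signatures and the ribosome binding site check) are not modeled.

```python
from collections import defaultdict

STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_AA = 50           # length cutoff stated in the article
MIN_CLUSTER_SIZE = 8  # clusters with fewer members are discarded


def find_small_orfs(dna, max_aa=MAX_AA):
    """Return forward-strand ORFs (ATG ... stop) encoding at most max_aa residues.

    Hypothetical helper: a real annotation tool would also scan the reverse
    strand and consider alternative start codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_aa = (i - start) // 3  # residues before the stop codon
                if n_aa <= max_aa:
                    orfs.append(dna[start:i + 3])
                start = None
    return orfs


def drop_small_clusters(labelled_orfs, min_size=MIN_CLUSTER_SIZE):
    """Group (cluster_label, sequence) pairs and keep clusters with >= min_size members.

    The cluster labels are assumed to come from a separate similarity
    clustering step that is not shown here.
    """
    clusters = defaultdict(list)
    for label, orf in labelled_orfs:
        clusters[label].append(orf)
    return {label: seqs for label, seqs in clusters.items() if len(seqs) >= min_size}


if __name__ == "__main__":
    print(find_small_orfs("CCATGAAATTTTAGGG"))
    toy = [("a", "ATGAAATAG")] * 8 + [("b", "ATGCCCTAA")] * 3
    print(sorted(drop_small_clusters(toy)))
```

The size threshold on clusters works as a noise filter: a random stretch of DNA that happens to look like a short gene is unlikely to recur in similar form across many samples, whereas a real gene family is.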
Most of the protein families found were previously unknown: when they were compared against databases, only 190 families in total had reasonably similar sequences in the domain database, and only about a quarter had any annotated homologs at all. As mentioned above, standard methods are not tailored to identifying short proteins, so the authors suggest that many of them are missed when genomes are annotated, which is why so few homologs were found. To work around this, they re-annotated the genomes in the database with the restriction on reading frame size removed, and then repeated the search. Thanks to this step, they found homologs for a further 27 percent of the protein families, but about half of the families were still left with no link to already known genes.
To confirm that the genes found are real, the researchers checked whether proteins are actually synthesized from them. To do this, they used metatranscriptomic data: like ordinary transcriptomes, these contain the sequences of active genes, but for all the inhabitants of a sample at once rather than for a single organism. It turned out that 75 percent of the genes with homologs in the metatranscriptomes were active. In addition, for genes belonging to the bacterium Bacteroides thetaiotaomicron, it was possible to show that 40 percent of the homologs found in it are not only transcribed into RNA but also translated into protein.
In the next step, the researchers tried to understand what these proteins do. Since most of them have no homologs of known function, it was impossible to infer their function by transferring it from an already known relative. However, the researchers were able to separate conserved, widespread housekeeping proteins from niche-specific ones that occur, for example, only in oral or only in intestinal metagenomes, and along the way they discovered a new ribosomal protein. The scientists separately identified protein families whose sequences contain a transmembrane domain and a secretion signal, a sign that the protein is used not inside the cell but outside it. Such families make up about thirty percent of those found, and the researchers believe they are involved in cell-to-cell interaction. Another common group, detected thanks to the specific structure of its sequences, consists of proteins that protect bacteria from phages.
All this suggests that the importance of short proteins is greatly underestimated. Because they are inconvenient to work with and hard to identify with standard methods, they are heavily underrepresented in databases, while in fact they are numerous and play an important role in the life of cells. This has been demonstrated by careful processing of metagenomic data, which contain DNA sequences not from a single species but from many at once, ideally from all that were in the sample, including species not previously described. Such a wealth of metagenomic data makes it possible to use them not only to search for new proteins but also to predict their three-dimensional structures.