Abstract:
NLR (Nucleotide-Binding Domain Leucine-Rich Repeat) proteins have central roles in plant immunity, either by directly detecting pathogen proteins or by monitoring the effects of pathogens on plant proteins. To cope with the often rapid changes in the spectrum of pathogens they encounter, plants typically have a diverse NLR gene repertoire. Moreover, both the copy number of NLR genes and their sequences vary greatly within a species, presumably reflecting changing pathogen pressures. Many outstanding questions are associated with NLR repertoires. For example, two Arabidopsis thaliana NLR genes, RPM1 (RESISTANCE TO PSEUDOMONAS SYRINGAE PV. MACULICOLA 1) and RPS5 (RESISTANCE TO P. SYRINGAE 5), confer resistance to specific pathogens, but are known to carry fitness costs for the plant that can reach 10%, a surprisingly high value for natural fitness. Both genes also exhibit long-standing presence/absence (P/A) polymorphism, in which some plants in a population carry the gene while others lack it. The frequency of P/A genes has been estimated to be around 9% of all genes in the A. thaliana genome. This raises the question of how plants can afford such high fitness costs, and how common genes like RPM1 and RPS5 are in plant genomes and in the NLR repertoires of specific individuals. NLR genes are also interesting from a genomics perspective as the most variable gene family in plants. They are among the most repetitive families, often present as clusters of tandem duplicates in the genome. For these reasons, NLR genes lend themselves neither to regular reference-based analysis nor to de novo assembly. Thus, despite the availability of large amounts of sequencing data for the model plant A. thaliana, no detailed evaluation of variation in the complete repertoire of NLR genes within this species exists.
In this study, I visualize and analyze patterns of diversity characteristic of NLR genes in A. thaliana. In the first chapter, I propose a method for the profiling of complex hypervariable regions of the genome based on short reads, even when only one reliable reference is available. In the second chapter, I apply the method to the reference set of 163 NLR genes in 80 accessions of A. thaliana. In the third chapter, I carry out a between-species comparison by applying my method to 26 accessions of Arabidopsis lyrata and 22 accessions of Capsella rubella, which represent the closest species and genus to A. thaliana, respectively. I compare these results with within-species polymorphism in A. thaliana. I found that NLR patterns of diversity fall into three categories: conserved (present) genes, P/A genes, and genes with complex variation patterns. I identified 53 conserved NLR genes, of which 24 were also present in A. lyrata and C. rubella, and 52 P/A genes, of which several, such as ADR1-L3 (ACTIVATED DISEASE RESISTANCE 1-LIKE3), also had a P/A-like pattern in the two other species. I combined variation patterns with genomic context and nucleotide diversity information to identify, genome-wide, P/A genes with diversity patterns reminiscent of RPM1 and RPS5. I carried out a genome-wide association study (GWAS) on P/A polymorphisms across the genome, and found that RPS5 is among the most significant genes across multiple phenotypes. RPM1 and RPS5 also show the second and third highest conservation in A. lyrata among NLR P/A genes. I conclude that genes like RPM1 and RPS5 are rare both among NLR genes and in the whole genome. I found no statistically significant enrichment for domain architecture type of NLR genes (TIR-/CC-NB-LRR) nor for genomic arrangement (single/clustered) in the within-species comparison. However, clustered genes were more variable than single genes, and there was significant enrichment of single genes among A. thaliana NLR genes that were also present in A. lyrata and C. rubella. My results reveal new insights into the NLR repertoires in Arabidopsis genomes.