Canine reference genome accuracy impacts variant calling: Lessons learned from investigating embryonic lethal variants

Dog Biomedical Variant Database Consortium

Research output: Contribution to journal › Article › peer-review



Deficient homozygosity of a variant maintained in a population suggests that the variant may be embryonic lethal. We examined whole genome sequence data from 675 canids to search for variants with missing homozygosity and high predicted impact. Our analysis identified 45 variants in 32 genes. However, further scrutiny of the sequence reads revealed that all but one of these variants were artifacts of the variant calling process when using CanFam3.1, a widely used canine reference genome. We demonstrate that the use of multiple, newer reference genomes could reduce artifacts and lead to more accurate variant identification.
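As a rough illustration of the reasoning behind "deficient homozygosity": under Hardy–Weinberg equilibrium, the expected number of homozygotes for an allele at frequency q in n individuals is n·q². The sketch below computes that expectation. The function name and the example allele frequency are illustrative assumptions, not values from the study; only the cohort size of 675 comes from the text.

```python
def expected_homozygotes(n_individuals: int, allele_freq: float) -> float:
    """Expected number of homozygotes for an allele under Hardy-Weinberg equilibrium."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    # Under HWE the homozygote genotype frequency is q squared.
    return n_individuals * allele_freq ** 2


# 675 is the cohort size from the text; 0.20 is a hypothetical allele frequency.
print(expected_homozygotes(675, 0.20))
```

With these inputs the expectation is about 27 homozygotes, so observing none or very few would be the kind of deficit that flags a variant for follow-up.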

Original language: English (US)
Pages (from-to): 706-708
Number of pages: 3
Journal: Animal Genetics
Issue number: 5
State: Published - Oct 2022

Bibliographical note

Funding Information:
Variants with considerably less frequent homozygosity than expected under Hardy–Weinberg equilibrium (HWE) have been suggested to be embryonic lethal or to cause developmental disorders in many species, including canids (Georges et al., 2019). Using missing homozygosity to annotate variants, we characterized novel and previously reported variants for potential impairment of canine development; we then examined the reliability of the mapping used to determine missing homozygosity.

Whole genome sequence data from 675 canids mapped to CanFam3.1 from the Dog Biomedical Variant Database Consortium were analyzed using the Ensembl Variant Effect Predictor (version 101) for variant impact prediction and Sorting Intolerant From Tolerant (SIFT) scoring to predict missense variant impact (Howe et al., 2021; Jagannathan et al., 2019; McLaren et al., 2016; Vaser et al., 2016). Variants marked as ‘high-impact’ or having a SIFT prediction score of 0.0–0.05 (‘not tolerated’) were considered deleterious. Allelic frequencies and HWE exact test statistics were determined using plink2.3a (Chang et al., 2015; Wigginton et al., 2005). A minimum minor allele frequency of 5% and a maximum heterozygote frequency of 25% were set, using the maximum frequencies reported in cattle for recessive lethal variants as a guideline (Upperman et al., 2019). Variants with HWE p-values > 0.05 following a correction to control the false discovery rate were discarded, as were variants with more than two homozygous cases in the dataset (Benjamini & Yekutieli, 2001).

This pipeline yielded 45 variants, located within 32 genes, that were suspected to be embryonic lethal or to cause developmental disorders owing to their missing homozygosity and predicted deleterious impact (Table S1). The variants were further investigated using a separate cohort of 39 dogs whose whole genome sequence reads were mapped to CanFam3.1.
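The filtering criteria described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Variant` record and its field names are hypothetical stand-ins for annotation and PLINK output, the Benjamini–Yekutieli adjustment is a plain reimplementation, and applying the correction across all variants passed in (rather than after the other filters) is an assumption, since the text does not specify the order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative, not from the study.
    name: str
    impact: str                  # predicted impact class, e.g. "HIGH"
    sift_score: Optional[float]  # None when no SIFT prediction exists
    maf: float                   # minor allele frequency
    het_freq: float              # heterozygote frequency
    hom_alt_count: int           # homozygous cases in the dataset
    hwe_p: float                 # unadjusted HWE exact test p-value


def benjamini_yekutieli(p_values: List[float]) -> List[float]:
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    c_m = sum(1.0 / k for k in range(1, m + 1))  # harmonic correction term
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_deleterious(v: Variant) -> bool:
    """High-impact, or SIFT score in the 0.0-0.05 'not tolerated' range."""
    sift_hit = v.sift_score is not None and 0.0 <= v.sift_score <= 0.05
    return v.impact == "HIGH" or sift_hit


def candidate_lethals(variants: List[Variant]) -> List[Variant]:
    """Apply the thresholds quoted in the text to a list of variants."""
    adjusted = benjamini_yekutieli([v.hwe_p for v in variants])
    return [
        v for v, q in zip(variants, adjusted)
        if is_deleterious(v)
        and v.maf >= 0.05          # minimum minor allele frequency of 5%
        and v.het_freq <= 0.25     # maximum heterozygote frequency of 25%
        and q <= 0.05              # discard adjusted HWE p-values > 0.05
        and v.hom_alt_count <= 2   # discard more than two homozygous cases
    ]
```

A filter like this only ranks candidates by the numbers it is given; as the following paragraphs show, genotype counts derived from a flawed reference can pass every threshold and still be artifacts.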
Integrative Genomics Viewer was used to manually assess sequence read quality in variant regions for those 39 dogs, and BLAST was run on selected sequence reads against the five available dog reference genomes on NCBI (annotation release 105) to check for mismapping (Altschul et al., 1990; Robinson et al., 2017). Of the 45 predicted deleterious variants deficient in homozygosity, only a single variant appeared to be valid and to have a developmental impact. The variant is a frameshift in ENSCAFG00000043059 (CFA 1), a zinc finger gene with widespread embryonic expression, making it a promising candidate for further investigation (Megquier et al., 2019; NCBI, 2021). The remaining identified variants appear to be false positives.

Mismapping of sequence reads occurred for two main reasons. First, CanFam3.1 is based on a female boxer, and thus sequence reads from the Y chromosome are incorrectly mapped to autosomes (Tsai et al., 2019). For example, the suspected deleterious variant ascribed to CASP6, a gene found on chromosome CFA 32, is actually associated with sequence reads that most accurately align to an intergenic region found on CFA Y (Table S1). Second, CanFam3.1 possesses over 20 000 gaps, with approximately 20% of them occurring within genes (Wang et al., 2021). Sequence reads that cover these gaps either erroneously map within the correct gene or map to an entirely incorrect gene. In either case, the variant would appear to be in a heterozygous state and would never be called homozygous, which could then be interpreted as missing homozygosity and attributed to causing embryonic lethality or developmental disorders (Jagannathan et al., 2019). The high percentage of candidate genes indicated by false variants owing to assembly issues is a major obstacle to studies focusing on loci with high heterozygosity.

Several other dog genome assemblies (Table S2) have been recently reported that might mitigate the shortcomings of a single alignment (i.e. CanFam3.1), including ROS_Cfam_1.0, UNSW_CanFamBas_1.0, Basenji_breed-1.1, Dog10K_Boxer_Tasha, UU_Cfam_GSD_1.0 and UMICH_Zoey_3.1 (Edwards et al., 2021; Halo et al., 2021; Jagannathan et al., 2021; The Roslin Institute, 2020; Wang et al., 2021). The present findings demonstrate the importance of confirming putative variants across multiple assemblies, as a single reference genome may be insufficient for accurately identifying variants.
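One way to picture the cross-assembly confirmation step is as a comparison between the chromosome on which a variant was called and the chromosome of the best alignment for its supporting reads across several assemblies. The sketch below assumes the alignment results have already been summarized by hand or by a separate tool; the `BestHits` structure, the function name, the assembly labels and the identity values are all hypothetical and are not taken from the study.

```python
from typing import Dict, Tuple

# Hypothetical summary: assembly label -> (chromosome of best hit, percent identity)
BestHits = Dict[str, Tuple[str, float]]


def flag_mismapping(called_chrom: str, best_hits: BestHits) -> bool:
    """Return True when the highest-identity hit across assemblies lies on a
    different chromosome than the one the variant was originally called on."""
    if not best_hits:
        raise ValueError("best_hits must contain at least one assembly")
    # Pick the assembly whose best hit has the highest percent identity.
    best_assembly = max(best_hits, key=lambda name: best_hits[name][1])
    best_chrom, _identity = best_hits[best_assembly]
    return best_chrom != called_chrom


# Illustrative values only, loosely modelled on the CASP6 example in the text,
# where reads assigned to CFA 32 aligned best to CFA Y.
hits = {
    "assembly_A": ("CFA32", 97.0),  # hypothetical assembly without a Y chromosome
    "assembly_B": ("CFAY", 100.0),  # hypothetical assembly with a Y chromosome
}
print(flag_mismapping("CFA32", hits))
```

A flag from a check like this is a prompt for manual inspection of the reads rather than a verdict, since identity alone does not account for alignment length or repeats.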

Funding Information:
Catherine André and Christophe Hitte were supported by the French Cani‐DNA CRB ( http://dog‐ ), which is part of the CRB‐Anim infrastructure, ANR‐11‐INBS‐0003. Kari J. Ekenstedt was supported by the Office of the Director, National Institutes of Health under award number K01‐OD027051. Eva Furrow was supported by the Office of the Director, National Institutes of Health under award number K01‐OD019912. Tosso Leeb was supported by the Albert‐Heim Foundation (project no. 105). Kim M. Summers receives core support from the Mater Foundation, Brisbane, Australia.

Publisher Copyright:
© 2022 Stichting International Foundation for Animal Genetics.


  • bioinformatics
  • canis lupus familiaris
  • genomics
  • lethal recessive
  • sequence mapping
  • whole genome sequencing

