
HGVBASEG2P OVERVIEW

The Human Genome Variation database of Genotype-to-Phenotype information (HGVbaseG2P) provides a centralized compilation of summary level findings from genetic association studies, both large and small. We actively gather datasets from public domain projects, and encourage direct data submission from the community.

HGVbaseG2P is built upon a basal layer of Markers that comprises all known SNPs and other variants from public databases such as dbSNP and the DBGV. Allele and genotype frequency data, plus genetic association significance findings, are added on top of the Marker data, and organised the same way that investigations are reported in typical journal manuscripts. Critically, no individual level genotypes or phenotypes are presented in HGVbaseG2P - only group level aggregated (summary level) data. The largest unit in a data submission is a Study, which can be thought of as being equivalent to one journal article. This may contain one or more Experiments, one or more Sample Panels of test subjects, and one or more Phenotypes. Sample Panels may be characterised in terms of various Phenotypes, and they also may be combined and/or split into Assayed Panels. The Assayed Panels are used as the basis for reporting allele/genotype frequencies (in 'Genotype Experiments') and/or genetic association findings (in 'Analysis Experiments'). Environmental factors are handled as part of the Sample Panel and Assayed Panel data structures.
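The relationships described above can be pictured with a small sketch. This is an illustration only, not the HGVbaseG2P schema or the PaGE-OM standard: every class name, field name, and example value below is a hypothetical choice made for this example, assuming each Study simply owns lists of its Experiments, Sample Panels, and Phenotypes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Phenotype:
    name: str


@dataclass
class SamplePanel:
    # A group of test subjects, characterised by zero or more Phenotypes.
    name: str
    size: int
    phenotypes: List[Phenotype] = field(default_factory=list)


@dataclass
class AssayedPanel:
    # Built by combining and/or splitting one or more Sample Panels.
    name: str
    size: int
    source_panels: List[SamplePanel] = field(default_factory=list)


@dataclass
class Experiment:
    # kind is "genotype" (frequencies) or "analysis" (association findings).
    name: str
    kind: str
    assayed_panels: List[AssayedPanel] = field(default_factory=list)


@dataclass
class Study:
    # The largest unit of a submission, roughly one journal article.
    title: str
    experiments: List[Experiment] = field(default_factory=list)
    sample_panels: List[SamplePanel] = field(default_factory=list)
    phenotypes: List[Phenotype] = field(default_factory=list)


# Hypothetical example: one case panel and one control panel, each assayed whole.
disease = Phenotype("example disease")
cases = SamplePanel("cases", 500, [disease])
controls = SamplePanel("controls", 480)
assayed_cases = AssayedPanel("assayed cases", 500, [cases])
assayed_controls = AssayedPanel("assayed controls", 480, [controls])
analysis = Experiment("association test", "analysis",
                      [assayed_cases, assayed_controls])
study = Study("Example association study", [analysis],
              [cases, controls], [disease])
```

Note that only group level sizes and labels appear in the sketch, in keeping with the summary level nature of the data.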

DATABASE SCOPE AND CONTENT

The Human Genome Variation Genotype-to-Phenotype database (HGVbaseG2P) aims to provide an extensive, centralized compilation of summary level findings from human genetic association studies, both large and small. This is needed so that researchers have an easy way to access the totality of association study data in existence for their genes, genome regions, or diseases of interest. Such a depository will allow true positive signals to be more readily distinguished from false positives (type I errors) that fail to replicate consistently, aid in the identification of technical artefacts in genotyping procedures, and elucidate population specific signals. It will also minimize the serious problem of publication bias, which cannot be solved unless sites like HGVbaseG2P are created where negative studies can be reported just as easily and as quickly as positive studies.

To produce the content of HGVbaseG2P, our curators actively gather large datasets, such as Whole Genome Association (GWA) study findings, from many different public domain projects. We intend to grow this into a comprehensive effort in the very near future. All the data sources we have approached for help in this regard have been extremely helpful and forthcoming, and many automated data gathering pipelines from the larger projects have already been set up. For smaller datasets, such as gene or region specific investigations and replication efforts by small and medium sized laboratories, we encourage researchers to submit their study findings to us and provide a help text on

For the future, we are devising a standalone tool that submitters will be able to download and install locally, which will actively guide researchers through the process of gathering and checking their data before submitting it to us. The tool will organize their submission content into an XML formatted document that is stored on the submitter

As mentioned above, data in HGVbaseG2P is restricted to summary level or aggregated information (i.e., results on groups of individuals, but no individual level genotypes or phenotypes). As a consequence, the project is not impacted by issues such as anonymising individuals, gaining informed consent, or data security. All records do, however, carry links and acknowledgements that lead back to the original data source, so that users who wish to obtain the non-aggregated data can make suitable requests to the relevant data access authorities.

The complete content of HGVbaseG2P is available for download from our main ftp page. As of July 2008, this includes the full set of markers listed in dbSNP. In the near future we will also incorporate the data from UniSTS and DBGV, plus all current dbSNP allele and genotype frequency data. Upon each new build of these depositories we synchronise HGVbaseG2P records with any changed content in those new releases (e.g. marker or allele deletions or mergers) using custom software we have built for this purpose, and present everything on the same DNA strand as used by the marker source database. This updating procedure also ensures that correct relationships are retained between markers/alleles and their frequency and association constituents.
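To make the idea of synchronising against a new source build more concrete, here is a minimal sketch. It is not the project's custom software: the function name, the dictionary layout, and the marker identifiers are all hypothetical, and it assumes a new build can be summarised as a map of merged marker IDs plus a set of deleted marker IDs. Frequency entries attached to a merged marker are carried over to the surviving marker so that their relationship is retained.

```python
from typing import Dict, List, Set


def synchronise_markers(
    records: Dict[str, Dict[str, List[str]]],
    merged: Dict[str, str],
    deleted: Set[str],
) -> Dict[str, Dict[str, List[str]]]:
    """Apply marker merges and deletions from a new source build.

    records: marker ID -> {"frequencies": [...]} (hypothetical layout)
    merged:  old marker ID -> marker ID it was merged into
    deleted: marker IDs withdrawn from the source database
    """
    updated: Dict[str, Dict[str, List[str]]] = {}
    for marker_id, record in records.items():
        if marker_id in deleted:
            continue
        # Follow the merge chain to the surviving marker, guarding against loops.
        target = marker_id
        seen = {target}
        while target in merged:
            target = merged[target]
            if target in seen:
                raise ValueError(f"cyclic merge involving {marker_id}")
            seen.add(target)
        if target in deleted:
            continue
        # Re-attach the frequency data to the surviving marker.
        entry = updated.setdefault(target, {"frequencies": []})
        entry["frequencies"].extend(record.get("frequencies", []))
    return updated


# Hypothetical identifiers: rs2 was merged into rs1, and rs3 was deleted.
example_records = {
    "rs1": {"frequencies": ["f1"]},
    "rs2": {"frequencies": ["f2"]},
    "rs3": {"frequencies": ["f3"]},
}
synced = synchronise_markers(example_records, {"rs2": "rs1"}, {"rs3"})
```

The sketch returns a new mapping rather than modifying the input, so the previous state stays available for comparison. Strand handling is left out, since it depends on details of the source databases not covered here.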

Genetic association study data, along with allele and genotype frequency findings, are layered on top of the extensive marker content of HGVbaseG2P. This layer of 'laboratory results' is complex in nature, and so to help users navigate their way through it we structured it in a way compatible with a journal article: a 'Study' with various Experiments, Sample Panels, and Phenotypes within it, wherein each Experiment contains data on Markers, Frequencies, and Associations. More information on these Study components and their relationships is provided here. The data modelling behind this Study and Experiment concept was then taken forward by many Institutes worldwide to create what is now an Object Management Group approved standard called the 'Phenotype And Genotype Experiment Object model' (PaGE-OM). To find out more about the HGVbaseG2P data model, download the data model diagram and/or the MySQL relational schema definition.

GENERAL INFORMATION

Database History

HGVbaseG2P began life as a joint venture between the research team of Professor Anthony Brookes in the Karolinska Institute (Sweden) and staff at Interactiva GmbH (Germany). It was first released to the public in August 1998, and called the Human Genome Bi-Allelic SEquence (HGBASE) database, with a focus solely on providing a centralized collection of known human single nucleotide polymorphisms and other simple DNA variants. One year later, the program expanded and the database structure was completely overhauled via a consortium involving the Karolinska Institute, Sweden (Anthony Brookes), the European Bioinformatics Institute, UK (Heikki Lehvaslaiho), and the European Molecular Biology Laboratory, Heidelberg (Peer Bork). Funding at that stage was provided by each of the three participating institutes, plus corporate support from Pfizer and GlaxoSmithKline. In November 2001 the project adopted the new name Human Genome Variation database (HGVbase), as this better reflected the scope of the database, its emphasis on broad data collection from many different laboratories, and its additional new role as a central depository for data collection efforts in allegiance with the Human Genome Variation Society (HGVS).

In 2004, Professor Brookes moved from the Karolinska Institute in Sweden to the University of Leicester in the UK, and in light of impressive SNP discovery efforts by the TSC and HapMap projects, HGVbase was scaled back to simply provide an alternative representation of the full marker list from dbSNP. At the same time, Professor Brookes's team began to develop the Human Genome Variation Genotype-to-Phenotype database (HGVbaseG2P), representing the natural evolution of HGVbase into a central database for summary-level genetic association data. The work was funded by GlaxoSmithKline, the University of Leicester, and the European Community's Sixth Framework Programme ('INFOBIOMED' Network of Excellence) and Seventh Framework Programme ('GEN2PHEN' Integrated Project). Early work in the project involved devising a powerful way of modeling phenotype and genotype-phenotype data, which itself was adopted and adapted to become the global standard 'Phenotype And Genotype Experiment Object model' (PaGE-OM). HGVbaseG2P then went live and replaced HGVbase in the summer of 2008, thereby extending the project's marker content to a far broader and more comprehensive range of markers (i.e., SNPs, structural variants, and STSs), along with extensive disease association findings for many alleles and genotypes.

Looking forward, HGVbaseG2P represents a core component of the GEN2PHEN project, intending to provide an operational model, plus free software, to help others create many similar databases across the world. These will be hosted by Institutes, Consortia, and even individual laboratories, to provide those groups with a way to post their genetic association findings on the web and have them integrated into the rapidly emerging network of similar valuable resources.

Intellectual Property Considerations

Individual records in HGVbaseG2P remain the intellectual property of the various original data sources, and any questions concerning acceptable use, copying, storage or distribution of individual data items should be directed to those sources. HGVbaseG2P claims ownership and copyright on the total compilation of records in this database, and provides open and free access to the full database content for searching, browsing, record copying, and further dissemination on the strict understanding that no records are altered in any way other than trivial format changes, and that HGVbaseG2P and the original data sources are acknowledged in any subsequent reporting or sharing of this information. The software developed in the HGVbaseG2P project is made fully and freely available under the standard terms of the GNU General Public License Version 3.

Citations

To cite HGVbaseG2P in scientific communications, please state the full database name, acronym, and URL [Human Genome Variation Genotype-to-Phenotype database, (HGVbaseG2P) at www.hgvbaseg2p.org] along with any of these publication references, if appropriate, to show the project's history:

D.Fredman, M.Siegfried, Y.P.Yuan, P.Bork, H.Lehvaslaiho, A.J.Brookes.
HGVbase: A human sequence variation database emphasizing data quality and a broad spectrum of data sources.
Nucleic Acids Research, (2002) 30:387-391.

A.J.Brookes, H.Lehvaslaiho, M.Siegfried, J.G.Boehm, Y.P.Yuan, C.M.Sarkar, P.Bork, F.Ortigao.
HGBASE: a database of SNPs and other variations in and around human genes.
Nucleic Acids Research, (2000) 28:356-360.

Acknowledgements

First and foremost, we thank the many groups who have generated and provided the record content of HGVbaseG2P. We are most pleased and impressed by the extremely positive attitude shown by the many teams we have worked with on the data gathering challenge.

Funding for HGVbaseG2P is currently provided by the University of Leicester, GlaxoSmithKline, and the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 200754 (the GEN2PHEN project).
Previous funding in the database's history has been provided by Interactiva GmbH (Germany), the European Bioinformatics Institute (UK), the European Molecular Biology Laboratory (Heidelberg), Pfizer, GlaxoSmithKline, the Karolinska Institute (Sweden), and the European Community's Sixth Framework Programme ('INFOBIOMED' Network of Excellence).