FamAgg: an R package to evaluate familial aggregation of traits in large pedigrees

Summary: Familial aggregation analysis is the first fundamental step to perform when assessing the extent of genetic background of a disease. However, there is a lack of software to analyze the familial clustering of complex phenotypes in very large pedigrees. Such pedigrees can be utilized to calculate measures that express trait aggregation on both the family and individual level, providing valuable directions in choosing families for detailed follow-up studies. We developed FamAgg, an open source R package that contains both established and novel methods to investigate familial aggregation of traits in large pedigrees. We demonstrate its use and interpretation by analyzing a publicly available cancer dataset with more than 20 000 participants distributed across approximately 400 families. Availability and implementation: The FamAgg package is freely available at the Bioconductor repository, http://www.bioconductor.org/packages/FamAgg. Contact: Christian.Weichenberger@eurac.edu Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
This package provides basic pedigree analysis and plotting utilities as well as a variety of methods to evaluate familial clustering of cases from a given trait. Identification of families or groups of individuals within families with significant aggregation of cases can aid also in the selection of interesting and promising individuals for whole genome or exome sequencing projects.
For kinship coefficient calculations and pedigree plotting the package relies and extends the functionality of the kinship2 package [1].

Basic pedigree operations
In the examples below we perform some simple pedigree operations, such as plotting the pedigree for an individual or family, finding the closest common ancestor for a set of individuals in a pedigree or retrieving the identifiers (IDs) of all ancestors for an individual. Basic pedigree information is stored in FAData objects, thus we first generate such an object from a subset of the Minnesota Breast Cancer Study provided by the kinship2 package. In the example below, we generate the FAData providing a data.frame with the pedigree data, alternatively, the pedigree information could be imported from a file (see Section 3). Upon data set creation the kinship matrix (i.e. a matrix containing the kinship coefficient between each pair of individuals in the whole pedigree) is internally calculated using the functionality from the kinship2 package [1].
library(FamAgg) data(minnbreast) ## Subsetting to only few families of the whole data set. mbsub <-minnbreast[minnbreast$famid %in% 4:14, ] mbped <-mbsub[, c("famid", "id", "fatherid", "motherid", "sex")] ## Renaming column names. colnames(mbped) <-c("family", "id", "father", "mother", "sex") ## Defining the optional argument age. endage <-mbsub$endage names(endage) <-mbsub$id ## Create the object. fad <-FAData(pedigree=mbped, age=endage) We can access all the pedigree information stored in this object using the pedigree method, but also using $. The row names of the pedigree data.frame as well as the names of the vectors returned by $ are the IDs of the individuals in the pedigree. ## Use the pedigree method to access the full pedigree ## data.frame, head(pedigree(fad)) ## family id father mother sex ## 1 4  To extract the pedigree for a single family we can use the family method, specifying either the ID of the family or the ID of an individual in the family. ## Extract the pedigree information from family "4"... nrow(family(fad, family=4)) ## [1] 43 head(family(fad, family=4)) ## family id father mother sex ## 1 4  To explore this family we can plot its pedigree. By default, the plotting capabilities of the kinship2 package are used to plot pedigrees, but alternatively, if all required dependencies are available, the HaploPainter [2] perl script (http://haplopainter.sourceforge.net/) can be used instead. The switchPlotfun function can be used to switch the plotting back-end. Available arguments are ks2paint and haplopaint for kinship2 and HaploPainter plotting, respectively. Note however, that HaploPainter only allows to export plots to a file, while kinship2 plotting allows, in addition to export the plot, also to show it as a standard R plot.
Below we use the switchPlotfun to ensure the use of kinship2 plotting (usually not required) and plot the full available pedigree of individual 3. If the age of individuals is available, it will be plotted below the individual's ID.
switchPlotfun("ks2paint") ## By supplying device="plot", we specify that we wish to visualize the ## pedigree in an R plot. This is the default for "ks2paint", anyway. plotPed(fad, id=3, device="plot") A family pedigree may consist of many founder couples (i.e. individuals for which neither father nor mother is defined in the pedigree). To identify the pedigree's founder couple (being the couple with the largest number of offspring generations in the pedigree) the findFounders method can be used. Note that the function returns always only one couple, even if there might be two founder couples in the family pedigree with the same number of offspring generations. ## Find founders for family 4. findFounders(fad, "4") ## [1] "1" "2" Alternatively, it might be of interest to determine the closest common ancestor between individuals in a pedigree. Below we use the getCommonAncestor method to identify the common ancestor for individuals 21, 22 and 17 (which we know from the pedigree a bit above are 1 and 2). ## Find the closest common ancestor. getCommonAncestor(fad, id=c(21, 22, 17)) ## [1] "1" "2" Other useful methods are getChildren, getAncestors and getSiblings, that return the children (or all offspring generations up to a specified level), the parents (or all ancestors) or the siblings for the specified individuals, respectively. ## Get the children of ID 4. getChildren(fad, id="4", max.generations=1) ## Get the offsprings. getChildren(fad, id="4") ## [1] "3" "21" "22" "23" ## Get all ancestors. getAncestors(fad, id="4") ## Get the siblings. getSiblings(fad, id=c("4")) ## [1] "4" "5" "6" "7" "8" "9" "10" Unexpectedly, only in few families one of the founders is affected. For the other families additional (unaffected) ancestors might have been added at a later time point.
Next we get the number of affected individuals that are related to these affected founders.

Pedigree analysis methods
In this section we perform some more advanced pedigree operations. First, we identify all individuals in the pedigree that share kinship with individual 4. ## Get all individuals sharing kinship with individual 4. shareKinship(fad, id="4") ## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" ## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" Next, we determine generations within the pedigree. Generations can only be estimated for a single family, since in most instances e.g. the year of birth is not available. Thus, generations are estimated considering the relation between individuals, starting from the founder couple, i.e. generation 0, assigning generation 1 to their children and all the mates of their children and so on. The estimateGenerations method calculates such generation numbers for each family defined in the object (or for a single family, if the family ID is provided). The result is returned as a list with the list names corresponding to the family ID and the list elements being the estimated generation numbers (with names corresponding to the ID of the respective individual).  1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 2  Individuals without generation level (i.e. with an NA) are not connected to any other individual in the pedigree (and thus most likely represent errors in the pedigree).
In addition, it is also possible to calculate generation levels relative to a (single) specified individual: gens <-generationsFrom(fad, id="4") We can render these generation numbers into the pedigree: plotPed(fad, family=4, label2=gens)

Additional plotting options
If a trait information is available it might be of interest to highlight affected individuals in the pedigree. Trait information should always be coded as 0 (or FALSE) for unaffected and 1 (or TRUE) for affected. In the example below, we use the cancer information from the Minnesota Breast Cancer Study.

Graph utilities
Pedigrees can also be transformed to graphs using the ped2graph function. That way all graph theory methods implemented in e.g. the igraph package can be applied to pedigrees.
## Transform the full pedigree to a graph. fullGraph <-ped2graph(pedigree(fad)) ## In addition, build the graph for a single family. singleFam <-ped2graph(family(fad, family=4)) We can plot these pedigrees also as graph and could use any of the layout methods provided in the igraph package. ## Build the layout. plot(fullGraph) The connectedSubgraph function implemented in the FamAgg package provides additional functionality to find the smallest connected subgraph of a list of submitted nodes (i.e. individuals).
In the code below we want to extract the smallest possible connected subgraph of the pedigree-graph of family 4 containing individuals 7, 8, 27 and 17.

Testing for familial aggregation
Familial aggregation aims to identify families within large ancestral pedigrees that show a non-random aggregation of traits.
As an example, we analyze here data from the Minnesota Breast Cancer Record, which is provided by the kinship2 package. In brief, this data set consists of genealogical information from 426 unrelated founders diagnosed with breast cancer whose families entered a longitudinal study on cancer in the state of Minnesota (USA) in 1944. Cancer cases are encoded with a 1 in column cancer in the minnbreast data.frame. Note however that, besides breast cancer, also prostate cancer cases are reported. This unfortunately causes a systematic bias in the data set as families were only included if a founder was diagnosed with breast cancer, but all occurrences of both breast and prostate cancer are reported. Based on this bias many of the results below should be taken with caution. Another important information is provided in column endage, which represents either the age of cancer onset, the age at the end of the study or the age at death of the participant.
Note that, to reduce computation time, we perform the analysis only on a subset of families from the Minnesota Breast Cancer record and reduce the number of simulation runs. We specifically selected some families with a high percentage of cancer cases, thus, the analysis presented here is biased. Also, in a real analysis you should increase the nsim argument.
In Section 4.5 we apply the probability test based on the method from Yu et al [6] for inference on family disease clusters. We use the corresponding implementation in the gap package. However, currently gap cannot be directly applied to large pedigrees due to a specific limitation in the implementation.
The genealogical index of familiality, the familial incidence rate and the probability test are well established methods while the kinship sum test and the kinship group test are novel approaches presented here for the first time.

Genealogical index of familiality
We next calculate the genealogical index of familiality (GIF) [4] (referred to as the genealogical index in the original work) for cancer occurrence in a subset of the Minnesota Breast Cancer Record data set. For a given trait (e.g. whether or not an individual was diagnosed with a certain type of cancer), the method computes the mean kinship between affected individuals (cases) in the whole pedigree along with mean kinship values of randomly drawn sets of individuals. The distribution of average kinship values among the control sets is then used to estimate the probability that the observed level of kinship among the cases is due to chance.
Below, we perform the analysis using the genealogicalIndexTest method on the cancer trait. In its default setting, the genealogicalIndexTest function uses all phenotyped individuals in the pedigree as control population from which sets of random samples equal in size to the number of affected are drawn. The column genealogical index of the result data.frame shown above represents the mean kinship between all pairs of affected individuals in the pedigree multiplied by 100000 for easier interpretation. Thus, according to the GIF test, a clustering of cancer cases is present in the analyzed pedigree. The output messages from the method call indicate that some individuals have been excluded from the test since they were either not phenotyped in the trait (i.e. have a missing value in trait), or are not connected in the family pedigree (do not share kinship with any other individual in the pedigree after removing non-phenotyped individuals).
The genealogical index of familiality implementation in this package adds some more flexibility to the original approach. The definition of the appropriate set of control individuals from which random samples are drawn can be specified with the controlSetMethod argument. Also, it is possible to perform a stratified sampling, e.g. if the group of affected cases in a pedigree consists of 5 female and 3 male individuals, submitting the sex of each individual in the pedigree with the argument strata (i.e. strata=fad$sex, with fad being the FAData object on which the analysis is performed) allows the function to define random control sets with the same proportion of male/female individuals.
In the next example, we use the getSexMatched function to define the set of control individuals and also the getExternalMatched submitting the gender information of each individual. The results from both approaches are essentially identical, and in the present data set not that useful, as the Minnesota Breast Cancer data set lists both, breast cancer and prostate cancer in column cancer, thus, the set of control individuals will contain all individuals with known sex. ## Calculate the genealogical index of familiality using random sampling from ## a sex matched control set. giSexMatch <-genealogicalIndexTest(fad, trait=tcancer, traitName="cancer", nsim=nsim, controlSetMethod="getSexMatched") ## Use an external vector to perform the matching. ## The results are essentially identical. giExtMatch <-genealogicalIndexTest(fad, trait=tcancer, traitName="cancer", nsim=nsim, controlSetMethod="getExternalMatched", match.using=fad$sex) Note that any matching or stratified sampling can lead to the exclusion of individuals with missing values in either the matching criteria or the strata.
In the Minnesota Breast Cancer data set, the number of prostate cancer cases is much lower than the number of breast cancer cases, thus, simple random sampling might result in an biased genealogical index of familiality estimate since about the same proportion of male and female individuals will be sampled. To account for such cases a stratified sampling, as performed below, can be used instead. ## Evaluate the proportion of male and femal cases. The genealogical index of familiality can also be estimated by the gif function from the gap R-package. Below we calculate the estimate using both methods and compare the resulting estimate. Note that the gif method reports only the genealogical index of familiality estimate but does not estimate significance.

Familial incidence rate (FIR)
A per-individual risk of e.g. disease can be calculated using the familial incidence rate (FIR, abbreviated as FR in the original work) [5]. This measure considers the kinship of each individual with any affected in a given trait in the pedigree and the time at risk for each individual. Thus, the FIR is an estimate for the risk per gene-time for each individual given the disease-experience in the cohort.

Per family averaged familial incidence rate
In the next example, we calculate the familial incidence rate assessing in addition the significance of each estimate using Monte Carlo simulations. This extension to the original approach from Kerber [5] does also allow stratified sampling. ## Estimate the risk for each individual using the familial incidence rate method. ## We use the endage provided in the Minnesota Breast Cancer Record as ## a measure for time at risk. frTest <-familialIncidenceRateTest(fad, trait=tcancer, traitName="cancer", timeAtRisk=mbsub$endage, nsim=nsim) The familial incidence rate can be extracted easily from the result object using the familialIncidenceRate method or using $fir. Also, the empirical p-value from the simulation analysis and the time at risk can be accessed using the $ operator (i.e. using $pvalue, $tar or $timeAtRisk, respectively). We can also identify the families containing individuals with a significant FIR.
0.000 0.010 0.020 0.030 FIR As expected, the familial incidence rate (i.e., in the present data set, the risk of individuals to get cancer, given their kinship to other cancer cases) for individuals (whether affected or yet unaffected) in this family is higher than in the data set analyzed here.
Next, we plot the pedigree of this family.

Kinship group test
Here we apply the kinship group test to the data set. This test first defines for each affected individual a group of individuals considering only individuals that are as closely related as the most distant affected individual. For each of these kinship groups two tests are then performed, one by comparing the mean kinship among affected in the group with the mean kinship from Monte Carlo simulations (ratio test) and one evaluating the largest observed kinship value between