scoreassoc

scoreassoc is a program which performs a weighted burden analysis to test if there is an excess of rare, functional variants in cases compared with controls, as described in this paper:

A rapid method for combined analysis of common and rare variants at the level of a region, gene, or pathway. D Curtis. Advances and applications in bioinformatics and chemistry: AABC 5, 1. 2012.

It also implements analyses aimed to detect recessive effects by testing for an excess of cases possessing two or more rare functional variants as described in these papers:

Approaches to the detection of recessive effects using next generation sequencing data from outbred populations. D Curtis. Advances and applications in bioinformatics and chemistry: AABC 6, 29. 2013.

Investigation of recessive effects in schizophrenia using next-generation exome sequence data. Curtis D. Ann Hum Genet 2015.

Git

github.com/davenomiddlenamecurtis/scoreassoc – latest version of code

Downloads

scoreassoc.src.40.zip – C/C++ source

scoreassoc.40.zip – C/C++ source plus Windows executables and MSVC solution files

Unzip the file, which should produce a folder called src then chdir to this folder. Then run:

make –f scoreassoc.mak

This command should produce executables called scoreassoc, pathwayAssoc and permPathwayAssoc, which should then be copied to the directory assigned to DCBIN in scoreassoc.mak, by default called bin.

The second zip file contains the precompiled executables for Windows (to be run from the Command Prompt) and also contains .sln and .vproj files to allow compilation with Microsoft Visual Studio for Desktop.

The most convenient way to run scoreassoc if you have sequence data in vcf files is by using a utility called geneVarAssoc, and you are strongly advised to refer to this. It is distributed here: https://github.com/davenomiddlenamecurtis/geneVarAssoc

A script called runpa.sh is also provided to allow you to analyse genes from data stored in a PLINK/SEQ project. You should edit script by changing the project and phenotype to match your own and make sure that both pseq and scoreassoc are on the PATH. Also, make sure that the files funcweights.txt and exclusions.txt are in the current directory. (Examples of these are distributed in the exampleFiles folder.) Then you can analyse any gene by entering, for example:

bash runsa.sh DRD2

(Note that for PLINK/SEQ the gene name is case sensitive.)

The script extracts information for the relevant gene from the PLINK/SEQ project, sets up the necessary data files and then runs scoreassoc. You can instead run scoreassoc with the example files provided by entering:

scoreassoc --psdatafile DRD2.dat --annotfile DRD2.annot --weightfile funcWeights.txt --filterfile DRD2exclusions.txt --outfile DRD2.psao --minweight 80 --ldthreshold 0.7 –dorecessive 1

Method

Variants are weighted according to their rarity and predicted effect. A weighting factor can be specified to provide an arbitrary weight for a very rare variant compared with a common one. The default value for this weighting factor is 10. The weight according to effect means that, for example, an intergenic variant can be given a weight of 1 whereas a stop (nonsense) variant can be given a weight of 20. The weight according to frequency and the weight according to effect are multiplied together to produce an overall weight for each variant. Within a subject, the weights for all variants are added to produce an overall score. The average scores for cases and controls are compared using a t test.

To carry out tests for recessive effects, the programs test whether there is an excess of cases having two or more variants. In order to only consider only variants which are rare and/or have an effect on function a threshold can be specified which the overall weight for the variant must exceed. In order not to include pairs of variants in linkage disequilibrium (LD) with each other, if one variant usually occurs in subjects who also possess a specific different variant then it is discarded. The maximum proportion of times one variant is seen with the other before being discarded is set as an LD threshold. The default value for this LD threshold is 0.9. If the program is provided with genotypes phased into haplotypes then instead it tests whether there is an excess of cases having two haplotypes each bearing one or more variant exceeding the weight threshold.

 

Running scoreassoc

 

Most of the arguments to the program are given in pairs. The first member of the pair, beginning --, specifies a parameter and the second member provides the value for that parameter. One pair of arguments provides a datafile and other optional arguments specify other aspects of the analysis. For PLINK/SEQ datafiles the parameter is --psdatafile:

 

scoreassoc --psdatafile genotypes.dat [options]

 

Parameter list:

 

--psdatafile file (PLINK/SEQ datafile)

--gendatafile file (IMPUTE datafile with genotype probabilities)

--gcdatafile file (GENECOUNTING datafile)

--samplefile file (sample file to match IMPUTE2 datafile)

--weightfactor x (weight for very rare variants, default 10)

--outfile file

--weightfile file (specify functional weights)

--annotfile file (annotations from plink/seq)

--filterfile file

--locusweightfile file (specify functional weight for each locus)

--locusnamefile file (provide name for each locus)

--locusfilterfile file (exclude specific loci)

--casefreqfile file (provide allele frequency for each locus in cases)

--contfreqfile file (provide allele frequency for each locus in controls)

--triofile file (subjects consist of trios)

--ldthreshold x (to discard variants in LD for recessive analysis, default 0.9)

--minweight x (to include in recessive analysis)

--dorecessive

--usehaps

--numloci x (needed with --gcdatafile)

--argfile file (additional arguments)

 

Example usage:

 

scoreassoc --psdatafile DRD2.dat --annotfile DRD2.annot --weightfile funcWeights.txt --filterfile DRD2exclusions.txt --outfile DRD2.psao --minweight 80 --ldthreshold 0.7 --dorecessive

 

Typical script to produce data and analyse one gene (use as e.g. bash runsa.sh DRD2):

 

#!/bin/bash

# runsa.sh - edit to set PLINK/SEQ project and phenotype then run with gene as argument, e.g. bash runpsa.sh DRD2

PROJECT=SSS

PHENO=scz

GENE=$1

pseq $PROJECT v-view --geno  --phenotype $PHENO --gene $GENE > $GENE.dat

pseq $PROJECT counts --annotate refseq --gene $GENE --phenotype $PHENO > $GENE.annot

scoreassoc –-psdatafile $GENE.dat --annotfile $GENE.annot --weightfile funcWeights.txt --filterfile exclusions.txt --outfile $GENE.psao --minweight 80 --ldthreshold 0.7 --dorecessive 1

 

PLINK/SEQ datafile format

 

The --psdatafile file consists of a list of variants and individual subject genotypes as produced by the pseq command v-view, e.g.:

 

pseq SSS v-view --geno  --phenotype scz --gene DRD2 > DRD2.dat

 

The first few lines of DRD2.dat appear as follows:

 

chr11:113281476  rs77930100   C/T .   1   PASS

28279    1   CASE C/C

28285    1   CASE C/C

28292    1   CASE C/C

28294    1   CASE C/C

28296    1   CASE C/C

 

IMPUTE2 datafile format

 

The --gendatafile file format is produced by the IMPUTE2 software and consists of a gen file containing sets of genotype probabilities. See https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#output_options.

 

If this format is used then the number of loci in the file must be specified using the --numloci parameter and there needs to be a corresponding –samplefile to provide case-control phenotypes.

 

GENECOUNTING datafile format

 

The --gcdatafile format may be easier to produce from some programs. The data consists of one row per subject with entries separated by spaces. The first entry is an arbitrary ID. This is followed by an affection status code consisting of 0 for controls and 1 for cases. Subsequent entries consist of pairs of alleles for each marker, numbered 1 or 2, with 0 0 denoting a missing genotype. The reference allele is 1 and any other allele must be coded as 2.

 

Example format for start of first few lines:

 

28279      1    1 2  1 1   1 1  1 1  1 1   1 1  . . .

28285      1    2 2  1 1   1 1  1 2  1 1   1 1  . . . 

28292      0    1 2  1 1   1 1  1 1  1 1   1 1  . . . 

28296      0    2 2  1 1   1 1  1 1  1 2   1 1  . . . 

28300      1    2 2  1 1   1 1  1 1  1 1   1 1  . . . 

28301      1    2 2  1 1   1 1  1 1  1 1   2 2  . . . 

 

If this format is used then the number of loci in the file must be specified using the --numloci parameter.

 

--weightfactor x

 

This option allows one to specify a value for the extra weight given to very rare variants (MAF approaches 0). The default value is 10. A parabolic function is used to gradually reduce this weight towards 1 as the MAF increases towards 0.5. A high value will give relatively more weight to rarer variants while a value of 1 will mean that no weighting is applied for frequency.

--outfile file

This option sends the output to file instead of to stdout.

--scorefile file

This option produces an additional output file containing the scores for each subject.

--weightfile file

 

This reads the weights for the function of each variant from the specified file. There must be a value for each possible function produced by the --annotate option of pseq. The file funcWeights.txt has the following values:

 

intergenic 1

intronic   3

silent     3

splice     5

readthrough 5

esplice 10

missense   10

frameshift 20

nonsense   20

 

These values can be changed or a different file can be used.

--annotfile file

 

This specifies the file containing the annotations produced by PLINK/SEQ. The first few lines appear as follows:

 

VAR  REF/ALT    MINOR CNTA  CNTU TOTA TOTU  FUNC     GENE

chr11:113281476 C/T  T     1    2    5090  5090     silent     NM_000795,NM_016574

chr11:113281503 G/A  A     1    3    5090  5090     silent     NM_000795,NM_016574

chr11:113281521 G/A  A     1    3    5090  5090     silent     NM_000795,NM_016574

chr11:113281524 G/A  A     1    0    5090  5090     silent     NM_000795,NM_016574

chr11:113281545 C/T  T     2    5    5090  5090     silent     NM_000795,NM_016574

chr11:113281554 G/A  A     1    0    5090  5090     silent     NM_000795,NM_016574

chr11:113281616 A/C  C     0    1    5090  5090     missense   NM_000795,NM_016574

chr11:113281641 G/A  A     0    1    5090  5090     splice     NM_000795,NM_000795,NM_016574,NM_016574

 

Both the annotations file and the weights file need to be specified in order for weights to be allocated according to effect rather than just frequency. There must be an entry for every locus in the datafile.

 

--dorecessive

 

This option specifies that an analysis for recessive effects should be performed to see if there is an excess of cases having two or more rare variants.

 

--ldthreshold x

 

This option allows one to specify the threshold to discard a variant from the recessive analysis if it is seen in the same subject as a different variant on more than a certain proportion of occasions. The default value for the LD threshold is 0.9.

 

--minweight x

 

This is the minimum overall weight a variant must have, obtained by multiplying the frequency weight by the functional weight, in order to be included in analyses for recessive effects. The default value is 0, meaning that all variants would be included. To include relatively rare variants producing changes in amino acid sequence one can use a value of 80.

 

--usehaps

 

If a recessive analysis is performed then this option indicates that the genotypes have been phased, with the first allele of each variant in the first haplotype and the second allele in the second haplotype. The program will then look for an excess of cases in which both haplotypes have at least one variant with weight exceeding minweight. If trios are used then this option can be used to specify that the transmitted alleles will be treated as phased haplotypes.

 

--locusweightfile file

 

This allows one to specify the functional weights to be used for each locus. It consists of a number of weights separated by white space (space, tab, new line) equal to the number of loci. If this option is used, these weights will override those derived from an annotation file.

 

--locusnamefile file

 

This allows one to specify the name to be given to each locus in the output file. It consists of a number of names separated by white space (space, tab, new line) equal to the number of loci.

 

--locusfilterfile file

 

This allows one to specify which loci are to be included in the analysis. It consists of 0s and 1s separated by white space (space, tab, new line) equal to the number of loci. Only the loci corresponding to the 1s will be used.

 

--contfreqfile file

 

If there is no genotype data for controls then this file can be used to provide allele frequencies obtained from an external source, such as 1000 genomes. It consists of a number of allele frequencies separated by white space (space, tab, new line) equal to the number of loci.

--casefreqfile file

 

If there is no genotype data for cases but one does have summary allele frequencies then this file can be used to provide them. It consists of a number of allele frequencies separated by white space (space, tab, new line) equal to the number of loci.

 

--filterfile file

 

This allows one to specify a file which contains a list of expressions, one for each line, which are evaluated for each variant and which eliminate the variant from all analyses if true. One can build up complex logical and arithmetic expressions and precedence rules mean that in the examples below the parentheses are optional. The symbols &, | and = mean AND, OR and IS EQUAL TO. For example, to keep only variants with MAF<=0.1 in cases or controls, one can have the following line:

 

freq(0)>0.1 & freq(1)>0.1

 

Thus, the filter removes variant whose MAF is greater than 0.1 in controls and in cases.

 

The following filter will remove variants which are not typed in more than 100 out of 2545 cases or controls:

 

nsub(0)<2545-100 | nsub(1)<2545-100

The full list of filter expressions is as follows:

freq(x)

frequency of the rarer allele in controls (x=0) or cases (x=1)

nsub(x)

number of controls (x=0) or cases (x=1) typed for this variant

weight(0)

The effect weight for this variant. Note this is different from the overall weight which can be used as a threshold to include variants in the recessive analyses. This effect value depends only on function and the exclusion is applied to the weighted burden test as well as to the recessive analyses. For example, to include only missense, splice site and more severe variants one would use:

weight(0) < 10

 

x genocount y

Number of controls (x=0) or cases (x=1) who have genotype AA (y=0), AB (y=1) or BB (y=2). This expression excludes variants where both homozygous counts are greater than the heterozygote count in both cases and controls:

(0 genocount 0) > (0 genocount 1) & (0 genocount 2) > (0 genocount 1) & (1 genocount 0) > (1 genocount 1) & (1 genocount 2) > (1 genocount 1)

 

altcount x

The number of B alleles (0, 1 or 2) possessed by subject x. Subjects are numbered from 1 in the order they are listed for the first variant in the datafile. To restrict attention to variants not seen in controls and shared between three subjects from the same family (here 143, 572, 93) one could use these two lines:

(0 genocount 1) > 0 | (0 genocount 2) > 0

(altcount 143)=0 | (altcount 572)=0 | (altcount 93)=0

 

--triofile file

 

This option means that the subjects consist of trios of affected subjects and their parents. The trio files consists of a list of IDs, three per line and separated by spaces, for proband, father, mother. If a subject does not appear in the file or has ID of 0 for the parents then this subject will be ignored. The CASE/CONTROL status in the main data file will be ignored. Instead, a dummy control subject is made with genotype consisting of the untransmitted alleles of both parents. If any of the three subjects has an unknown genotype the whole trio is ignored.

 

--argfile file

 

This file can contain arguments in the same format as the command line arguments. If desired, one --argfile can include another one so that they can be nested.

Output from scoreassoc

 

Here is the example output from the weighted burden test:

 

scoreassoc output

Locus                   controls     frequency        cases          frequency   frequency allele  weight

                     AA  :   AB  :  BB                  AA  :   AB  :  BB                    

chr11:113281476_sil   2543 :      2 :      0  0.000393    2544 :      1 :      0  0.000196  0.000295  2      49.95  chr11:113281476_silent_C/T

chr11:113281503_sil   2542 :      3 :      0  0.000589    2544 :      1 :      0  0.000196  0.000393  2      49.93  chr11:113281503_silent_G/A

chr11:113281521_sil   2542 :      3 :      0  0.000589    2544 :      1 :      0  0.000196  0.000393  2      49.93  chr11:113281521_silent_G/A

chr11:113281524_sil   2545 :      0 :      0  0.000000    2544 :      1 :      0  0.000196  0.000098  2      49.98  chr11:113281524_silent_G/A

chr11:113281545_sil   2540 :      5 :      0  0.000982    2543 :      2 :      0  0.000393  0.000688  2      49.88  chr11:113281545_silent_C/T

chr11:113281554_sil   2545 :      0 :      0  0.000000    2544 :      1 :      0  0.000196  0.000098  2      49.98  chr11:113281554_silent_G/A

chr11:113281616_mis   2544 :      1 :      0  0.000196    2545 :      0 :      0  0.000000  0.000098  2      99.96  chr11:113281616_missense_A/C

chr11:113281641_spl   2544 :      1 :      0  0.000196    2545 :      0 :      0  0.000000  0.000098  2      49.98  chr11:113281641_splice_G/A

chr11:113283337_mis   2544 :      1 :      0  0.000196    2544 :      1 :      0  0.000196  0.000196  2      99.93  chr11:113283337_missense_C/T

chr11:113283362_mis   2544 :      1 :      0  0.000196    2545 :      0 :      0  0.000000  0.000098  2      99.96  chr11:113283362_missense_G/A

chr11:113283429_sil   2545 :      0 :      0  0.000000    2544 :      1 :      0  0.000196  0.000098  2      49.98  chr11:113283429_silent_C/T

chr11:113283437_mis   2519 :     26 :      0  0.005108    2516 :     29 :      0  0.005697  0.005403  2      98.07  chr11:113283437_missense_T/C

chr11:113283450_sil   2541 :      4 :      0  0.000786    2541 :      4 :      0  0.000786  0.000786  2      49.86  chr11:113283450_silent_G/A

chr11:113283459_sil    579 :   1255 :    711  0.525933     568 :   1272 :    705  0.526916  0.473576  1       5.13  chr11:113283459_silent_G/A

chr11:113283476_mis   2544 :      1 :      0  0.000196    2544 :      1 :      0  0.000196  0.000196  2      99.93  chr11:113283476_missense_C/T

chr11:113283477_sil    238 :   1106 :   1201  0.689195     220 :   1104 :   1221  0.696660  0.307073  1      11.70  chr11:113283477_silent_A/G

chr11:113283484_mis   2457 :     87 :      1  0.017485    2457 :     87 :      1  0.017485  0.017485  2      93.82  chr11:113283484_missense_G/C

chr11:113283488_mis   2544 :      1 :      0  0.000196    2538 :      7 :      0  0.001375  0.000786  2      99.72  chr11:113283488_missense_G/A

chr11:113283638_int   2529 :      1 :      0  0.000198    2528 :      5 :      0  0.000987  0.000593  2      29.94  chr11:113283638_intronic_C/T

chr11:113283639_int   2516 :     14 :      0  0.002767    2526 :      8 :      0  0.001579  0.002172  2      29.77  chr11:113283639_intronic_G/A

chr11:113284066_int   2514 :      1 :      0  0.000199    2507 :      2 :      0  0.000399  0.000299  2      29.97  chr11:113284066_intronic_C/A

chr11:113284075_int   2513 :      1 :      0  0.000199    2512 :      1 :      1  0.000597  0.000398  2      29.96  chr11:113284075_intronic_G/A

chr11:113285066_int   2537 :      8 :      0  0.001572    2539 :      6 :      0  0.001179  0.001375  2      29.85  chr11:113285066_intronic_C/A

chr11:113285077_int   2543 :      2 :      0  0.000393    2544 :      1 :      0  0.000196  0.000295  2      29.97  chr11:113285077_intronic_T/C

chr11:113285166_sil   2542 :      3 :      0  0.000589    2541 :      4 :      0  0.000786  0.000688  2      49.88  chr11:113285166_silent_G/A,C

chr11:113286125_int   2499 :     39 :      7  0.010413    2503 :     32 :     10  0.010216  0.010314  2       9.63  chr11:113286125_intergenic_TG/TGG,T

chr11:113286189_mis   2544 :      1 :      0  0.000196    2545 :      0 :      0  0.000000  0.000098  2      99.96  chr11:113286189_missense_T/C

chr11:113286202_sil   2545 :      0 :      0  0.000000    2544 :      1 :      0  0.000196  0.000098  2      49.98  chr11:113286202_silent_G/T

chr11:113286268_mis   2545 :      0 :      0  0.000000    2544 :      1 :      0  0.000196  0.000098  2      99.96  chr11:113286268_missense_C/T

chr11:113286346_int   2543 :      1 :      0  0.000197    2544 :      1 :      0  0.000196  0.000197  2      29.98  chr11:113286346_intronic_G/C

chr11:113286370_int   2519 :     19 :      0  0.003743    2518 :     20 :      0  0.003940  0.003842  2      29.59  chr11:113286370_intronic_C/T

chr11:113287594_mis   2544 :      1 :      0  0.000196    2545 :      0 :      0  0.000000  0.000098  2      99.96  chr11:113287594_missense_T/C

chr11:113287657_mis   2544 :      1 :      0  0.000196    2543 :      2 :      0  0.000393  0.000295  2      99.89  chr11:113287657_missense_C/T

chr11:113287668_mis   2543 :      2 :      0  0.000393    2545 :      0 :      0  0.000000  0.000196  2      99.93  chr11:113287668_missense_C/T

chr11:113287691_sil   2539 :      6 :      0  0.001179    2539 :      6 :      0  0.001179  0.001179  2      49.79  chr11:113287691_silent_G/A

chr11:113287694_sil   2367 :    175 :      3  0.035560    2370 :    174 :      1  0.034578  0.035069  2      43.91  chr11:113287694_silent_C/T

chr11:113288730_int   2543 :      2 :      0  0.000393    2545 :      0 :      0  0.000000  0.000196  2      29.98  chr11:113288730_intronic_C/A

chr11:113288904_int   2541 :      4 :      0  0.000786    2543 :      2 :      0  0.000393  0.000589  2      29.94  chr11:113288904_intronic_T/C

chr11:113295073_int   2519 :     26 :      0  0.005108    2523 :     22 :      0  0.004322  0.004715  2       9.83  chr11:113295073_intergenic_CA/C

chr11:113295137_sil   2544 :      1 :      0  0.000196    2545 :      0 :      0  0.000000  0.000098  2      49.98  chr11:113295137_silent_G/A

chr11:113295145_mis   2544 :      1 :      0  0.000196    2545 :      0 :      0  0.000000  0.000098  2      99.96  chr11:113295145_missense_C/T

chr11:113295218_sil   2545 :      0 :      0  0.000000    2543 :      2 :      0  0.000393  0.000196  2      49.96  chr11:113295218_silent_G/A

chr11:113295221_sil   2539 :      6 :      0  0.001179    2537 :      8 :      0  0.001572  0.001375  2      49.75  chr11:113295221_silent_G/T

chr11:113295230_sil   2542 :      3 :      0  0.000589    2544 :      1 :      0  0.000196  0.000393  2      49.93  chr11:113295230_silent_G/A

chr11:113295304_mis   2543 :      0 :      0  0.000000    2543 :      1 :      0  0.000197  0.000098  2      99.96  chr11:113295304_missense_C/T

chr11:113295316_mis   2542 :      1 :      0  0.000197    2544 :      0 :      0  0.000000  0.000098  2      99.96  chr11:113295316_missense_G/A

chr11:113295346_mis   2542 :      1 :      0  0.000197    2543 :      1 :      0  0.000197  0.000197  2      99.93  chr11:113295346_missense_C/G

chr11:113295363_mis   2543 :      2 :      0  0.000393    2543 :      2 :      0  0.000393  0.000393  2      99.86  chr11:113295363_missense_A/C

             Controls  Cases    

N                 2545      2545

Mean score      21.775    21.521

SD              27.220    27.551

t (5088 df) = -0.330

p = 0.74114677

=    -0.13 (signed log10(p), positive if cases score higher than controls)

The program uses the variant names which were provided. For each variant this shows the genotype counts then frequency of the second allele in controls then in cases. Then the overall frequency of the second allele is given. Then the allele number of the rarer allele is given. In this example, it is always allele 2. However if for one variant allele 1 was rarer then the program would assign the score to allele 1 because it tests whether the rarer alleles are more frequent in cases. Finally, the weight assigned to the rarer allele is output. The overall score for each subject is the sum of the weights of each rare allele which that subject possesses.

To see how the weights work based on frequency, one can look at the variants at 113283450 and 113283459. They are both silent (synonymous) but the second one is commoner and so gets given a lower weight. The variant at 113283437 is missense (non-synonymous) and is assigned a higher weight for its effect. It is also rare. Thus it ends up with a higher overall weight.

The program reports the overall mean score among controls and cases and applies a t test to test whether the scores tend to be higher in cases than controls. In this example, the mean score in controls is in fact very slightly higher, going against the hypothesis that rare, functional variants occur more frequently in cases and producing a large, non-significant p value.

The overall result is conveniently expressed as a signed log p value (SLP). This is the log10 of the p value but it is given a positive sign if the mean score is higher in cases and controls and a negative value if the mean score is higher in controls.

If the option to perform a recessive test is specified then the output for that will follow:

Recessive analyses based on loci with weight reaching 80.000000

Using the following loci:

  1 chr11:113281616_mis

  2 chr11:113283337_mis

  3 chr11:113283362_mis

  4 chr11:113283437_mis

  5 chr11:113283476_mis

  6 chr11:113283484_mis

  7 chr11:113283488_mis

  8 chr11:113286189_mis

  9 chr11:113286268_mis

 10 chr11:113287594_mis

 11 chr11:113287657_mis

 12 chr11:113287668_mis

 13 chr11:113295145_mis

 14 chr11:113295304_mis

 15 chr11:113295316_mis

 16 chr11:113295346_mis

 17 chr11:113295363_mis

 

Subjects with two or more minor alleles:

313289     1   5-6-14-14-16-16

338891     1   6-6-14-14-16-16

367129     0   6-6-14-14-16-16

 

No loci need to be removed for being in LD

Will retain the following loci:

  1 chr11:113281616_mis

  2 chr11:113283337_mis

  3 chr11:113283362_mis

  4 chr11:113283437_mis

  5 chr11:113283476_mis

  6 chr11:113283484_mis

  7 chr11:113283488_mis

  8 chr11:113286189_mis

  9 chr11:113286268_mis

 10 chr11:113287594_mis

 11 chr11:113287657_mis

 12 chr11:113287668_mis

 13 chr11:113295145_mis

 14 chr11:113295304_mis

 15 chr11:113295316_mis

 16 chr11:113295346_mis

 17 chr11:113295363_mis

 

Subjects with two or more minor alleles:

313289     1   5-6-14-14-16-16

338891     1   6-6-14-14-16-16

367129     0   6-6-14-14-16-16

 

Genotype counts for cases and controls

 

(AB have one minor allele, BB have two or more minor alleles)

            AA      AB      BB  

Controls 2416.00  128.00    1.00

   Cases 2413.00  130.00    2.00

 

Observed:

2544.00    1.00

2543.00    2.00

 

Expected:

2543.50    1.50

2543.50    1.50

 

 

Recessive chi-squared = 0.333530, 1 df, p = 0.281794

-log(p) = 0.55

 

Expected probabilities for subjects to have total of 0, 1 or more minor alleles:

0.949133 0.049877 0.000990

 

Observed: 2543.00    2.00

Expected: 2542.48    2.52

 

Recessive HWE for cases exact test, p = 0.246588

-log(p) = 0.61

 

Observed: 2544.00    1.00

Expected: 2542.48    2.52

 

Recessive HWE for controls exact test, p = 0.246588

-log(p) = 0.61

 

Counts for cases and controls homozygous for at least one variant

          NotHOM    HOM  

Controls 2544.00    1.00

   Cases 2544.00    1.00

 

Expected:

Controls 2544.00    1.00

   Cases 2544.00    1.00

 

 

Recessive chi-squared for homozygotes = 0.000000, 1 df, p = 0.500000

-log(p) = 0.30

 

Observed: 2544.00    1.00

Expected: 2544.15    0.86

 

Recessive HWE for true homozygote cases exact test, p = 0.211126

-log(p) = 0.68

 

Observed: 2544.00    1.00

Expected: 2544.15    0.86

 

Recessive HWE for true homozygote controls exact test, p = 0.211126

-log(p) = 0.68

 

NB Any loci meeting the following exclusion criteria were not included in the above analysis:

freq(0)=0 & freq(1)=0

nsub(0)<2545-100 | nsub(1)<2545-100

(0 genocount 0) > (0 genocount 1) & (0 genocount 2) > (0 genocount 1) & (1 genocount 0) > (1 genocount 1) & (1 genocount 2) > (1 genocount 1)

 

The output for recessive analysis first consists of a numbered list of all the variants which are included, i.e. which meet the weight criterion. This is followed by a list of all subjects which carry the minor allele of two or more variants, with the variants being listed for each subject according to the variant numbers. The ID of each subject is given, followed by 0 for a control or 1 for a case, followed by a list of the variants for that subject.

 

Next there is a report of which variants are excluded because they appear to be in LD with others, in this example those which occur on 90% or more occasions with another. None meet this criterion so none is excluded.

 

Then again there is a list of all variants being considered followed by a list of all subjects having two or more minor alleles.

 

The genotype counts are given for cases and controls. AA means the subject has no minor allele, AB means 1 and BB means 2 or more.

 

The first chi-squared test is whether the proportion of cases with two or more minor alleles (BB) is higher than the proportion of controls.

 

The next test is whether cases are more often BB than would be expected under HWE.

 

The next test is whether controls are more often BB than would be expected under HWE.

 

The next test is whether more cases than controls are homozygous for at least one minor allele (i.e. true homozygotes rather than compound heterozygotes).

 

The next test is whether more cases are homozygous for at least one minor allele than would be expected under HWE.

 

The next test is whether more controls are homozygous for at least one minor allele than would be expected under HWE.

 

To support the existence of a recessive effect, the tests in cases should be positive and the tests in controls should not be and the pattern of distribution of variant alleles should not suggest that an excess of subjects with two or more minor alleles is due to LD rather than compound heterozygotes. Thus, if one keeps seeing the same pair recurring one may suspect that the variants are in LD and that the alleles are in cis rather than forming compound heterozygotes in trans.

 

If the use_haplotypes option has been specified to indicate that the genotypes are phased, the output for the recessive analyses will instead appear like this:

 

Recessive analyses based on loci with weight reaching 80.000000

Using the following loci:

  1 113281616-NON_SYNON

  2 113283337-NON_SYNON

  3 113283362-NON_SYNON

  4 113283437-NON_SYNON

  5 113283476-NON_SYNON

  6 113283484-NON_SYNON

  7 113283488-NON_SYNON

  8 113286189-NON_SYNON

  9 113286268-NON_SYNON

 10 113287594-NON_SYNON

 11 113287657-NON_SYNON

 12 113287668-NON_SYNON

 13 113295145-NON_SYNON

 14 113295304-NON_SYNON

 15 113295316-NON_SYNON

 16 113295346-NON_SYNON

 17 113295363-NON_SYNON

 

Subjects with minor alleles in different haplotypes:

313289     1

1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1

338891     1

1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1

367129     0

1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1

1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1

 

Counts of haplotypes bearing a minor allele for cases and controls

 

(AB have at least one minor allele on one haplotype, BB have minor alleles on both haplotypes)

            AA      AB      BB  

Controls 2416.00  128.00    1.00

   Cases 2414.00  129.00    2.00

 

Observed:

2544.00    1.00

2543.00    2.00

 

Expected:

2543.50    1.50

2543.50    1.50

 

 

Recessive chi-squared = 0.333530, 1 df, p = 0.281794

-log(p) = 0.55

 

Expected probabilities for subjects to have total of 0, 1 or 2 hapltypes bearing a minor allele:

0.948997 0.050335 0.000667

 

Observed: 2543.00    2.00

Expected: 2543.30    1.70

 

Recessive HWE for cases exact test, p = 0.242394

-log(p) = 0.62

 

Observed: 2544.00    1.00

Expected: 2543.30    1.70

 

Recessive HWE for controls exact test, p = 0.242394

-log(p) = 0.62

 

As before, the variants meeting the minimum weight threshold are listed. Then are listed all subjects for which both haplotypes possess the minor allele for at least one of these variants and the haplotypes are shown.

 

The first test is whether more cases than controls have both haplotypes with at least one minor allele.

 

The next test is whether the number of cases with both haplotypes having a minor allele exceeds HWE.

 

The final test is whether the number of controls with both haplotypes having a minor allele exceeds HWE. To support the existence of a recessive effect one would want to see deviation from HWE in cases but not in controls.

 

It should be said that at time of writing the methods to detect recessive effects by deviation from HWE do not seem robust. The comparisons between cases and controls should be more reliable and it can be helpful to inspect the pattern of genotypes driving the results to see if it might plausibly represent a true effect.

 

pathwayAssoc

pathwayAssoc is a program which performs the same weighted burden analysis as scoreassoc but which is designed to be applied to groups of genes. It tests whether a group of genes has an excess of rare, functional variants in cases compared with controls, using the same approach as described in this paper:

A rapid method for combined analysis of common and rare variants at the level of a region, gene, or pathway. D Curtis. Advances and applications in bioinformatics and chemistry: AABC 5, 1. 2012.

Rather than using the genotypes for all variants in for these genes, the pathwayAssoc takes the gene-wise scores for each subject as output by scoreassoc and sums these across genes. Then it tests to see whether cases on average have higher scores than controls. The effect of this is exactly the same as would be obtained if all the variants for all genes in the pathway were input to scoreassoc but it is more convenient to handle the data in this way, especially when multiple pathways are tested and the same gene can be a member of more than one pathway. The approach is described in this paper, which should also be cited:

 

Pathway analysis of whole exome sequence data provides further support for the involvement of histone modification in the aetiology of schizophrenia. D Curtis. Psychiatr Genet. 2016.

 

It should be noted that pathwayAssoc tests whether there is an excess of rare, functional variants among cases rather than controls within a group of genes. It does not test for “clustering” of the effect into this group, in that it does not test that the observed excess is higher than would be expected given that there may be an overall excess of genes which have more rare, functional variants in cases rather than controls. On the other hand, it differs from some pathway analyses in that it is not biased to include pathways with many genes nor genes with many variants because the test is symmetrical and by chance variants can occur more commonly in controls rather than cases.

 

Before you run pathwayAssoc you must run scoreassoc on all genes, making sure you use the --scorefile option so that scores are output for all subjects. If a gene has no usable variants then a score file will not be output and pathwayAssoc will report that it was missing. If something goes wrong with the system then scoreassoc may write an incomplete file. The sizes for all the score files should be identical, so you can check if there are any small files. Under Unix you can do this by entering:

 

ln –lS *.sco

 

This will list all 21000 files with the small ones at the end.

Running pathwayAssoc

 

All of the arguments to the program are given in pairs. The first member of the pair, beginning --, specifies a parameter and the second member provides the value for that parameter. The pairs of arguments can be conveniently contained in an argument file,  which itself can be specified as the --arg-file parameter.

 

Parameter list:

 

--arg-file file

 

Specifies the file to read arguments from. These files can be chained but not nested, i.e. one file can read in further parameters from another file.

 

--pathway-file file

 

Provides the name of the file providing the pathways and sets of genes they include. The format is that there is one line per pathway. The first word of the line is the pathway name. The second word is ignored. All subsequent words are the genes making up the pathway. A suitable format is produced by downloading c5.all.v5.0.symbols.gmt from the Molecular Signatures Database at http://www.broadinstitute.org/gsea/msigdb/collections.jsp. However it is easy to produce a file specifying any sets of genes which one might be interested in.

 

--score-file-spec file

 

Provides the specification for the score files output by scoreassoc. This should take the form something.GENE.something.sco. to match the name of the score files. The pathwayAssoc will replace GENE with the name of each gene to read the relevant files.

 

--output-file-spec file

 

Provides the specification for the output files to be produced, one for each pathway. This should take the form something.PATHWAY.something.pao. Then pathwayAssoc

will replace PATHWAY with the name of each pathway.

 

--summary-file file

 

Provides the name for the summary output file to be produced.

 

--gene-level-output-threshold SLP

 

Any individual genes which produce a p value such that the SLP is higher than the threshold will be output along with this value. Setting this threshold will show the genes making the biggest individual contributions to the overall test statistic for the pathway. An SLP of 1.6 corresponds approximately to a gene-wise p value of 0.05 (0.025 one-tailed) for a significant excess of rare, functional variants among cases. However it may be there are many genes each individually making small contributions and then they will not be output. Setting the threshold to 0 would output all genes in which there was even a slight excess and setting it to -99 would output all genes. This threshold only applies to which genes are output along with their SLPs – all the genes are included in the analysis.

 

Example argument file for pathwayAssoc

 

An example argument file called SSS.ct08.rare.parg is provided and appears as follows:

 

--pathway-file ../reference/c5.all.v5.0.symbols.gmt

--score-file-spec ../results/SSS.ct08.rare.GENE.sco

--output-file-spec ../pathways/SSS.ct08.rare.PATHWAY.pao

--summary-file ../pathways/SSS.ct08.rare.summ.pao

--gene-level-output-threshold 1.6

 

To run it, one would enter:

 

pathwayAssoc --arg-file SSS.ct08.rare.parg

 

Output from pathwayAssoc

 

A typical output file from pathwayAssoc appears like this:

 

PROTEIN_N_TERMINUS_BINDING

http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_N_TERMINUS_BINDING

 

             Controls  Cases    

N                 2545      2545

Mean score     156.947   166.599

t (5088 df) =  3.037

p = 0.00240291

SLP =     2.62 (signed log10(p), positive if cases score higher than controls)

 

 

SLPs for individual genes:

NBN     0.72

HAX1     0.11

EIF5A     0.00

KCNIP2     0.08

DAXX    -0.02

CSNK2A2     0.13

PGR    -0.27

ERCC5     0.46

ERCC6    -0.60

CSNK2A1    -0.05

PEX14     0.34

NR1H4    -0.06

GLRX    -0.15

ERCC2     0.16

NFE2     0.61

RELA    -0.17

SLA2     0.85

TP53     1.60

ESR1    -0.49

APTX    -0.32

GTF2H3    -0.56

VWF     1.09

MNAT1     0.32

NCOA1     1.23

SMARCE1     1.07

TSC1    -0.60

NCOA3     2.19

ZWINT     0.11

SMARCC1     1.22

CALM3     0.00

SDCBP     0.80

PARP1    -0.59

SMARCA4     1.26

 

 

List of genes for which no score file was found:

ERCC3

ERCC4

DCTN4

GTF2H2

ATM

 

The first two lines consist of the first two words of the input pathway file.

 

Next are provided the average scores, having been summed across all genes, for cases and controls and the results of a t test comparing them.

 

Next is a list of genes whose individual SLPs exceed a supplied threshold (for this example a negative threshold has been provided.

 

Finally there is a list of genes for which no score file was found. These might be genes which do not contain any variant or which were missing for some other reason.

 

permPathwayAssoc

 

The individual p values for each pathway tested by pathwayAssoc are correct. However if multiple pathways are tested it can be difficult to know how many to expect to reach a given p value by chance as the same gene can occur in different pathways and hence the pathway p values are not independent of each other. permPathwAssoc uses permutation testing so that empirical p values can be obtained. The permutation testing simply consists of calculating the SLP for a pathway in the same way as does pathwayAssoc and then repeating the process permuting case and control labels. The approach is described in this paper, which should be cited:

 

Pathway analysis of whole exome sequence data provides further support for the involvement of histone modification in the aetiology of schizophrenia. D Curtis. Psychiatr Genet. 2016.

 

Running permPathwayAssoc

 

All of the arguments to the program are given in pairs. The first member of the pair, beginning --, specifies a parameter and the second member provides the value for that parameter. The pairs of arguments can be conveniently contained in an argument file,  which itself can be specified as the --arg-file parameter.

 

Parameter list:

 

--arg-file file

 

Specifies the file to read arguments from. These files can be chained but not nested, i.e. one file can read in further parameters from another file.

 

--pathway-file file

 

Provides the name of the file providing the pathways and sets of genes they include, in the format as described above.

 

--score-file-spec file

 

Provides the specification for the score files output by scoreassoc as described above.

 

--SLP-file file

 

Provides the name for the verbose output file to which the SLPs produced from the real and simulated datasets will be written.

 

--output-file file

 

Provides the name for the output file to which the empirical p values will be written. These will be provided for the SLP for each pathway, the maximum SLP obtained and for the number of pathways which achieve SLPs above thresholds ranging from 1 to 5.

 

--summary-file file

 

Specifies the output file to which is written the empirical p value for each of the top pathways and the p value derived from the t test, which is used to produce the SLP.

 

--pathway-threshold SLP

 

Specify the threshold for the SLP which a pathway must exceed if it is to be included in the summary file.

 

--nPerms N

 

Specify the number of permutations to perform.

 

Example argument file for pathwayAssoc

 

An example argument file called SSS.ct08.rare.perm.parg is provided and appears as follows:

 

--score-file-spec /home/rejudcu/SSS.ct08.rare.GENE.sco

--pathway-file /home/rejudcu/reference/c5.all.v5.0.symbols.gmt

--pathway-threshold 1.735

--SLP-file /home/rejudcu/tmp/SSS.ct08.rare.SLPs.pao

--output-file /home/rejudcu/tmp/SSS.ct08.rare.perm.pao

--summary-file /home/rejudcu/tmp/SSS.ct08.rare.summ.perm.pao

--nPerms 999

 

To run it, one would enter:

 

permPathwayAssoc --arg-file SSS.ct08.rare.perm.parg

 

This indicates that only pathways with an SLP greater than 1.73 should be included in the summary file. 999 permutations are performed, which means that if a pathway produces an SLP which is not exceeded by any of the permuted datasets then it will have an empirical p value of 0.001, as described here:

 

A Note on the Calculation of Empirical P Values from Monte Carlo Procedures. BV North, D Curtis, P C Sham. Am J Hum Genet. 2003 7: 498–499.

 

Output from permPathwayAssoc

 

A typical summary output file from permPathwayAssoc begins like this:

 

Pathway       SLP    t_test_p      empirical_p

PROTEIN_SERINE_THREONINE_PHOSPHATASE_COMPLEX    2.64   0.211090      0.100000

COATED_VESICLE      1.84   0.132168      0.100000

CLATHRIN_COATED_VESICLE    2.41   0.190709      0.100000

 

This shows the pathway, the SLP produced by the real data, the one-sided , the one-sided p value associated with that SLP and the one-sided empirical p value calculated as (nOver_1)/(nPerms+1), where nOver is the number of times a permuted dataset produces that SLP or one higher in nPerms permutations.

 

The SLP file shows the SLP for each pathway produced by the real dataset followed by the maximum of these SLPs and then the number of SLPs which exceeded 1, 2, 3, 4 or 5. Then follow lines giving these values for each permuted dataset.

 

The output file contains the empirical p values for each of the results obtained for the real dataset, calculated according to the formula above. This means that one can obtain an empirical p value for achieving the observed maximum SLP and also for the number of SLPs above each threshold. For example, if 14 pathways achieve an SLP of over 3 but in only 2 out of 999 permutations does one observe 14 or more pathways exceeding 3, then one can say that the empirical p value to observe 14 or more pathways exceeding 3 is (2+1)/(999+1)=0.003.

Contact details

 

I will aim to keep up to date versions of this distribution at http://www.davecurtis.net/software/dcurtis/scoreassoc.html

 

A live version will be at this git repository: https://github.com/davenomiddlenamecurtis/scoreassoc

 

Please read the relevant publications and cite them if you use this software. Please feel free to contact me with any comments or queries.

 

David Curtis (d.curtis@ucl.ac.uk)