Archive

Using NHST in your hypothesis testing

Hypothesis testing using statistics usually proceeds via something along these lines:

  1. Collect some data, say on the response of some variable in two different groups to a treatment and a control.
  2. State a null and an alternative hypothesis. The null hypothesis is something you want to falsify and is typically a nil hypothesis, i.e. one that hypothesizes no effect. In this case we might have H_0: \mu_T - \mu_C = 0, H_1: |\mu_T - \mu_C| > 0. The alternative hypothesis is usually the negation of the null hypothesis.
  3. Calculate the test statistic, t. In our case this is the difference of the two group mean responses divided by its pooled standard error, t = (\bar{X}_T - \bar{X}_C) / (s_p\sqrt{2/n}), where s_p^2 is the pooled sample variance and n is the number of observations in each group.
  4. Calculate the probability of observing t or a more extreme value under the distribution implied by the null hypothesis: this is the p-value.
  5. If the p-value is less than some pre-set significance level, e.g. \alpha = 0.05, then we reject the null hypothesis in favour of the alternative. If p > \alpha then we say we fail to reject the null.

The hypothesis test may be on group means, on regression coefficients or on some other quantity. All that is needed is the sampling distribution of the statistic in question under the null hypothesis. In our example, the difference of the sample means divided by its pooled standard error follows a Student's t distribution.
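To make the recipe concrete, here is a minimal sketch of steps 1–5 in Python. The data are simulated, and the group sizes, means and the use of scipy.stats.ttest_ind are illustrative assumptions rather than part of the original example.

```python
# A minimal sketch of the NHST recipe above; the data are simulated purely
# for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# 1. "Collect" data: responses in a treatment and a control group.
treatment = rng.normal(loc=0.3, scale=1.0, size=50)
control = rng.normal(loc=0.0, scale=1.0, size=50)

# 2. Null hypothesis: mu_T - mu_C = 0; alternative: |mu_T - mu_C| > 0.
# 3. & 4. The pooled-variance t statistic and its two-sided p-value.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=True)

# 5. Compare against a pre-set significance level (the step this post questions).
alpha = 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {p_value < alpha}")
```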

This process is called Null Hypothesis Significance Testing (NHST) and, in a nutshell, it is the process of accepting or rejecting a null hypothesis based on a p value being greater or less than a threshold (the significance level).

This post is about using NHST in your hypothesis testing. In particular, it explains:

  1. the history of NHST (briefly),
  2. why you should in most cases not use NHST,
  3. when you should use NHST,
  4. what you should do instead of NHST.

A brief history of NHST

This history and the references therein are largely a synopsis of some of the points raised in The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? by Lehmann.

Modern (frequentist) hypothesis testing arose out of the work of Fisher and of Neyman & Pearson (N&P). The two are logically distinct but NHST has elements of both and so I’ll describe them both here.

Fisher originally proposed stating a null hypothesis and calculating p values as an index of the strength of evidence against the null hypothesis. He suggested 5% and 1% as p values below which the evidence counts strongly against the null hypothesis (Fisher, 1946, Statistical Methods for Research Workers):

If P is between .1 and .9, there is certainly no reason to suspect the hypothesis tested. If it is below .02, it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05…

However, he later rejected the need for standard threshold values for assessing significance (Fisher, 1956, Statistical Methods and Scientific Inference):

no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects [null] hypotheses; he rather gives his mind to each particular case in the light of his evidence and ideas.

Fisher also believed that one should use p values as a method for drawing conclusions about the experimental data, rather than for making decisions by accepting and rejecting hypotheses (Fisher, 1973, Statistical Methods and Scientific Inference):

The conclusions drawn from such tests constitute the steps by which the research worker gains a better understanding of his experimental material, and of the problems which it presents.

… More recently, indeed, a considerable body of doctrine has attempted to explain, or rather to reinterpret, these tests on the basis of quite a different model, namely as means to making decisions in an acceptance procedure.

Neyman and Pearson took the idea of cut-offs and formed a statistical decision-making process. They advocated controlling type I errors (falsely rejecting the null hypothesis) by using a significance level, \alpha: rejecting the null when p < \alpha maintains the false rejection rate at \alpha. They also suggested controlling type II errors (falsely accepting the null hypothesis, at rate \beta) through the concept of statistical power, 1 - \beta. In order to do this a specific alternative hypothesis (e.g. H_1: \mu_T - \mu_C = 1) must be stated and sample sizes calculated to maintain a type II error rate of \beta. However, N&P advocated leaving the balance between the control of these two types of error to the experimenter, rather than having a universal cut-off.
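As a rough illustration of the N&P approach of fixing \alpha and \beta in advance and sizing the experiment accordingly, here is a hedged sketch in Python; the effect size, \alpha and power values are arbitrary choices for illustration, and statsmodels' TTestIndPower is just one tool that performs this kind of calculation.

```python
# Sketch of a Neyman-Pearson style design calculation: fix the type I error
# rate (alpha) and type II error rate (beta), state a specific alternative
# (here a standardised effect size of 0.5), then solve for the sample size
# per group. All numbers are illustrative assumptions.
from statsmodels.stats.power import TTestIndPower

alpha = 0.05        # type I error rate
power = 0.80        # 1 - beta, so beta = 0.20
effect_size = 0.5   # specific alternative, e.g. (mu_T - mu_C) / sigma

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Required sample size per group: {n_per_group:.1f}")
```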

In summary then, Fisher and N&P both used p-values to test hypotheses. In the end Fisher advocated p values as a continuous index of the evidence that a specific set of experimental data provides against a null hypothesis. N&P dichotomised the p values and introduced specific alternative hypotheses in a decision-making framework which attempted to control type I and type II errors.

Over time these two approaches have blended into the NHST described in the introduction. Predominantly, NHST is based on the N&P paradigm. But while N&P advocated selecting significance and power levels based on scientific expediency, modern NHST has adopted Fisher's early idea of having levels of significance (and power) dictated by convention.

Why you (probably) shouldn’t use NHST

This section is largely taken from Abandon Statistical Significance by McShane, Gal, Gelman et al., plus some of my own thoughts. NHST has received far too much criticism to summarise here, so I shall just summarise those points which will have the most resonance with PhD students.

There are broadly two types of argument against NHST. The first type are those which result from features of the method which are in themselves poor. The second type are those which are by-products of the method and result in undesirable outcomes. With the exception of bias in point 3 below, the criticisms presented here are of the first type. For more of both types of criticism, especially the second, see the relevant Wikipedia page.

  1. The null hypothesis is not (very) realistic.
  2. Statistical tests are only part of the evidence for or against a hypothesis.
  3. Failing to publish “non-significant” results biases the literature and hurts you (fewer publications).
  4. Thresholds for deciding what’s true or not don’t make sense.
  5. Accepting or rejecting hypotheses is not what your publication is about.

Let’s expand on these points.

  1. A null hypothesis of zero effect is not realistic because of the presence of systematic error (which you may or may not be aware of) due to, amongst other things, measurement error, hidden confounders, failure to randomise etc. It’s also unrealistic because zero effects in the biomedical, social and clinical sciences are themselves unrealistic, e.g. no matter how you select your participants, you are unlikely to be measuring a homogeneous group.
  2. Gelman notes that p values have taken the place of discussion of other neglected factors such as prior evidence. See the recommendations at the end for a larger list of these factors.
  3. Work which doesn’t attain significance goes unpublished, which is both demoralising and means your publication record fails to document your work as a researcher. In addition, publishing only significant results leads to a literature with biased estimates of effects. This issue is subtle and complicated and I’ll address it in another post.
  4. The concept of accepting or rejecting hypotheses based on a sharp threshold doesn’t make much sense when thinking about scientific hypotheses. In reality p = 0.049 and p = 0.051 represent the same strength of evidence against the null hypothesis, but under NHST, these two results point to opposite conclusions.
  5. A single study cannot decide on the truth or falsity of a particular hypothesis so it doesn’t make sense to frame your results that way.

When you should use NHST

Criticisms aside, there are times when you might want to use NHST, such as for screening and quality control purposes. For example, you may want to identify genes which show an association with an effect. Typically you would consider many thousands of genes, and those which do show an effect might be candidates for further research. In this case hard cut-offs for deciding which genes get studied may well be the only practical solution. Another example would be quality control for industrial processes. The “hypothesis” being tested in this case is not a scientific one but one that only has bearing on whether to keep or reject a manufactured widget.

What you should do instead of NHST

The take-away message of this post is to avoid, as Gelman puts it, uncertainty laundering – that is, turning data into true/false pronouncements on your hypothesis. So what should you do? While there is not going to be a template applicable to all scientific areas, here are some recommendations:

  1. You should calculate p values and report their actual values, not just whether they are significant or non-significant. Realise that p values are contingent on some very restrictive assumptions (zero effect size, zero systematic error etc.) and that a small p value indicates a problem with at least one of those assumptions, not just the null hypothesis.
  2. Present estimated effect sizes and confidence intervals or standard errors alongside p values (a brief sketch of this is given after this list).
  3. Include descriptive statistics and informative visual displays.
  4. Promote discussion of the other aspects of the work, the neglected factors, e.g. systematic errors, prior evidence, plausibility of mechanism, study design and data collection, plus any other domain-specific issues.
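As an illustration of recommendations 1–3, the sketch below reports an estimated difference in means with a 95% confidence interval and the exact p value, rather than a significant/non-significant verdict. The data and the pooled-variance interval are illustrative assumptions rather than a prescription.

```python
# Sketch: report an effect estimate, a confidence interval and the exact
# p value instead of a significance verdict. Data simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
treatment = rng.normal(0.3, 1.0, 50)
control = rng.normal(0.0, 1.0, 50)

n_t, n_c = len(treatment), len(control)
diff = treatment.mean() - control.mean()

# Pooled-variance standard error of the difference and a 95% t interval.
sp2 = ((n_t - 1) * treatment.var(ddof=1) + (n_c - 1) * control.var(ddof=1)) / (n_t + n_c - 2)
se = np.sqrt(sp2 * (1 / n_t + 1 / n_c))
ci = diff + np.array([-1, 1]) * stats.t.ppf(0.975, n_t + n_c - 2) * se

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=True)
print(f"difference = {diff:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], p = {p_value:.3f}")
```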

Conclusion

At the heart of NHST is the dichotomization of p values which turns data into true/false statements about a hypothesis. This is bad practice primarily because it is both illogical and demotes other pieces of evidence (the neglected factors). Instead, p values should form only part of the statistical and non-statistical evidence for or against a hypothesis.

Blog piece written by Robert Arbon, Data Scientist at the Jean Golding Institute.
For help with any data queries, please get in touch: ask-jgi@bristol.ac.uk

European Study Group with Industry: linking data science and mathematical modelling

European Study Group with Industry – July 2018

Across a five-day workshop in July 2018, the 138th European Study Group with Industry brought together mathematicians and industrialists to work side by side to solve the real and important issues that companies are facing today.

GW4 Seed Corn Funding enabled the free attendance of a non-profit organisation: NHS Digital.

Their problem focused on the development of automatic classifiers of dialogue between a member of the public calling an NHS helpline and the call handler. Clinicians report that when callers feel less anxious, they are more likely to listen to the advice given, and follow it. A simple method for classifying the distress level of a call as `good’ or `bad’ could be used to provide real-time feedback to call handlers.

This was one of the most popular problems among study group participants and we had a lively group working on it for the whole week.

Due to data protection, actual NHS helpline phone calls were not available for analysis by study group participants. Instead, we were provided with transcripts from radio programs consisting of heated political interviews (representing `bad’ calls) and relaxed conversational discussion (representing `good’ calls). Our initial data exploration revealed significant differences between these two categories of conversation.

We used the data provided to explore various `features’ that, applied to a conversation, might indicate which category it was in: number of words in a turn, duration of a turn, inter-speaker gap length, inter-word gap length, speaking rate (words/second), turn-taking rate (turns/minute), interruptive behaviour, and mimicry. Any of these features could be relevant when assessing real NHS call data.
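To give a flavour of what such features might look like in code, here is a hypothetical Python sketch that assumes a transcript is available as a list of (speaker, start_seconds, end_seconds, text) turns; the data format and feature definitions are assumptions for illustration, not the study group's actual implementation.

```python
# Hypothetical sketch: simple turn-level features from a transcript given as
# (speaker, start_seconds, end_seconds, text) tuples. Illustrative only.
def turn_features(transcript):
    features = []
    for i, (speaker, start, end, text) in enumerate(transcript):
        n_words = len(text.split())
        duration = end - start
        # Gap since the previous turn ended (negative would indicate overlap).
        gap = start - transcript[i - 1][2] if i > 0 else 0.0
        features.append({
            "speaker": speaker,
            "n_words": n_words,
            "duration": duration,
            "inter_speaker_gap": gap,
            "speaking_rate": n_words / duration if duration > 0 else 0.0,
        })
    return features

example = [("caller", 0.0, 4.2, "hello I was hoping for some advice"),
           ("handler", 4.5, 9.0, "of course what seems to be the problem")]
print(turn_features(example))
```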

We were able to classify the conversations as `good’ or `bad’ using a number of approaches trained on a subset of the data and tested on a separate subset: a Bayesian updating scheme; a hidden Markov model; and a classifier based on the Kolmogorov–Smirnov (K-S) statistical test. In addition, we had success with a scorecard method that characterises the trajectory of a dialogue as a sequence of coordinates in the feature space. All the methods were generally able to distinguish between the `good’ and `bad’ conversations. The Bayesian method achieved correct classification noticeably faster than the hidden Markov method and the K-S test classifier. Whilst less rigorously tested, the scorecard method showed great promise, with classification achieved relatively quickly.
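To indicate the kind of comparison a K-S based classifier might make, here is a hypothetical sketch using scipy.stats.ks_2samp on a single feature (turn durations); the reference distributions and the nearest-distribution decision rule are illustrative assumptions, not the study group's code.

```python
# Hypothetical sketch of a K-S style comparison: measure how far a new
# conversation's turn-duration distribution is from reference 'good' and
# 'bad' conversations, and label it by whichever is closer. Illustrative only.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
good_reference = rng.gamma(shape=4.0, scale=1.5, size=500)   # e.g. longer, calmer turns
bad_reference = rng.gamma(shape=2.0, scale=1.0, size=500)    # e.g. shorter, choppier turns
new_conversation = rng.gamma(shape=3.8, scale=1.4, size=60)

d_good, _ = ks_2samp(new_conversation, good_reference)
d_bad, _ = ks_2samp(new_conversation, bad_reference)
label = "good" if d_good < d_bad else "bad"
print(f"K-S distance to good = {d_good:.3f}, to bad = {d_bad:.3f} -> classified as {label}")
```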

The methods outlined here, including the determination of useful features, now need to be repeated on real NHS call data which has been classified according to call outcome, where the differences between call types may be more subtle. These findings demonstrate great potential for development of early intervention to influence the outcome of a dialogue.

Because this was such a popular problem to work on, our contact at NHS Digital got to meet a number of people who may be able to help solve other problems. For example, they are currently in discussions with one participant about potentially working on a resource planning project for NHS trusts.

Blog written by Dr Lorna Wilson, Institute for Mathematical Innovation, University of Bath.

The European Study Group was held at the University of Bath, organised by the Institute for Mathematical Innovation (IMI) in collaboration with the Engineering Mathematics Department at the Faculty of Engineering at the University of Bristol. The event was sponsored by IMI, GW4 and the Jean Golding Institute at the University of Bristol.

Data visualisation working group – 13 September 2018

The data visualisation working group is an informal gathering hosted at the University of Bristol to discuss new ideas and data vis methodologies; the group meets monthly. If you are interested in participating, sign up to the mailing list by sending a blank email to sympa@sympa.bristol.ac.uk with the title “subscribe data-visualisation-group” (remember to remove any email signature from the body of the message). The group is also on Yammer (DataViz Working Group) and there is a group wiki.

Below is a summary of the meeting on 13 September 2018:

Thanks to everyone who came along to last week’s Data Visualisation Working Group, and thanks for the really interesting discussion of how visual perception and optical illusions interact with data vis. For those who weren’t able to make it, here’s the link to the TED talk by Beau Lotto that we discussed.

Related to this, we highlighted artist Luke Jerram’s current exhibition “The Impossible Garden” in the University Botanic Garden, a set of experimental sculptures exploring visual phenomena, open until 25th November. Luke will also be delivering this year’s Richard Gregory Memorial Lecture at 6pm on Tuesday 13th November, titled “Exploring the Edges of Perception”.

Thanks too to our presenters. This year we’re introducing a new show-and-tell format, with three-minute lightning presentations. These can be about anything you like, such as a visualisation you’ve seen and liked, a tool you’ve used, a project you’ve been working on or a problem you’d like help with. The idea is that these are low pressure and should require minimal preparation. We think it worked really well, so do let us know if there’s something you’d like to share at a future meeting. Having said that, there’s no obligation to present, so if you’d like to just come along and watch, that’s fine too.

Harriet also presented a data visualisation problem she was recently approached with, along with her suggested solution.

We also discussed an upcoming opportunity to influence the data visualisation infrastructure available at the University’s new Temple Quarter site. Do let us know if you’ve seen an exciting data visualisation set-up elsewhere that you’d like to bring to Bristol.

Finally, Polly announced a Tableau workshop she’s arranged for 7 November 2018 10.00-12.00 in the Seminar Room, Beacon House.

Oliver, Harriet, Polly and Nat

Whose Culture? Data Researcher (volunteering opportunity) with Rising Arts Agency

Context

Bristol boasts a strong creative and cultural hub. For young people of colour, however, it can be difficult to be in and enjoy cultural spaces where you don’t see yourself represented. For local arts organisations, next-to-no data exists on people of colour’s cultural engagement, meaning there’s a lack of evidence to guide their engagement work.

Rising Arts Agency is a micro-social enterprise whose mission is to nurture more diverse participation, staffing and leadership across Bristol’s creative sector by providing young creatives (16-25) with platforms, networks and training to showcase work and influence cultural strategy.

Rising’s ‘Whose Culture’ programme is a youth engagement and creative data mapping project which provides opportunities for young people of colour to ask questions within the sector, exploring what “culture” means to them through consultations, workshops and sharings.

Over eighteen months, Whose Culture will collaborate with young people of colour on some creative tech which measures and records what “culture” means to them, providing data capable of creating a radical shift towards inclusion in the sector.

The Brief

Rising Arts Agency are looking for a post-graduate research student who specialises in data collection and analysis. This is an exciting research opportunity and the student will be working in a voluntary capacity. We are looking for an individual who can devise methods for collecting and analysing data as part of two phases in this project: the Workshops phase and the Creative Technology phase.

Firstly, we need to collect data during the workshop phase, which runs throughout October and November 2018. Over eight weeks, we will deliver twelve Whose Culture workshops across Bristol, run with experienced facilitators and young assistants. This is an opportunity for Rising to begin to understand what ‘culture’ means to the participants, young local creatives of colour in our target areas (St Pauls, Lawrence Hill, Whitchurch Park and Southmead).

From December we will begin to work on a piece of creative technology (we envisage this being a digital platform, e.g. an app or social media network) that will enable us to gather city-wide data about the cultural habits and tastes of young creatives of colour. We then want to share this data with the city’s arts organisations and funders to effect cultural change at a strategic level.

We would like to understand what data we can safely, efficiently and effectively collect during these two phases, so that we are in line with GDPR and in line with the project’s needs. We would like to understand how we can collect this data, analyse it and disseminate it.

We foresee this process taking around 18 months in total. This would be a unique research opportunity and you could be involved for either or both of the project phases. We are looking for someone who is confident advising and taking a lead on data collection. You will work alongside the Whose Culture Project Coordinator Roseanna Dias and a Social Media Coordinator Fatima Safana, as well as Rising’s Director Kamina Walton.

Next steps

If you are interested in this opportunity, please contact Roseanna Dias on roseanna@rising.org.uk with a short statement detailing your relevant experience, how you would approach the task and why you would like to be a part of the project. Please get in touch with us before Friday 21 September – we will be contacting people as enquiries come in, meaning we may close this opportunity early. We are keen to get someone on board as soon as possible.

Contact us: If you have questions about the role, please contact Roseanna Dias on roseanna@rising.org.uk

Enabling advanced analytics for all users of the proteomics facility

Schematic of BayesProt, which evaluates protein ‘biomarkers’ that can differentiate between healthy and diseased biological samples such as blood.

Discovery proteomics

Research in the life sciences and translational medicine is being driven forward by cutting-edge techniques for studying the molecules acting in cells. We are interested in studying which proteins are present in diseased cells, and in what quantities, compared with normal cells, since the identity of the proteins may help us understand the disease process, aid the search for new drug targets, and act as diagnostic tests for the disease. The technologies used to study proteins on a large scale are collectively called discovery proteomics, and the main method used in proteomics is mass spectrometry (MS).

Improved and novel data processing for mass spectrometry 

The Dowsey group has been working on improved and novel data processing for MS for over a decade. In collaboration with proteomics laboratories in Manchester and Liverpool, they have developed a Bayesian model called BayesProt which has recently been extended to take in Bristol’s TMT data. BayesProt has proved the cornerstone of several large-scale translational studies. 

BayesProt 

BayesProt is fundamentally novel and enables analyses previously not possible, such as determining the relative protein levels derived from different transcripts of a gene, or the products of in-cell proteolysis. The purpose of this project is to port BayesProt to BlueCrystal, the University of Bristol’s High Performance Computing (HPC) machine, and to integrate its functionality into the Proteomics Facility’s workflow so that all studies passing through will benefit.

Integration of BayesProt into Galaxy Integrated Omics 

BayesProt has now been ported to the PBS and SLURM cluster managers utilised by BlueCrystal Phase 3 and 4. We have also integrated BayesProt into the Galaxy Integrated Omics (GIO) environment available on BlueCrystal 3. GIO integrates a collection of state-of-the-art tools for genomics and proteomics to enable ‘proteomics informed by transcriptomics’. This system is key to the study of the effects of differential transcription, or ‘gene switching’, caused by e.g. viral infection. BayesProt’s new deconvolution functionality will be critical to a quantitative understanding of gene switching, and hence bringing this tool into GIO will enable us to demonstrate these possibilities.

Acknowledgements / People involved in this project 

Andrew Dowsey & Ranjeet Bhamber, Population Health Sciences; Kate Heesom, Biochemistry; Andrew Davidson, David Matthews, Christoph Wuelfing, & David Lee, Cellular and Molecular Medicine, all at the University of Bristol.

This project was funded by the Jean Golding Institute Seed Corn Funding Scheme 2018. To find out about other projects supported by this scheme, take a look at the Jean Golding Institute Projects.