What is statistics? Why do we need it?
Did you already perform the climbing assay? It is a simple experiment in which a group of old and young flies are tapped down in a vial and are given a defined time (15 s) to climb up again. For each group you have determined how far the ten individuals in each vial have come on a scale from 0 to 10. Now you need to analyse your data and, for this, you need data description and statistical analysis. Such analyses are important and shouldn’t be scary! They are used to portray your data and work out what your results mean, thus turning raw data into clear statements:
 They make trends more visible, and give you a better idea of whether the results show what you think they do.
 They make it easier to compare different sets of data to see whether there is a significant difference or correlation.
 They give an indication of whether a seemingly relevant difference could have happened by chance.
This page will first explain how to describe data. It will then explain the t-test and chi-square test as two examples of common statistical tests and why they might or might not be useful for analysing data obtained from the climbing assay. For a light and entertaining introduction to statistics, you can also read the Cartoon Guide to Statistics.
Data description
Usually experimental results are a cloud of data points. To describe these data in easier terms and see potential trends, you can use various descriptors, which are listed below. The most useful of these for your experiment can depend both on the type of data you’ve collected and on what you’re trying to find out, as will be explained.
Only certain data analyses can be performed in Microsoft Excel. But there are a number of (unfortunately expensive) dedicated software programs. Furthermore, there are online calculators for each and every purpose – either collections of them (e.g. Social Science Statistics) or dedicated calculators for specific tests (e.g. to generate box-and-whisker plots or calculate the confidence interval).
General features
 Quantitative versus qualitative data:
 Quantitative data are numerical, such as how many flies there are, and how far they have climbed; they are measured or counted.
 Qualitative data are descriptive but not measured, such as eye colour or statements in opinion polls. You may be able to make a qualitative statement about the climbing performance of both groups before you carry out the actual analysis.
 If data are quantitative they can be discrete or continuous.
 Discrete data are given as counts (only positive whole numbers) or integers (positive and negative numbers). For example, the number of people in a room can only be a whole number, not halves or quarters. Due to technical limitations, the flies’ positions in the climbing assay were determined as discrete data, simply by recording the sector they were found in, which is given as whole numbers (1 – 10).
 Continuous data are generally measurements, and hence include a decimal point. For example, in each sector flies can be higher up or further down. With more precise measurement tools and time, we could have given the position of flies as 5.3 or 5.9. Obviously, taking continuous data from the climbing assay would be more time-consuming but also more accurate than the discrete data which you have determined.
 Normal (Gaussian) distribution. Knowing data distribution is important for choosing the right analysis strategies, since parametric statistical tests usually require a Gaussian distribution. Data are normally distributed if they fall symmetrically around the mean, i.e. are bell-shaped and are not skewed (see middle of Fig. 3 for data that are not normally distributed). Body height is an example of normally distributed data: most individuals will be around the mean height, with fewer and fewer each side as you get further away from the peak on the graph. Some individuals may be very tall, or very short, but the further away that you go from the mean, the less likely you are to find an individual of that height. The climbing assay should have normally distributed data when accumulating data across all experimental repeats. You can determine this with specific tests, such as the Kolmogorov-Smirnov test or Shapiro-Wilk test. Often data are not normally distributed (see Figure 3, middle and top right), and non-parametric tests have to be used for their further analysis.
 Outliers are individual data points which stand out from the rest of the data because they are abnormally high or low (see Figs. 2 and 3). For example, in the climbing assay all the young flies may have climbed above sector 6 whereas only one “outlier fly” has stayed at the bottom in sector 1. Discounting outliers is only justifiable if your sample number is high and/or if you have repeated the experiment several times and this outlier is the absolute exception; IMPORTANT: exclusion of such data needs to be discussed in your report.
 The range of data is the difference between the lowest and highest values of a data set. For example, in the set of data shown for working out the mode, the lowest value is 2 and the highest value is 48. The range of the data would be 48 – 2 = 46.
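If you have access to Python, a normality check of the kind mentioned above (Shapiro-Wilk) can be sketched in a few lines. This is a minimal sketch assuming the scipy library is installed; the pooled climbing scores are hypothetical illustration data, not real assay results.

```python
# Shapiro-Wilk normality test; the scores below are hypothetical example data.
from scipy import stats

pooled_scores = [5, 6, 7, 7, 8, 6, 5, 9, 7, 6, 8, 7, 4, 6, 7]

stat, p = stats.shapiro(pooled_scores)
if p > 0.05:
    print("No evidence against normality; parametric tests may be suitable.")
else:
    print("Data deviate from normality; consider a non-parametric test.")
```

A small p-value here means the data deviate from a normal distribution, so a non-parametric test would be the safer choice.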
Descriptive statistics
 The mean is what you get when you add up all the values, and divide the sum by the number of values in the set (see Figure 3). For example, the mean of the following set of 6 numbers is: (3 + 2 + 4 + 4 +5 + 6) / 6 = 24 / 6 = 4. The mean can be useful to know for all types of data, but especially when it is normally distributed and continuous.
 The median is the value that’s in the middle (remember: median, middle, both have “d” in them). If data are not normally distributed, the median may deviate substantially from the mean (see Figure 2). To determine the median, put all of the data in order and find the value that shows up half way through. For example, you have values for ten flies per group, and half way is between the fifth and sixth values (i.e. you have to find the mean of these two values). To indicate the spread in your data you usually add the first quartile (1/4) and third quartile (3/4) values of the data set (see Figure 2). Take the following 11 data points as an example: 3, 8, 3, 5, 9, 4, 2, 7, 5, 7, 10. When sorted (2, 3, 3, 4, 5, 5, 7, 7, 8, 9, 10), the value 5 in the middle is the median, the first quartile is 3 and the third quartile is 8; you write 5 (3 ; 8). The median is preferable to the mean when the data are skewed or have outliers (see below), since such properties bias the mean.
 The mode is the most frequent value of a data collection. For example, in the following set 14 appears more times than any other number and is therefore the mode: 2, 2, 6, 9, 14, 14, 14, 26, 26, 37, 48 (remember: mode, most common). The mode is not always a good representation of the data, but if the data are normally distributed, it can be useful for identifying the most representative reading, if that value keeps occurring frequently.
 The standard deviation (SD) indicates how much the different data points spread out from the mean of your data. Logically, our measurements are more reliable the smaller their standard deviation is. For example, suppose you have a fairly good and a bad thermometer and measure the temperature in the garden five times with each. The bad thermometer shows 16, 16.5, 19.5, 15.5, 18.5, so that the mean is 17.2 and the standard deviation is 1.72 (17.2 ± 1.72 SD; compare Figure 3, right). The better thermometer measures: 17.2, 17.5, 17.8, 17.3, 17.4 (17.44 ± 0.23 SD). Do you agree that the second data set looks better and more reliable? To find out how to calculate the standard deviation, please go here.
 The standard error (of the mean) (SEM) is calculated by dividing the standard deviation by the square root of the sample size (n). It is used to indicate how accurate your mean value is. The larger the sample size, the more accurate the mean will be, and your SEM value will be smaller.
 The SD and sample size can also be used to estimate the (95%) confidence interval (CI). Usually, the half-width of the 95% CI approximates the SEM multiplied by a factor of ~1.96 (i.e. the CI spans the mean ± 1.96 × SEM). There are online calculators that can be used to determine the CI.
 It is very important to visualise your data graphically, and examples are given in Figures 1 and 2. For both those figures the same two data sets were used which are length measurements (axon length) in two genetically different groups of neurons (set 1, set 2). If you want to have a go yourself, you can download the data.
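As a minimal sketch, the descriptors above can be computed with Python's standard library alone; the numbers are the example data sets used in the text, and the 1.96 factor for the approximate 95% CI is taken from above.

```python
# Descriptive statistics for the worked examples above (standard library only).
import math
import statistics

# Mean, median with quartiles, mode and range
mean_data = [3, 2, 4, 4, 5, 6]
print(statistics.mean(mean_data))                 # 4

median_data = [3, 8, 3, 5, 9, 4, 2, 7, 5, 7, 10]
q1, median, q3 = statistics.quantiles(median_data, n=4)
print(median, q1, q3)                             # 5.0 3.0 8.0, written as 5 (3 ; 8)

mode_data = [2, 2, 6, 9, 14, 14, 14, 26, 26, 37, 48]
print(statistics.mode(mode_data))                 # 14
print(max(mode_data) - min(mode_data))            # 46 (the range)

# SD, SEM and approximate 95% CI for the two thermometer data sets
for name, readings in [("bad", [16, 16.5, 19.5, 15.5, 18.5]),
                       ("good", [17.2, 17.5, 17.8, 17.3, 17.4])]:
    mean = statistics.mean(readings)
    sd = statistics.stdev(readings)           # sample standard deviation
    sem = sd / math.sqrt(len(readings))       # standard error of the mean
    ci = 1.96 * sem                           # approximate 95% CI half-width
    print(f"{name}: {mean:.2f} ± {sd:.2f} SD, SEM {sem:.2f}, 95% CI ± {ci:.2f}")
```

Note that `statistics.stdev` computes the sample standard deviation (dividing by n − 1), which matches the thermometer values quoted above.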
A layman’s guide to t-tests
 A t-test is a parametric test which can only be used when the data are normally distributed.
 T-tests are suitable when comparing two different groups for the same trait; you are attempting to find out whether there is a significant difference between their means. For example, we could compare the height of females and males, or in the climbing assay you compare the mean climbing height of old versus young flies.
 First come up with a null hypothesis. A null hypothesis states that the results occur only by chance – that there is no significant difference between the groups. We phrase it this way because it is mathematically possible to disprove, but impossible to prove, a hypothesis. In our example, the null hypothesis is that “there is no significant difference between the climbing heights of old versus young flies”. If our assay shows a difference between the groups, we can then reject the null hypothesis.
 The outcome of the t-test is given as the p-value (p for probability). The smaller the p-value, the more likely it is that you are looking at a real effect, and the less likely that you have a false positive due to stochastic/random events. Therefore, the lower the p-value, the more likely it is that you can reject the null hypothesis. It has been recommended not to use the terms ‘significant’ and ‘non-significant’ when talking about your results, but rather to provide the actual p-values and explain the likelihood of a false positive risk (Colquhoun, 2017, R Soc Open Sci 4, 171085). Based on the ‘False Positive Risk Calculator‘, Colquhoun argues that, for example ..
 .. for p=0.05, which is often used as the cut-off for ‘significance’, “the odds in favour of there being a real effect (given by the likelihood ratio) are about 3 : 1“.
 .. “p=0.001 in a well-powered experiment gives a likelihood ratio of almost 100 : 1 odds“.
 Paired or unpaired?
 A paired t-test is used when comparing the same group at different times or in different conditions. For example, you test the fitness of a group of people and compare each individual’s performance before and after they took an energy drink.
 An unpaired t-test is used when comparing a single trait in two separate groups, for example the fitness of a group of males and a group of females.
 So which t-test should be used to analyse the climbing assay, paired or unpaired? The climbing assay uses two different groups of flies, and compares how far the flies have climbed. Therefore, the unpaired t-test is the appropriate one, provided the data are normally distributed.
 For a detailed guide on how to carry out a t-test, please go here.
 In case data are not normally distributed, a non-parametric test needs to be used, such as the Mann-Whitney U test. Roughly speaking, this test compares two data sets as to whether one of them tends to have higher values. For example, the data sets in Figures 1 and 2 have a very low p-value (high statistical significance) because their curves (Figure 3, middle) mostly stay well separated from each other.
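Both tests can be run in a few lines if Python with the scipy library is available; the climbing scores below are hypothetical illustration data, not results from the actual assay.

```python
# Unpaired t-test and Mann-Whitney U test on two hypothetical groups of flies.
from scipy import stats

young = [7, 8, 6, 9, 7, 8, 5, 7, 6, 8]   # sector reached by each young fly
old = [3, 4, 2, 5, 3, 4, 6, 3, 2, 4]     # sector reached by each old fly

# Unpaired t-test: two separate groups, one trait (climbing height)
t_stat, p_t = stats.ttest_ind(young, old)
print(f"t-test: p = {p_t:.4f}")

# Non-parametric alternative for data that are not normally distributed
u_stat, p_u = stats.mannwhitneyu(young, old, alternative="two-sided")
print(f"Mann-Whitney U: p = {p_u:.4f}")
```

Because the two made-up groups barely overlap, both tests return a very small p-value here; with real data you would first check normality to decide which of the two p-values to report.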
A layman’s guide to chi-square tests
Chi-square tests (χ^{2}, pronounced “kai-square”; χ is the third-from-last letter in the Greek alphabet) are used to check whether your experimental results match an expectation. Chi-square is a non-parametric test, which means it can be used with both parametric and non-parametric data (see information above). Chi-square can work with real values, or with percentages, but you must make sure all values are discrete and in the same format.
 First example: if you flipped an unbiased coin twenty times, your expected value would be ten heads and ten tails. Obviously, you will be able to tell without any analysis whether the observed and expected values are the same or not. However, using chi-square, you are trying to find out whether this difference is just by chance (for example, heads or tails on an unbiased coin) or whether the difference is significant (the coin is unevenly weighted so that it has a bias towards one side). Your null hypothesis would be: “There is no significant difference between the observed and expected frequencies.”
 Second example: You want to find out whether the recessive ebony mutation (dark body colour in homozygosis) is heterosomal (on the X chromosome) or autosomal (on any of the other chromosomes). Expectation: The rules of Mendel predict that a cross between two animals who are heterozygous for an autosomal recessive mutation (+/−) will yield a 1:2:1 genotypic distribution and a 3:1 phenotypic distribution (1: “+/+“, normal body colour; 2: “+/−“, normal colour; 1: “−/−“, dark colour). For 100 flies, we therefore expect 75 normal and 25 dark-coloured animals. Your experiment: You count 100 flies and find that 70 have normal body colour, while 30 are dark. You use χ^{2} statistics to test whether this measured distribution matches the expectation. Your null hypothesis: The observed values match the expected values.
 For each observed number (O) you recorded, subtract the corresponding expected number (E), i.e. O − E.
 Now square the difference [(O − E)^{2}].
 Divide the square obtained by the expected number [(O − E)^{2} / E].
 The chi-square statistic is the sum of all those values: χ^{2} = Σ [(O − E)^{2} / E].
            Observed   Expected   O − E   (O − E)^{2}   (O − E)^{2} / E
normal         70         75       −5         25              0.33
dark body      30         25        5         25              1.00
Total         100        100                             χ^{2} = 1.33

 You can use a χ^{2} distribution table to look up the p-value associated with this statistic. Knowing that we have 1 degree of freedom, the χ^{2} statistic for a significant difference has to be greater than 3.841 (setting the level for significance at p<0.05). We have χ^{2}=1.33, so there is no significant difference.
 Nowadays, you can enter the numbers directly into a computer program which gives us the exact p-value of p=0.2482, indicating no significant difference (because p>0.05).
 Because p>0.05, we cannot reject the null hypothesis. This means that it is likely that the ebony mutation is an autosomal recessive allele (and that the parents were likely to be heterozygous for this recessive mutation).
 You can apply this form of statistics also to the climbing assay. For example, you can count flies in all vials across the course of the experiment and ask how many flies are in the lower sectors 1-5 and how many in the upper sectors 6-10, and compare these numbers for old versus young flies.
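The calculation in the worked example above can be reproduced step by step; this sketch uses only Python's standard library and the critical value 3.841 quoted above.

```python
# Chi-square statistic for the ebony cross (observed vs expected counts).
observed = {"normal": 70, "dark": 30}
expected = {"normal": 75, "dark": 25}

chi_square = sum(
    (observed[k] - expected[k]) ** 2 / expected[k] for k in observed
)
print(f"chi-square = {chi_square:.2f}")    # 1.33

# Critical value for 1 degree of freedom at p = 0.05, from a chi-square table
if chi_square > 3.841:
    print("Reject the null hypothesis.")
else:
    print("Cannot reject the null hypothesis (consistent with a 3:1 ratio).")
```

The same pattern works for the climbing-assay counts: put the observed lower/upper sector counts in `observed` and the counts expected under your null hypothesis in `expected`.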
Acknowledgements
We would like to thank Douda Bensasson and Simon Pearce for very helpful feedback, Andre Voelzmann for provision of the data sets and some of the original graphs, and Steve Royle for providing essential explanations and examples for the Chisquare section.