Listed below are definitions of some common summary statistics:
1) Mean - The average of all your data points. To obtain the mean, simply add up all your values, and divide by the total number of data points.
2) Standard Deviation - this is a estimate of deviation or probable error within your measurements. This is an expression of precision--how widely your experimental values vary. Standard deviation can be used not only to express precision, but also the help determine whether an individual value from your set of values is in error.
If one could take an infinite set of measurements of something, then the distribution graph of these values should look something like this:
This is called a normal distribution, or Gaussian curve. At the center of the curve is the mean, which reflects the highest number of measured values, and the best possible value for your experimental measurement.
A standard deviation is a percentage of the mean value of your data. As the graph shows, most of the values cluster together around the mean, and taper off with distance in either direction. The number of values falling outside of two standard deviations above or below the mean is very small and probability predicts that it will be less than or equal to 5 percent of the total number of measurements (or, 95% of all data is within 2s, i.e. 2 standard deviations of the mean). Therefore it is common practice to reject data, as belonging to your population, if it falls outside of two standard deviations (2s), as those data fall within the outermost 5% of all possible values, and are probably in error. Standard deviation is calculated by:
s = Ö S (Xi - X)2
(N - 1)
which, in English, means to take the differences between the individual values (Xi) and the mean (X), square those differences, and then sum all the squares. Divide the sum of the squares by the total number of values (N) minus one, and take the square root of your result.
The
easiest way to determine s (standard
deviation) is by making a table of your data, then putting each value into the s equation. A useful example would be the measurement of
stonefly larvae in
Xi Xi - X (Xi - X)2
147 10 100
140 3 9
92 - 45 2025
142 5 25
153 16 256
+ 146 9 + 81
S (Xi) = 820 S (Xi - X)2 =2496
So the Standard Deviation is:
s =Ö S (X1 - X)2 = Ö 2496 = 22.34, or 22 (utilizing correct
(N - 1) 5 significant figures)
Two standard deviations would be 44. The X plus or minus 2s = 93 to 181, and is the range of accepted deviation to be included in the population. The value of 92 lies outside of 2s, and can be rejected as being in error. The lower standard deviation indicates greater precision in estimating the mean. Microsoft Excel provides a formula for determining the standard deviation.
Many times people use the Standard Error of the Mean (SE) to reflect the variation in the sample. The SE is calculated as the standard deviation divided by the square root of the samples size. It is a nice statistic to eye-ball whether the mean of two samples are different. If the mean + 1SE of each sample do not overlap, then in most cases the samples will be statistically different using a t-test.
Once you have collected your data you must determine if it is valuable (does it say anything?), and you also need to find a way to present it so that it is clear and easy to understand. Knowledge of graphing and statistics is tremendously helpful in presenting your data and determining if the results are meaningful. All the graphing and statistics you will need to do can usually be accomplished on a computer using Microsoft Excel.
Excel is a powerful spreadsheet program which allows you to organize and manipulate data you have collected in the field, as well as graph and run statistical tests on the data. You can produce many different types of graphs in Excel, depending on the nature and format of your data. The processes for producing different types of graphs vary, so a general description of the procedure will be given here. In general, you will highlight one or more columns of your data with the mouse, then click on the “Chart Wizard” icon located along the top of your screen (it looks like a little bar graph). From here you simply follow the directions to assign your x and y variables, label axes, and give the graph a title. Once you hit finish and your graph appears on the screen, you may change the dimensions by clicking and dragging on the corners of the graph, and you may edit and modify any aspect of the graph by double-clicking on the specific thing you want to change. A sample Excel graph follows.
Points to Remember when creating figures
Figure titles are always
placed at the bottom of the figure.
The title should
be meaningful and explain what the figure illustrates.
The axes should be labeled and in general, abbreviations should never be
used, unless the abbreviation is explained in the figure title.
Font sizes should be consistent for axes, axes titles, and figure title.
Most
of the statistical tests we ask you to run can be conducted in Microsoft
Excel. Below are descriptions of the
tests you will be expected to use and directions for how to run them on the
computer.
There are two classes of statistical tests that are typically used in biology.
Parametric statistical tests assume normal distributions and equal variances, whereas non-parametric tests have fewer assumptions about the distribution of the data.
Consequently, parametric tests are more powerful but assumptions that are required for their usage are more easily violated than non-parametric tests.
There is usually a nonparametric equivalent for every parametric test.
t-test ~ Mann-Whitney U
ANOVA ~ Kruskall-Wallis
Linear Regression ~ Spearman Rank
Listed below are examples of the types of questions and corresponding
statistical tests used in ecology (Assuming parametric statistics assumptions
are met).
What is the difference in the mean between two
populations?
t-test or paired t-test
What is the difference in the mean of one trait among
more than two populations?
ANOVA
What is the difference in the mean of one trait among one or more pairs or combinations of populations after one has determined that the ANOVA test among populations is significant.
Post-hoc multiple comparison tests or A-priori contrast tests
What is the difference in the mean of more than one trait
among more than two populations?
MANOVA
Are two or more variables related to each other, that is
do they tend to co-vary?
Correlation
(assuming no cause and effect) or Regression (assuming cause and effect)
Explanation
t-Tests are used for determining the similarity or difference between two sampled populations. The value you get is called a t-statistic, and should be used in its absolute value form. The smaller the value, the more similar are your two samples. Other important values include t-critical value (two-tail) which is calculated with an alpha level of 0.05 (remember the 5% on either side of 2 standard deviations from the mean). If your t-statistic is greater than this number, you can reject the null hypothesis (that the two samples are the same), and say that there is a statistically significant difference between the two samples. The P-value (two-tail) is important also, in that it gives the exact probability of mistakenly rejecting the null hypothesis when you shouldn’t have. A P-value less than 0.05 means the same thing as having a t-statistic greater than your t-critical value, and thus allows you to reject the null hypothesis as above.
Instructions for Excel*
1) Open Excel.
2) Under the file menu, select NEW, which will give you a blank worksheet.
3) You will use two columns, each of which represents one of the data
samples you collected.
4) At the top of each column, type in a label that describes the sample.
5) Under each label, enter the data you have collected for that sample.
6) At this point you should have your data in two columns with a label at the top of
each. Now it’s time to analyze your data.
7) Go to the options menu and select Tools.
Under Tools, select Data
Analysis.
8) Select the option called: t-test: two sample assuming equal variances.
9) Selecting this will bring up another window that asks for information
regarding where Excel can find your data. This is called the input range.
10) Enter the cell name and number where the data begin and end, separated by a
colon (example, a2:a6).
11) You need to do this for both columns of data, separately.
12) You also need to tell Excel where to put the results of the t-test. This is called
the output range.
13) It is easiest to specify the output range as the row directly below the data.
(example, a10).
14) Press OK.
* from a handout developed by Dr.
Diane Ebert-May at NAU’s Science and
The Analysis of Variance test is an extension of the t-test. T-test is used when you compare two populations, Analysis of Variance tests are used when you want to compare more than two populations to determine the similarity or difference among the sampled populations. Analysis of Variance tests are referred to by several names including F-tests, ANOVA, and AOV. The value obtained is called a F-statistic, and should be used in its absolute value form. The smaller the value, the more similar are your samples. Other important values include the F-critical value, which is comparable to the t-critical value, and the P-value which is interpreted the same as it is in the t-test. The ANOVA only tells you if there are differences between at least two populations in your set. If you want to examine differences between population pairs you need to conduct multiple comparison tests. If you know beforehand what groups should be different you can apply a priori contrast tests. Most people apply post-hoc multiple comparison tests of which there are many to choose from and available in statistical programs.
The simple one-way ANOVA includes just one classification variable, where groups differ in a single way (e.g. hi, med, lo fertilization). You may have an experiment where individuals differ in more than one way (degrees of fertilization and watering). In these cases you apply 2-way on up to n-way ANOVA’s to include as many independent factors as you have in your design. In these analyses you can examine main effects of single independent variables and interaction effects between main effects.
ANOVA tests come in many variations (nested, repeated measures, multivariate), which can accommodate a wide variety of experimental and observational designs.
Instructions for Excel for a One-Way ANOVA*
1) Open Excel.
2) Under the file menu, select NEW, which will give you a blank worksheet.
3) You will use three columns, each of which represents one of the data
samples you collected.
4) At the top of each column, type in a label that describes the sample.
5) Under each label, enter the data you have collected for that sample.
6) At this point you should have your data in three columns with a label at the top of
each. Now it’s time to analyze your data.
7) Go to the options menu and select Tools.
Under Tools, select Data
Analysis.
8) Select the option called: Anova: Single Factor.
9) Selecting this will bring up another window that asks for information
regarding where Excel can find your data. This is called the input range.
10) Enter the cell name and number where the data begin and end, separated by a
colon (example, a2:c6).
11) You also need to tell Excel where to put the results of the t-test. This is called
the output range.
12) It is easiest to specify the output range as the row directly below the data.
(example, a10).
13) Press OK.
A correlation is a way to statistically measure the association between two variables. A correlation produces an r-value, which tells you how closely correlated the two variables are. r-Values range from 0-1, with 0 indicating completely unrelated variables and 1 indicating a perfect (linear) relationship between the two. If you square the r-value (r2) you get a number that can be expressed as a percentage, telling you how much of the change in one variable can be explained by the other.
For example, imagine you were trying to determine if sample temperature or sample density more strongly affected the number of invertebrates present. Let’s say you calculated an r-value of .15 for the correlation between sample density and # of invertebrates present. That gives an r2 of .0225, or 2%. This means that the # of invertebrates is essentially not influenced by sample density. Let’s say then for sample temperature and # of invertebrates you got an r-value of .85. This gives an r2 of .7225, or 72%. This means that 72% of the variation in the number of invertebrates per sample is due to variations in sample temperature.
Instructions for Excel*
1) Open Excel.
2) Under the file menu, select NEW, which will give you a blank worksheet.
3) You will use two columns, each of which represents one of the data
samples you collected.
4) At the top of each column, type in a label that describes the sample.
5) Under each label, enter the data you have collected for that sample.
6) At this point you should have your data in two columns with a label at the top of
each. Now it’s time to analyze your data.
7) Go to the options menu and select Tools.
Under Tools, select Data
Analysis.
8) Select the option called: Correlation
9) Selecting this will bring up another window that asks for information
regarding where Excel can find your data. This is called the input range.
10) Enter the cell names and numbers where the data begin and end, separated by
a colon (example, a2:b6, where the data starts at a2 and ends at b6) This
range should include both columns of data at one time but should not include
the data labels.
11) You also need to tell Excel where to put the results of the regression. This
is called the output range.
12) It is easiest to specify the output range as the row directly below the data.
(example, a7)
13) Press OK.
* from a handout developed by Dr. Diane Ebert-May at NAU’s Science and Mathematics Learning Center.
Linear regression is a statistical method that allows you to predict one variable from another. If we graphed the data from our invertebrate experiment (X-axis = temperature, y-axis = # of invertebrates), we would get a scatter plot of points. The Excel program can draw the line of best fit for us, which is a regression. The equation for the regression is in the form Y=mx+b. Y is the dependent variable, X is the independent variable, b is the y-intercept, and m is the slope. Referring again to our previous example, by using this equation you can predict how many invertebrates you would find if the sample temperature were 30 oC. As with the correlation you are also provided with an r2 value which again tells you what percentage of the observed variation is explained by your selected variable. The closer your r2 value is to 100% the more dependable your predictions will be.
Instructions for Excel
The following instructions assume that you have already entered your data into an Excel worksheet for graphing or other statistical tests.
1) Highlight both columns of your data with the mouse.
2) Click on the “Chart Wizard” icon located along the top of the screen.
3) Choose the XY (Scatter) chart type, and the option with the points not
connected by lines.
4) Follow the rest of the graphing instructions, labeling your axes and giving the
chart a title.
5) Once your graph is displayed on the worksheet with your data, click on one
of the data points with your right mouse button.
6) From the small menu which appears, choose the “add Trendline” option.
7) You will be presented a series of regression types to choose from, and the
linear option will already be highlighted.
8) Before you hit OK, however, click on the Options menu and check the boxes
entitled “Display equation on chart” and “Display R-squared value on chart.”
9) Now click OK and your graph, complete with the equation and r2-value, will
be displayed on the worksheet next to your data.
*Note: the regression tool in Excel can be used to make regressions other than linear
Explanation
The statistical procedures described above all deal with testing hypotheses about continuous variables--those which involve direct measurement of variables and the comparison of values obtained. Sometimes ecologists deal with data that are grouped into classes, such as the number of individual organisms from each of 10 different species obtained in a soil sample. You might want to know if the distribution of individuals among each species differs from that determined in a previous study for a different soil type. The chi-square test is a simple way to test for differences in frequency data--variables distributed into two or more classes (in this example, the classes are species). As with the t-test, you will use Excel to calculate a test statistic, in this case the chi-square (X2) statistic. The chi-square test statistic is then compared with a critical value for X2. By comparing the test statistic with the critical value obtained from a statistical table or software package, you will decide whether you are able to reject your null hypothesis at the predetermined probability (typically P = 0.05).
Instructions for Excel
1) Open Excel.
2) Under the file menu, select NEW, which will give you a blank worksheet.
3) You will use two columns. One of these represents the data from the
samples you collected. The other may be a second set of sample data, or it
may be an expected or hypothesized frequency suggested from ecological
theory, past studies, or a well-known frequency distribution, such as a
normal distribution. In all cases, you are testing to see if the “observed” and
“expected” frequencies differ.
4) At the top of each column, type in a label that describes the sample
frequency distribution.
5) Under each label, enter the appropriate data you have collected.
6) At this point you should have your data in two columns with a label at the
top of each. Now it’s time to analyze your data.
7) Go to the insert menu and select Function. Under Function, select
Statistical.
8) In the right-hand column select the option called: ChiTest.
9) Selecting this will bring up another window that asks for information
regarding where Excel can find your data. These are the input ranges. In
this test there are two: one is called the “observed,” and the other the
“expected range.” Note that even when you are comparing two sample frequencies, you will have to calculate the expected values. Think of
“expect” the distribution to exhibit a random pattern. You will compare this random pattern of "expected values" to the "observed" data.
10) Enter the cell name and number where the first set of data begins and ends,
separated by a colon. (Example a2:a6, where the data starts at a2 and ends at
a6). Next enter the cell name and number where the second set of data begins
and ends, in the same manner.
11) After you enter this second range, the p-value for your X2 test will automatically
appear in the upper right-hand corner of the box. You may want to write this
down now before hitting the “finish” button, as sometimes it can be difficult to find
where
Excel chose to print out your p-value.