ELEMENTARY STATISTICS

 

            Listed below are definitions of some common summary statistics:

 

1)  Mean - The average of all your data points.  To obtain the mean, simply add up all your values, and divide by the total number of data points.

 

2)  Standard Deviation - this is a estimate of deviation or probable error within your measurements.  This is an expression of precision--how widely your experimental values vary.  Standard deviation can be used not only to express precision, but also the help determine whether an individual value from your set of values is in error.

 

            If one could take an infinite set of measurements of something, then the distribution graph of these values should look something like this:

 

 

            This is called a normal distribution, or Gaussian curve.  At the center of the curve is the mean, which reflects the highest number of measured values, and the best possible value for your experimental measurement.

           

A standard deviation is a percentage of the mean value of your data.  As the graph shows, most of the values cluster together around the mean, and taper off with distance in either direction. The number of values falling outside of two standard deviations above or below the mean is very small and probability predicts that it will be less than or equal to 5 percent of the total number of measurements (or, 95% of all data is within 2s, i.e. 2 standard deviations of the mean).  Therefore it is common practice to reject data, as belonging to your population, if it falls outside of two standard deviations (2s), as those data fall within the outermost 5% of all possible values, and are probably in error.  Standard deviation is calculated by:

 

s = Ö S (Xi - X)2

(N - 1)

 

which, in English, means to take the differences between the individual values (Xi) and the mean (X), square those differences, and then sum all the squares.  Divide the sum of the squares by the total number of values (N) minus one, and take the square root of your result.

            The easiest way to determine s (standard deviation) is by making a table of your data, then putting each value into the s equation.  A useful example would be the measurement of stonefly larvae in Oak Creek.  At one site six values (the Xi) were obtained for individuals in an area of stream:  147, 140, 92, 142, 153, and 146.  Here X (the mean) is calculated by summing all the Xi values (820) and dividing by how many Xi values there are (6).  820/6 = 137, so X = 137 ppm.

 

                          Xi                               Xi - X                          (Xi - X)2

                        147                                10                                 100

                        140                                  3                                     9

                          92                              - 45                              2025

                        142                                  5                                   25  

                        153                                16                                 256

                   +   146                                  9                              +  81              

          S (Xi) = 820                                              S  (Xi - X)2 =2496

 

So the Standard Deviation is:   

                               

            s =Ö S (X1 - X)2 =      Ö  2496           =          22.34, or 22  (utilizing correct

                        (N - 1)                     5                                                  significant figures)

 

Two standard deviations would be 44.  The X plus or minus 2s = 93 to 181, and is the range of accepted deviation to be included in the population.  The value of 92 lies outside of 2s, and can be rejected as being in error.  The lower standard deviation indicates greater precision in estimating the mean.  Microsoft Excel provides a formula for determining the standard deviation.

            Many times people use the Standard Error of the Mean (SE) to reflect the variation in the sample.  The SE is calculated as the standard deviation divided by the square root of the samples size.  It is a nice statistic to eye-ball whether the mean of two samples are different.  If the mean + 1SE of each sample do not overlap, then in most cases the samples will be statistically different using a t-test.


GRAPHING AND STATISTICS ON THE COMPUTER

                        Once you have collected your data you must determine if it is valuable (does it say anything?), and you also need to find a way to present it so that it is clear and easy to understand.  Knowledge of graphing and statistics is tremendously helpful in presenting your data and determining if the results are meaningful.  All the graphing and statistics you will need to do can usually be accomplished on a computer using Microsoft Excel. 

 

CREATING GRAPHS IN MICROSOFT EXCEL

            Excel is a powerful spreadsheet program which allows you to organize and manipulate data you have collected in the field, as well as graph and run statistical tests on the data.  You can produce many different types of graphs in Excel, depending on the nature and format of your data.  The processes for producing different types of graphs vary, so a general description of the procedure will be given here.  In general, you will highlight one or more columns of your data with the mouse, then click on the “Chart Wizard” icon located along the top of your screen (it looks like a little bar graph).  From here you simply follow the directions to assign your x and y variables, label axes, and give the graph a title.  Once you hit finish and your graph appears on the screen, you may change the dimensions by clicking and dragging on the corners of the graph, and you may edit and modify any aspect of the graph by double-clicking on the specific thing you want to change.  A sample Excel graph follows.

 

Points to Remember when creating figures

Figure titles are always placed at the bottom of the figure.

The title should be meaningful and explain what the figure illustrates.

The axes should be labeled and in general, abbreviations should never be used, unless the abbreviation is explained in the figure title.

Font sizes should be consistent for axes, axes titles, and figure title.


STATISTICS USING MICROSOFT EXCEL

 

            Most of the statistical tests we ask you to run can be conducted in Microsoft Excel.  Below are descriptions of the tests you will be expected to use and directions for how to run them on the computer.

 

Parametric vs Non-Parametric Statistics

There are two classes of statistical tests that are typically used in biology.

 

Parametric statistical tests assume normal distributions and equal variances, whereas non-parametric tests have fewer assumptions about the distribution of the data. 

Consequently, parametric tests are more powerful but assumptions that are required for their usage are more easily violated than non-parametric tests.

 

There is usually a nonparametric equivalent for every parametric test.

t-test  ~ Mann-Whitney U

ANOVA ~ Kruskall-Wallis

Linear Regression ~ Spearman Rank

 


Listed below are examples of the types of questions and corresponding statistical tests used in ecology (Assuming parametric statistics assumptions are met).

 

What is the difference in the mean between two populations?

t-test or paired t-test

 

What is the difference in the mean of one trait among more than two populations?

ANOVA

 

What is the difference in the mean of one trait among one or more pairs or combinations of populations after one has determined that the ANOVA test among populations is significant.

Post-hoc multiple comparison tests or A-priori contrast tests

 

What is the difference in the mean of more than one trait among more than two populations?

MANOVA

 

Are two or more variables related to each other, that is do they tend to co-vary?

Correlation (assuming no cause and effect) or Regression (assuming cause and effect)

 

 

 

 

t-Test

 

Explanation

            t-Tests are used for determining the similarity or difference between two sampled populations.  The value you get is called a t-statistic, and should be used in its absolute value form.  The smaller the value, the more similar are your two samples.  Other important values include t-critical value (two-tail) which is calculated with an alpha level of 0.05 (remember the 5% on either side of 2 standard deviations from the mean).  If your t-statistic is greater than this number, you can reject the null hypothesis (that the two samples are the same), and say that there is a statistically significant difference between the two samples.  The P-value (two-tail) is important also, in that it gives the exact probability of mistakenly rejecting the null hypothesis when you shouldn’t have.  A P-value less than 0.05 means the same thing as having a t-statistic greater than your t-critical value, and thus allows you to reject the null hypothesis as above.

 

Instructions for Excel*

1)    Open Excel.

2)    Under the file menu, select NEW, which will give you a blank worksheet.

3)    You will use two columns, each of which represents one of the data

       samples you collected.

4)    At the top of each column, type in a label that describes the sample.

5)    Under each label, enter the data you have collected for that sample.

6)    At this point you should have your data in two columns with a label at the top of

       each.  Now it’s time to analyze your data.

7)    Go to the options menu and select Tools.  Under Tools, select Data

       Analysis.

8)    Select the option called: t-test: two sample assuming equal variances.

9)    Selecting this will bring up another window that asks for information

       regarding where Excel can find your data.  This is called the input range.

10)  Enter the cell name and number where the data begin and end, separated by a

       colon  (example, a2:a6). 

11)  You need to do this for both columns of data, separately.

12) You also need to tell Excel where to put the results of the t-test.  This is called

       the output range.

13)  It is easiest to specify the output range as the row directly below the data.

       (example, a10).

14)  Press OK.

* from a handout developed by Dr. Diane Ebert-May at NAU’s Science and Mathematics Learning Center.


STATISTICS USING MICROSOFT EXCEL (continued)

           

Analysis of Variance

            The Analysis of Variance test is an extension of the t-test.  T-test is used when you compare two populations, Analysis of Variance tests are used when you want to compare more than two populations to determine the similarity or difference among the sampled populations.   Analysis of Variance tests are referred to by several names including F-tests, ANOVA, and AOV.  The value obtained is called a F-statistic, and should be used in its absolute value form.  The smaller the value, the more similar are your samples.  Other important values include the F-critical value, which is comparable to the t-critical value, and the P-value which is interpreted the same as it is in the t-test.  The ANOVA only tells you if there are differences between at least two populations in your set.  If you want to examine differences between population pairs you need to conduct multiple comparison tests.  If you know beforehand what groups should be different you can apply a priori contrast tests.  Most people apply post-hoc multiple comparison tests of which there are many to choose from and available in statistical programs.

            The simple one-way ANOVA includes just one classification variable, where groups differ in a single way (e.g. hi, med, lo fertilization).  You may have an experiment where individuals differ in more than one way (degrees of fertilization and watering).  In these cases you apply 2-way on up to n-way ANOVA’s  to include as many independent factors as you have in your design.  In these analyses you can examine main effects of single independent variables and interaction effects between main effects.

            ANOVA tests come in many variations (nested, repeated measures, multivariate), which can accommodate a wide variety of experimental and observational designs.

 

Instructions for Excel for a One-Way ANOVA*

1)    Open Excel.

2)    Under the file menu, select NEW, which will give you a blank worksheet.

3)    You will use three columns, each of which represents one of the data

       samples you collected.

4)    At the top of each column, type in a label that describes the sample.

5)    Under each label, enter the data you have collected for that sample.

6)    At this point you should have your data in three columns with a label at the top of

       each.  Now it’s time to analyze your data.

7)    Go to the options menu and select Tools.  Under Tools, select Data

       Analysis.

8)    Select the option called: Anova: Single Factor.

9)    Selecting this will bring up another window that asks for information

       regarding where Excel can find your data.  This is called the input range.

10)  Enter the cell name and number where the data begin and end, separated by a

       colon  (example, a2:c6). 

11) You also need to tell Excel where to put the results of the t-test.  This is called

       the output range.

12)  It is easiest to specify the output range as the row directly below the data.

       (example, a10).

13)  Press OK.


 

STATISTICS USING MICROSOFT EXCEL (continued)

 

Correlation Analysis: Pearson Correlation

 

Correlation Analysis: Pearson Correlation

            A correlation is a way to statistically measure the association between two variables.   A correlation produces an r-value, which tells you how closely correlated the two variables are.  r-Values range from 0-1, with 0 indicating completely unrelated variables and 1 indicating a perfect (linear) relationship between the two.  If you square the r-value (r2) you get a number that can be expressed as a percentage, telling you how much of the change in one variable can be explained by the other.

   For example, imagine you were trying to determine if sample temperature or sample density more strongly affected the number of invertebrates present.  Let’s say you calculated an r-value of  .15 for the correlation between sample density and # of invertebrates present.  That gives an r2 of .0225, or 2%.  This means that the # of invertebrates is essentially not influenced by sample density.  Let’s say then for sample temperature and # of invertebrates you got an r-value of .85.  This gives an r2 of .7225, or 72%.  This means that 72% of the variation in the number of invertebrates per sample is due to variations in sample temperature.

 

 

Instructions for Excel*

1)    Open Excel.

2)    Under the file menu, select NEW, which will give you a blank worksheet.

3)    You will use two columns, each of which represents one of the data

       samples you collected.

4)    At the top of each column, type in a label that describes the sample.

5)    Under each label, enter the data you have collected for that sample.

6)    At this point you should have your data in two columns with a label at the top of

       each.  Now it’s time to analyze your data.

7)    Go to the options menu and select Tools.  Under Tools, select Data

       Analysis.

8)    Select the option called:  Correlation

9)    Selecting this will bring up another window that asks for information

       regarding where Excel can find your data.  This is called the input range.

10)  Enter the cell names and numbers where the data begin and end, separated by

       a colon (example, a2:b6, where the data starts at a2 and ends at b6)  This

       range should include both columns of data at one time but should not include

       the data labels.

11)  You also need to tell Excel where to put the results of the regression.  This

       is called the output range.

12)  It is easiest to specify the output range as the row directly below the data. 

       (example, a7)

13)  Press OK.

 

* from a handout developed by Dr. Diane Ebert-May at NAU’s Science and Mathematics Learning Center.


STATISTICS USING MICROSOFT EXCEL (continued)

 

 

Regression Analysis:  Linear Regression

            Linear regression is a statistical method that allows you to predict one variable from another.  If we graphed the data from our invertebrate experiment (X-axis = temperature, y-axis = # of invertebrates), we would get a scatter plot of points.  The Excel program can draw the line of best fit for us, which is a regression.  The equation for the regression is in the form Y=mx+b.  Y is the dependent variable, X is the independent variable, b is the y-intercept, and m is the slope.  Referring again to our previous example, by using this equation you can predict how many invertebrates you would find if the sample temperature were 30 oC.  As with the correlation you are also provided with an r2 value which again tells you what percentage of the observed variation is explained by your selected variable.  The closer your r2 value is to 100% the more dependable your predictions will be.

 

 

Instructions for Excel

       The following instructions assume that you have already entered your data into an Excel worksheet for graphing or other statistical tests.

 

1)    Highlight both columns of your data with the mouse.

2)    Click on the “Chart Wizard” icon located along the top of the screen.

3)    Choose the XY (Scatter) chart type, and the option with the points not

       connected by lines.

4)    Follow the rest of the graphing instructions, labeling your axes and giving the

       chart a title.

5)    Once your graph is displayed on the worksheet with your data, click on one

       of the data points with your right mouse button.

6)    From the small menu which appears, choose the “add Trendline” option.

7)    You will be presented a series of regression types to choose from, and the

       linear option will already be highlighted.

8)    Before you hit OK, however, click on the Options menu and check the boxes

       entitled “Display equation on chart” and “Display R-squared value on chart.”

9)    Now click OK and your graph, complete with the equation and r2-value, will

       be displayed on the worksheet next to your data.

 

*Note: the regression tool in Excel can be used to make regressions other than linear


STATISTICS USING MICROSOFT EXCEL (continued)

 

Frequency Analysis:  Chi-square test

 

Explanation

The statistical procedures described above all deal with testing hypotheses about continuous variables--those which involve direct measurement of variables and the comparison of values obtained.  Sometimes ecologists deal with data that are grouped into classes, such as the number of individual organisms from each of 10 different species obtained in a soil sample.  You might want to know if the distribution of individuals among each species differs from that determined in a previous study for a different soil type.  The chi-square test is a simple way to test for differences in frequency data--variables distributed into two or more classes (in this example, the classes are species).  As with the t-test, you will use Excel to calculate a test statistic, in this case the chi-square (X2) statistic.  The chi-square test statistic is then compared with a critical value for X2.   By comparing the test statistic with the critical value obtained from a statistical table or software package, you will decide whether you are able to reject your null hypothesis at the predetermined probability (typically P = 0.05).

 

Instructions for Excel

1)    Open Excel.

2)    Under the file menu, select NEW, which will give you a blank worksheet.

3)    You will use two columns.  One of these represents the data from the

       samples you collected.  The other may be a second set of sample data, or it

       may be an expected or hypothesized frequency suggested from ecological

       theory, past studies, or a well-known frequency distribution, such as a

       normal distribution.  In all cases, you are testing to see if the “observed” and

       “expected” frequencies differ.

4)    At the top of each column, type in a label that describes the sample

       frequency distribution.

5)    Under each label, enter the appropriate data you have collected.

6)    At this point  you should have your data in two columns with a label at the

       top of each.  Now it’s time to analyze your data.

7)    Go to the insert menu and select Function.  Under Function, select

       Statistical.

8)    In the right-hand column select the option called: ChiTest.

9)    Selecting this will bring up another window that asks for information

       regarding where Excel can find your data.  These are the input ranges.  In

       this test there are two: one is called the “observed,” and the other the

“expected range.”  Note that even when you are comparing two sample frequencies,  you will have to calculate the expected values. Think of

       it this way: if your null hypothesis is that the frequencies are the same, you

“expect” the distribution to exhibit a random pattern.  You will compare this random pattern of "expected values" to the "observed" data.

10)  Enter the cell name and number where the first set of data begins and ends,

       separated by a colon.  (Example a2:a6, where the data starts at a2 and ends at

       a6).  Next enter the cell name and number where the second set of data begins

       and ends, in the same manner.

11)  After you enter this second range, the p-value for your X2 test will automatically

       appear in the upper right-hand corner of the box.  You may want to write this

       down now before hitting the “finish” button, as sometimes it can be difficult to find

       where Excel chose to print out your p-value.