OBJECTIVES:
1. To demonstrate the Central Limit Theorem by generating a pdf of Gaussian data from uniformly distributed random data.
2. To demonstrate the relation between statistical variability and sampling variability as well as the sensitivity of the Chi-Square goodness-of-fit test to the sample size of Gaussian data.
3. To compute the pdf of geophysical data and to compare it with the expected Gaussian pdf by using the Chi-Square test.
PROCEDURE:
1. Log on to the Linux workstation. Open a terminal window by right-clicking on the desktop and choosing "New Terminal".
2. At the Linux command line, create a subdirectory called oc3150:
mkdir ~/oc3150
Download the necessary data and program files to this directory. Begin by selecting the data sets that you have been assigned from the table below. The data set will appear in the browser window; choose "Save As" under the "File" menu and save it to your oc3150 directory. Then use the "Back" button to return to this page. (Or: right-click on your data set name and choose "Save Link Target As".) Keep the filename that is already suggested in the "Save As" window (e.g., c22x.data, c31y.data, etc.).
| c22x w29 c31x c31y | p7b p7ab c31xb | atemp1 atemp2 apres1 |
You should be able to find a table describing each of the above data sets along with a plot of instrument locations near the back of your class notes. For these data sets, the x-axis is positive in the OFFSHORE direction and the y-axis is positive alongshore to the SOUTH, while x-directional currents (e.g., c22x, c31x, etc.) are positive in the ONSHORE direction and y-directional currents (e.g., c31y) are positive alongshore to the NORTH.
3. Now download the lab M-files to your oc3150 directory. With your internet browser, go to the main website for this course (http://www.oc.nps.navy.mil/oc3150/) and click on the link for "Required Matlab Script Files". From this location you can either download individual M-files by clicking on their names, or download all of them together by selecting "mfiles.tar" and saving it to your oc3150 directory (and/or your PC, if desired). In Linux, use the command "tar xvf mfiles.tar" to extract the M-files from the tar file. On a PC, use WinZip or a similar utility to extract/unzip them.
We will be using the following M-files in today's lab:
mydata.m
random.m
whitenoise.m
gaussian.m
pdfdat.m
datstats.m
pltpdfdat.m
4. Enter the MATLAB environment from your terminal window by typing:
cd oc3150
matlab
Note: In this lab, you will be analyzing data sets by determining mean values for a number (nm) of sample subsets, each of a fixed size (nn). In the following we shall define
nm: the total number of data samples used to generate a given probability distribution histogram (one mean value will be calculated from each data sample and included in the distribution)
nn: the size of each data sample
For example, if you select fifty sample sets of values from one of your data records, and each of those data samples has twenty points, your resulting distribution will have a total of fifty values in it (nm = 50), each of which represents the mean of a twenty-point sample (nn = 20).
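The bookkeeping behind nm and nn can be sketched in a few lines. The sketch below is written in Python rather than MATLAB and is independent of the lab's M-files. It assumes that each sample is drawn at random without replacement from the record; the M-files may instead use consecutive segments. The names sample_means and record are hypothetical.

```python
import random
from statistics import fmean

def sample_means(record, nm, nn, rng):
    """Return nm mean values, each computed from nn points of the record."""
    if nn > len(record):
        raise ValueError("nn cannot exceed the record length")
    means = []
    for _ in range(nm):
        subset = rng.sample(record, nn)  # nn points, drawn without replacement (assumption)
        means.append(fmean(subset))
    return means

# Hypothetical stand-in for one of the assigned data records
record = [float(i % 17) for i in range(1000)]

means = sample_means(record, nm=50, nn=20, rng=random.Random(0))
print(len(means))  # number of values in the resulting distribution (nm)
```

With nm = 50 and nn = 20, the distribution holds fifty values, each the mean of a twenty-point sample, as in the example above.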
5. Run random (just type 'random' and answer the questions) to generate Gaussian (or Normal) random data using the Central Limit Theorem. Random creates nm mean values, each computed from a sample of nn uniformly distributed random values.
Vary the sample sizes, nn (e.g., 1, 50, 300) as well as the number of means (or samples) in the Gaussian distribution, nm (e.g., 100, 512, 2048). Hold one constant while varying the other. You should have at least three trials in which nn is varying with nm held constant, and at least three more trials in which nm is varying with nn held constant. Try to have one or more trials in which nn is less than 5 and one or more trials in which nn is greater than 50.
For each trial, find the critical Chi-Square value for the specified number of degrees of freedom and level of significance (use the table in your class notes). Compare this to the computed Chi-Square value to decide, using the null hypothesis test, whether the data are Gaussian. Keep track of these results for the table you will construct in the lab report, as described below.
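For readers who want to see the idea outside MATLAB, the trial loop in this step can be sketched in Python. This is not the code in random.m. It assumes uniform values on [0, 1), ten bins that each hold one tenth of a Gaussian, and a Gaussian fitted with the sample mean and standard deviation. The function names are hypothetical, and the critical value shown is the standard tabulated one for the 0.05 level and 7 degrees of freedom; use the value that matches the degrees of freedom your own run reports.

```python
import bisect
import random
from statistics import NormalDist, fmean, stdev

def clt_means(nm, nn, rng):
    """Return nm means, each of nn uniform random values on [0, 1)."""
    return [sum(rng.random() for _ in range(nn)) / nn for _ in range(nm)]

def chi_square_vs_gaussian(values, nbins=10):
    """Chi-Square statistic against a Gaussian with the sample mean and std."""
    n = len(values)
    mu, sigma = fmean(values), stdev(values)
    # Interior bin edges in z units; each bin holds 1/nbins of a Gaussian
    edges = [NormalDist().inv_cdf(k / nbins) for k in range(1, nbins)]
    observed = [0] * nbins
    for v in values:
        observed[bisect.bisect_right(edges, (v - mu) / sigma)] += 1
    expected = n / nbins  # same expected count in every bin
    return sum((o - expected) ** 2 / expected for o in observed)

rng = random.Random(0)
chi2_critical = 14.067  # 0.05 level, 7 degrees of freedom, from a Chi-Square table

for nn in (1, 5, 50):  # vary nn with nm held constant
    means = clt_means(nm=2048, nn=nn, rng=rng)
    chi2 = chi_square_vs_gaussian(means)
    verdict = "reject" if chi2 > chi2_critical else "do not reject"
    print(f"nn={nn:3d}  chi2={chi2:8.2f}  {verdict} the Gaussian hypothesis")
```

Swapping the loop so that nm varies while nn stays fixed gives the second set of trials.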
6. Type 'whitenoise' to generate white noise, a known Gaussian process, with nn = 1. Vary the number of samples (nm) and note that:
a) the fit improves with increasing number of samples.
b) a larger number of samples is more likely to give outlying points.
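The two observations in this step can be explored with a small Python sketch. It is not the code in whitenoise.m. It assumes the white noise is a set of independent Gaussian values with zero mean and unit variance, and it measures the fit by the RMS difference between the histogram fractions and the Gaussian bin probabilities, which is only one possible measure. The function name and bin layout are hypothetical.

```python
import random
from statistics import NormalDist

def histogram_misfit(nm, rng, nbins=10, zmax=2.5):
    """Return (RMS misfit of histogram vs Gaussian, largest |value|) for nm points."""
    data = [rng.gauss(0.0, 1.0) for _ in range(nm)]  # assumed white noise model
    width = 2 * zmax / nbins
    counts = [0] * nbins
    for x in data:
        k = int((x + zmax) // width)
        if 0 <= k < nbins:  # values beyond +/- zmax fall outside the histogram
            counts[k] += 1
    cdf = NormalDist().cdf
    squared = 0.0
    for k in range(nbins):
        left = -zmax + k * width
        p = cdf(left + width) - cdf(left)  # Gaussian probability of this bin
        squared += (counts[k] / nm - p) ** 2
    return (squared / nbins) ** 0.5, max(abs(x) for x in data)

rng = random.Random(0)
for nm in (100, 512, 2048, 20000):
    misfit, largest = histogram_misfit(nm, rng)
    print(f"nm={nm:6d}  rms misfit={misfit:.4f}  largest |value|={largest:.2f}")
```

Running it for several seeds shows how much the misfit and the largest value vary from one realization to the next.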
7. Type 'mydata' to run mydata, which computes the pdfs of your data sets. Here again, nn = 1. Save the statistics of each data set for comparison.
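The simple statistics compared later in the questions (mean, standard deviation, skewness) can be computed as in the Python sketch below. It is not the code in datstats.m. The normalizations are assumptions: the variance divides by n - 1 and the skewness averages the cubed standardized values over n, so the M-file may give slightly different numbers. The sample values are made up for illustration.

```python
import math

def simple_stats(values):
    """Return (mean, standard deviation, skewness) of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # assumed n - 1 divisor
    std = math.sqrt(variance)
    skewness = sum(((v - mean) / std) ** 3 for v in values) / n  # assumed n divisor
    return mean, std, skewness

# Made-up values standing in for a data record
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
lopsided = [1.0, 1.0, 1.0, 10.0]

for name, data in (("symmetric", symmetric), ("lopsided", lopsided)):
    mean, std, skew = simple_stats(data)
    print(f"{name}: mean={mean:.3f}  std={std:.3f}  skewness={skew:.3f}")
```

A skewness near zero indicates a distribution that is symmetric about its mean, while a positive value indicates a longer tail toward large values.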
8. REPORT:
Turn in a nine-column table for parts 5, 6 and 7, listing
PRINTING PLOTS:
1. >> print
   (B/W plot: goes to LJ8100 "bw" printer in lab)
2. >> print -dpsc filename
   (Color plot: saves your figure to a file called filename.ps)
   >> !lp -dcolor filename.ps
   (prints the .ps plotfile on "color" printer)
QUESTIONS:
1. In part 5, how did your histogram plots change when you varied the number of means in the distribution (nm)? How did the statistics change? What did you observe when you varied the sample size (nn) in part 5? How are the results in each case related to the Central Limit Theorem?
2. When you increased the number of samples (nm) in part 6, keeping
nn=1, your histogram should have become more Gaussian in shape.
Why is this result NOT an example of the Central Limit Theorem? Explain what
is causing the histogram to better fit the Gaussian curve in this case.
3. If your two data sets have the same units, compare the simple statistics (mean, std deviation, skewness) calculated for your two data sets and explain the differences in their statistics by referring to the specific physical characteristics of each data set (e.g., type of data, sensor location, time period of record, etc.). If your data sets have different units, don't compare them directly. Instead, describe what each of the statistics tells you about the physical measurements made by each of your two instruments.
4. Make the null hypothesis that your data are Gaussian. Use a significance level (alpha) of 0.05. Based on the Chi-Square value computed in the goodness-of-fit test, do you accept or reject the null hypothesis? If you rejected the null hypothesis, what is the probability of your being wrong? How might you choose a value of alpha so that you could reach the opposite conclusion about the null hypothesis? Why is it possible to either reject or accept the null hypothesis using exactly the same data set?
5. Assume you have just calculated the standard normalized variable, z, from a recently acquired data time series. If you now find that you have 18 bins and you compute the Chi-Square goodness-of-fit test on the z values you've generated, how many degrees of freedom will there be? Please explain your answer.