OC3150 LAB 1



OBJECTIVES:

1. To demonstrate the Central Limit Theorem by generating a pdf of Gaussian data from uniformly distributed random data.

2. To demonstrate the relation between statistical variability and sampling variability as well as the sensitivity of the Chi-Square goodness-of-fit test to the sample size of Gaussian data.

3. To compute the pdf of geophysical data and to compare it with the expected Gaussian pdf by using the Chi-Square test.

PROCEDURE:

1. Log on to the Linux workstation. Open a terminal window by right-clicking on the desktop and choosing "New Terminal".

2. At the Linux command line, create a subdirectory called oc3150:

          mkdir     ~/oc3150

Download the necessary data and program files to this directory. Begin by selecting the data sets that you have been assigned from the table below. Each data set will appear in the browser window; choose "Save As" under the "File" menu and save it to your oc3150 directory. Then use the "Back" button to return to this page. (Alternatively, right-click on your data set name and choose "Save Link Target As".) Keep the filename already suggested in the "Save As" window (e.g., c22x.data, c31y.data).
 
 c22x     w29       c31x      c31y

 w21       p30       c37x      w38

 c40x      w41

 p7b       p7ab      c31xb

c36xb     p30b
 

 atemp1     atemp2     apres1

 apres2     wsped1     wsped2

 stemp1      stemp2

You should be able to find a table describing each of the above data sets along with a plot of instrument locations near the back of your class notes. For these data sets, the x-axis is positive in the OFFSHORE direction and the y-axis is positive alongshore to the SOUTH, while x-directional currents (e.g., c22x, c31x, etc.) are positive in the ONSHORE direction and y-directional currents (e.g., c31y) are positive alongshore to the NORTH.

3. Now download the lab M-files to your oc3150 directory. With your internet browser, go to the main website for this course (http://www.oc.nps.navy.mil/oc3150/) and click on the link for "Required Matlab Script Files". From this location you can either download individual M-files by clicking on their names or download all files together by selecting "mfiles.tar" and saving it to your oc3150 directory (and/or your PC, if desired). In Linux, use the command "tar xvf mfiles.tar" to extract the M-files from the tar file. On a PC, use WinZip or a similar utility to extract them.

We will be using the following M-files in today's lab:

      mydata.m       random.m       whitenoise.m       gaussian.m
      pdfdat.m        datstats.m        pltpdfdat.m
 

4. Enter the MATLAB environment from your terminal window by typing:

        cd ~/oc3150
        matlab

Note: In this lab, you will be analyzing data sets by determining mean values for a number (nm) of sample subsets, each of a fixed size (nn). In the following we shall define

nm: the total number of data samples used to generate a given probability distribution histogram (one mean value will be calculated from each data sample and included in the distribution)

nn: the size of each data sample

For example, if you select fifty sample sets of values from one of your data records, and each of those data samples has twenty points, your resulting distribution will have a total of fifty values in it (nm = 50), each of which represents the mean of a twenty-point sample (nn = 20).
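The relation between nn and nm can be illustrated with a short sketch. It is written in Python rather than MATLAB and is not taken from the lab's M-files; the function name block_means and the choice of consecutive, non-overlapping samples are assumptions made only for illustration.

```python
def block_means(record, nn, nm):
    """Return nm mean values, one per consecutive nn-point sample.

    Assumption: samples are consecutive and non-overlapping. The lab
    scripts may select their samples differently.
    """
    if nn * nm > len(record):
        raise ValueError("record too short for nm samples of nn points each")
    return [sum(record[i * nn:(i + 1) * nn]) / nn for i in range(nm)]


# Made-up stand-in for a data record of 1000 points.
record = [float(i % 7) for i in range(1000)]

# Fifty samples of twenty points each give fifty mean values.
means = block_means(record, nn=20, nm=50)
print(len(means))
```

The resulting list has nm entries, and these are the values that go into the histogram.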

5. Run random (just type 'random' and answer the questions) to generate Gaussian (or Normal) random data using the Central Limit Theorem. The script creates nm uniformly distributed random data samples, each of size nn. The nm means of these data samples are normalized (to "standard normal variable" format) by subtracting their mean and dividing by their standard deviation. They are then combined to form a normalized pdf histogram, which is compared to the Gaussian pdf using the Chi-Square goodness-of-fit test. (Note: In an unnormalized histogram, degrees of freedom would be equal to the number of bins. However, when a data set is converted to "standard normal" form, the value of d.o.f. must be reduced to account for the calculations of mean, standard deviation, and the total number of points.)
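The procedure described above can be sketched as follows. This is an independent Python illustration, not the code in random.m: the bin edges, the use of the sample standard deviation (dividing by n - 1), the open-ended outer bins, and all function names are assumptions.

```python
import math
import random


def sample_means(nm, nn, rng):
    """Return nm means, each taken over nn uniform random values on [0, 1)."""
    return [sum(rng.random() for _ in range(nn)) / nn for _ in range(nm)]


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / std for v in values]


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def chi_square_vs_gaussian(z_values, edges):
    """Chi-Square statistic of binned z-values against Gaussian expectations.

    The outer bins are open-ended so that every value is counted.
    """
    n = len(z_values)
    nbins = len(edges) + 1
    observed = [0] * nbins
    for z in z_values:
        observed[sum(1 for e in edges if z >= e)] += 1
    cdf = [0.0] + [normal_cdf(e) for e in edges] + [1.0]
    expected = [n * (cdf[k + 1] - cdf[k]) for k in range(nbins)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


EDGES = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]  # assumed bin edges (8 bins)
rng = random.Random(0)  # fixed seed so the run is repeatable

for nn in (1, 50):
    z = standardize(sample_means(2048, nn, rng))
    print("nn =", nn, " Chi-Square =", round(chi_square_vs_gaussian(z, EDGES), 1))
```

With nn = 1 the means are simply the uniform values themselves, so the statistic should come out much larger than with nn = 50, where averaging has pulled the distribution toward the Gaussian shape.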

Vary the sample sizes, nn (e.g., 1, 50, 300) as well as the number of means (or samples) in the Gaussian distribution, nm (e.g., 100, 512, 2048).  Hold one constant while varying the other. You should have at least three trials in which nn is varying with nm held constant, and at least three more trials in which nm is varying with nn held constant. Try to have one or more trials in which nn is less than 5 and one or more trials in which nn is greater than 50.

For each trial, find the critical Chi-Square value for the specified number of degrees of freedom and level of significance (use the table in your class notes). Compare this to the computed Chi-Square value to decide, using the null hypothesis test, whether the data are Gaussian. Keep track of these results for the table you will construct in the lab report, as described below.
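As a cross-check on the table lookup, the critical value can be approximated numerically. The sketch below is in Python and uses the Wilson-Hilferty approximation, which is a stand-in for the table in the class notes and not the lab's own method; the function names are hypothetical. The null hypothesis is rejected when the computed value exceeds the critical value.

```python
from statistics import NormalDist


def chi2_critical(dof, alpha=0.05):
    """Approximate upper-tail critical Chi-Square value (Wilson-Hilferty).

    This is an approximation; use the table in the class notes for the
    values you report.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)  # standard normal quantile
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def accept_gaussian(chi2_computed, dof, alpha=0.05):
    """True if the null hypothesis (data are Gaussian) is not rejected."""
    return chi2_computed <= chi2_critical(dof, alpha)


# Example with made-up numbers: computed Chi-Square of 12.3 with 10 d.o.f.
print(round(chi2_critical(10), 2), accept_gaussian(12.3, 10))
```

For 10 degrees of freedom at alpha = 0.05 the approximation lands within a few hundredths of the tabulated value of 18.307.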
 

6. Type 'whitenoise' to generate white noise, a known Gaussian population. These data are already normally distributed, and you will be taking the sample points one at a time (nn = 1). What you see in your histogram therefore depends on the number of samples you take rather than on the sample size. Compare the histogram to the expected Gaussian pdf. Vary the number of samples (e.g., 100, 500, 5000). Observe that

    a) the fit improves with increasing number of samples.

    b) a larger number of samples is more likely to give outlying points.
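Both observations can be reproduced with a small Python sketch. It does not reproduce whitenoise.m; the histogram range, the bin count, the misfit measure, and the function names are assumptions chosen for illustration.

```python
import math
import random


def gaussian_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def histogram_density(values, lo=-4.0, hi=4.0, nbins=16):
    """Return bin centers and the normalized histogram (a pdf estimate)."""
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        k = int((v - lo) // width)
        if 0 <= k < nbins:  # values outside [lo, hi) are not binned
            counts[k] += 1
    centers = [lo + (k + 0.5) * width for k in range(nbins)]
    density = [c / (len(values) * width) for c in counts]
    return centers, density


def rms_misfit(values):
    """Root-mean-square difference between the histogram and the Gaussian pdf."""
    centers, density = histogram_density(values)
    sq = [(d - gaussian_pdf(c)) ** 2 for c, d in zip(centers, density)]
    return math.sqrt(sum(sq) / len(sq))


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (100, 500, 5000):
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # nn = 1: one point per sample
    print(n, round(rms_misfit(noise), 4), round(max(abs(v) for v in noise), 2))
```

The second printed column (the misfit) should tend to shrink as the number of samples grows, while the third (the largest absolute value drawn) should tend to grow.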
 

7. Type 'mydata' to run mydata, which computes the pdfs of your data sets. Here again, nn = 1. Save the statistics of each data set for comparison.
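For reference, the simple statistics compared later in the questions (mean, standard deviation, skewness) can be computed as in the Python sketch below. It is not the code in datstats.m or mydata.m; the function name and the population convention (dividing by n) are assumptions, so the lab scripts may give slightly different numbers.

```python
import math


def simple_stats(values):
    """Return (mean, standard deviation, skewness) of a list of numbers.

    Assumption: moments are divided by n (population convention).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    std = math.sqrt(m2)
    return mean, std, m3 / std ** 3


# Made-up values with one large outlier on the high side.
mean, std, skew = simple_stats([1.0, 2.0, 3.0, 4.0, 10.0])
print(mean, round(std, 3), round(skew, 3))
```

A positive skewness, as in this made-up example, indicates a longer tail toward high values; a symmetric data set gives a skewness of zero.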
 

8. REPORT:

Turn in a nine-column table for parts 5, 6 and 7, listing

Answer the questions listed below, and turn in the plots of your two data sets.
 

PRINTING PLOTS:

1. >> print      (B/W plot: goes to LJ8100 "bw" printer in lab)

2. >> print  -dpsc  filename          (Color plot: saves your figure to a file called filename.ps)
    >> !lp  -dcolor  filename.ps       (prints the .ps plotfile on "color" printer)
 

QUESTIONS:

1. In part 5, how did your histogram plots change when you varied the number of means in the distribution (nm)? How did the statistics change? What did you observe when you varied the sample size (nn) in part 5? How are the results in each case related to the Central Limit Theorem?

2. When you increased the number of samples (nm) in part 6, keeping nn=1, your histogram should have become more Gaussian in shape. Why is this result NOT an example of the Central Limit Theorem? Explain what is causing the histogram to better fit the Gaussian curve in this case.

3. If your two data sets have the same units, compare the simple statistics (mean, standard deviation, skewness) calculated for them and explain the differences by referring to the specific physical characteristics of each data set (e.g., type of data, sensor location, time period of record, etc.). If your data sets have different units, don't compare them directly. Instead, describe what each of the statistics tells you about the physical measurements made by each of your two instruments.

4. Make the null hypothesis that your data are Gaussian. Use a significance level (alpha) of 0.05. Based on the Chi-Square value computed in the goodness-of-fit test, do you accept or reject the null hypothesis? If you rejected the null hypothesis, what is the probability of your being wrong? How might you choose a value of alpha so that you could reach the opposite conclusion about the null hypothesis? Why is it possible to either reject or accept the null hypothesis using exactly the same data set?

5. Assume you have just calculated the standard normalized variable, z, from a recently acquired data time series. If you now find that you have 18 bins and you perform the Chi-Square goodness-of-fit test on the z values you've generated, how many degrees of freedom will there be? Please explain your answer.