collect.py usage hints for converting time series observation data

Kyle Wilcox edited this page Aug 18, 2015 · 18 revisions

collect.py is a program that reads EPIC-format netCDF-3 files (grouped by the experiment in which they were collected) and converts them to CF-1.6 compliant (discrete samples) netCDF-4 files. These CF-1.6 output files will be incorporated into the portal and harvested into the IOOS database.

Ignored Files

By default, the following experiment/sensor/file types are ignored and not converted:

  • Raw hourly data (files with a1h or A1H or A1h or a1H) - 8543sc-a1h.nc
  • lp data (files with alp) - 3971-alp.nc
  • Burst variances (files with *var-) - 8545advbvar-cal.nc
  • b-cal data (files with b-cal.nc) - 8545advb-cal.nc
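
As a rough illustration of the rules listed above, the filter could be sketched as follows. The patterns are transcribed from this page, not taken from collect.py itself, so the function name `is_ignored` and the exact matching rules (for example, treating only a1h as case-insensitive) are assumptions.

```python
import re

# Assumed patterns, transcribed from the list above.
# collect.py's actual matching rules may differ.
IGNORE_PATTERNS = [
    re.compile(r"a1h", re.IGNORECASE),  # raw hourly data, any letter case
    re.compile(r"alp"),                 # lp data
    re.compile(r"var-"),                # burst variances
    re.compile(r"b-cal\.nc$"),          # b-cal data
]


def is_ignored(filename):
    """Return True if the file name matches any of the ignore patterns."""
    return any(pattern.search(filename) for pattern in IGNORE_PATTERNS)


# The example file names given in the list above.
for name in ["8543sc-a1h.nc", "3971-alp.nc", "8545advbvar-cal.nc", "8545advb-cal.nc"]:
    print(name, is_ignored(name))
```

A check like this can be handy for predicting which files in a download directory will be skipped before running a conversion.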

collect.py must have a file that provides experiment-level metadata. The default name of this file is project_metadata.csv, and it should be in the same location as collect.py. It must contain the following columns:

    1. Experiment Name [project_name] - string must match the name of the directory where the data files are (case sensitive)
    2. Scientist Name [contributor_name] - name of the PI conducting the research
    3. Title [project_title] - longer version of the Experiment Name
    4. Abstract [project_summary] - experiment summary: what was collected and why?
    5. Default server location [catalog_xml] - where to download the data from

A different file name may be specified using the -c option.
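
To sanity-check a metadata file before running a conversion, something like the sketch below could be used. It assumes the file is comma-delimited and has a header row using the bracketed names above; neither is confirmed here. The helper `load_project_metadata` and the sample row are hypothetical.

```python
import csv
import io

# Column names as listed above. That they appear verbatim in a
# header row is an assumption.
REQUIRED_COLUMNS = [
    "project_name",
    "contributor_name",
    "project_title",
    "project_summary",
    "catalog_xml",
]


def load_project_metadata(handle):
    """Read the metadata CSV and return a dict of rows keyed by project_name."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return {row["project_name"]: row for row in reader}


# Made-up sample content, standing in for project_metadata.csv.
sample = io.StringIO(
    "project_name,contributor_name,project_title,project_summary,catalog_xml\n"
    "EXAMPLE_EXP,A. Scientist,Example Experiment,Made-up summary,"
    "http://example.com/catalog.xml\n"
)
projects = load_project_metadata(sample)
print(sorted(projects))
```

For a real file, pass `open("project_metadata.csv", newline="")` instead of the in-memory sample.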

To see the help for collect.py enter:

 python collect.py -h

If you don't specify any projects with the -p option, collect.py will try to convert all the experiments it finds in the .csv file. The default command to do this is shown below (it's always a good idea to git pull first):

python collect.py --download --output=../../CF-1.6new/ 

When this is complete, ALL the files will be in a directory called download under the current working directory (cwd). I usually cd to ...emontgomery/stellwagen/usgs-cmg-portal/woods_hole_obs_data before working, so the download directory is created there. In theory, after downloading, it will convert all the experiments and put the results into sub-directories under whatever was given as the output. Should it fail and you need to re-run one or more experiments, use a command like this to re-do just DIAMONDSHOALS:

 python collect.py -p DIAMONDSHOALS --folder download -o ../../CF-1.6new

In some instances (say, when a new experiment is being added), you want to download from just one directory and convert that. The command below downloads and converts the HURRIRENE_BB files. The last column of project_metadata.csv has the URL from which to get the data for the experiment directory listed in column 1.

python collect.py --projects HURRIRENE_BB --download --output=../../CF-1.6/

The default location for downloaded files is the download directory in the cwd. If we use it, all the files we've ever collected end up in that one directory. To retain our directory structure, a command like this puts the EPIC data in ../../../tmp/HURRIRENE_BB (a location not in the datasetScan path), and the CF output in ../../CF-1.6 (where it will make a subdirectory named after the project_name).

python collect.py -p HURRIRENE_BB --folder ../../../tmp/HURRIRENE_BB --download -o ../../CF-1.6

The output directory CF-1.6 is in the datasetScan path. If you write to a different output location, you need to copy or move the files into CF-1.6. It's best to save the original files if a major revision is done.

You MUST do a --download on each dataset initially, because it adds an id global attribute, containing the string in --projects and the filename_root, to each file. If that attribute isn't there, subsequent runs of collect.py using the --folder option will fail. Therefore, if you have a local set of files that you want to convert, collect.py won't work until you either a) put the data on a TDS and use the --download option or b) add an id attribute to each file.

In the case above, since we've already downloaded HURRIRENE_BB, if we want to update the CF-1.6 output by running a more recent version of collect.py, we can skip the --download and use the local folder we created in the previous step.

python collect.py -p HURRIRENE_BB --folder ../../../tmp/HURRIRENE_BB -o ../../CF-1.6
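
If several experiments need to be re-run, the invocations shown on this page can be assembled from a small script. The sketch below only builds the argument list from the options used above (-p, --folder, --download, -o). The helper name `build_collect_command` is hypothetical and is not part of collect.py.

```python
def build_collect_command(project, output, folder=None, download=False):
    """Build the argument list for one collect.py run.

    Uses only the options shown on this page. The helper itself is a
    hypothetical convenience, not part of collect.py.
    """
    cmd = ["python", "collect.py", "-p", project]
    if folder is not None:
        cmd += ["--folder", folder]
    if download:
        cmd.append("--download")
    cmd += ["-o", output]
    return cmd


# Re-convert an already-downloaded experiment from its local folder.
cmd = build_collect_command(
    "HURRIRENE_BB",
    "../../CF-1.6",
    folder="../../../tmp/HURRIRENE_BB",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory containing collect.py.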