This repository provides tools to download, process, and structure bike-sharing datasets from multiple cities for analysis and research purposes.
You can download bike-sharing datasets from the following sources:
- New York: CitiBike
- Washington: Capital Bikeshare
- Bay Area: BayWheels
- Columbus: CoGo
- Boston: Blue Bikes
Clone the repository and install dependencies:

```bash
git clone <repo_link>
cd <repo_directory>
pip install -r requirements.txt
```
The dataset pipeline consists of two main stages:
- Data Extraction: Downloads and preprocesses raw data.
- Sub-dataset Building: Constructs train and evaluation datasets from raw data.
This stage includes the following commands:

- `downloader`: Downloads raw trace files from the sources.
- `split`: Splits the dataset into smaller chunks for efficient processing.
- `docking`: Extracts unique docking stations.
- `raw`: Extracts and stores trips in raw format (start station, end station, time, duration).
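The raw trip format (start station, end station, time, duration) can be illustrated with a small sketch. The column names and CSV layout below are assumptions for illustration only, not the pipeline's actual schema:

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RawTrip:
    start_station: int
    end_station: int
    start_time: datetime
    duration_s: int  # trip duration in seconds

def parse_raw_trips(text: str) -> list[RawTrip]:
    """Parse raw trips from CSV text (hypothetical column names)."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        RawTrip(
            start_station=int(row["start_station"]),
            end_station=int(row["end_station"]),
            start_time=datetime.fromisoformat(row["start_time"]),
            duration_s=int(row["duration_s"]),
        )
        for row in reader
    ]

sample = """start_station,end_station,start_time,duration_s
10,20,2022-03-01T08:15:00,540
20,40,2022-03-01T09:02:00,780
"""
trips = parse_raw_trips(sample)
```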
Each step can be executed independently using:
```bash
python main.py <command_name>
```
To execute all extraction steps for a given year:
```bash
python main.py all 2022
```
To fetch weather data for a specific time range:
```bash
python main.py weather new_york 2022-03-01 2022-09-01 --collection-name observations
```
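The weather command stores observations in a named collection (here, `observations`). As a rough sketch of how such records might later be filtered by date range — the record structure is assumed for illustration, not the repository's actual schema:

```python
from datetime import date

# Hypothetical weather observations, shaped the way a document store might hold them.
observations = [
    {"date": date(2022, 2, 28), "temp_c": 3.1, "precip_mm": 0.0},
    {"date": date(2022, 3, 1), "temp_c": 5.4, "precip_mm": 1.2},
    {"date": date(2022, 8, 31), "temp_c": 27.0, "precip_mm": 0.0},
    {"date": date(2022, 9, 1), "temp_c": 24.5, "precip_mm": 2.3},
]

def in_range(obs: dict, min_date: date, max_date: date) -> bool:
    """Keep observations with min_date <= date < max_date (half-open interval, an assumption)."""
    return min_date <= obs["date"] < max_date

selected = [o for o in observations if in_range(o, date(2022, 3, 1), date(2022, 9, 1))]
```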
This step constructs the final dataset from raw trips using the `subdataset` command.
```bash
python main.py subdataset "none" 10,20,40 data/dataset citibike --name-suffix training --min-date 2022-03-01 --max-date 2022-08-01
python main.py subdataset "none" 10,20,40 data/dataset citibike --name-suffix evaluation --min-date 2022-08-01 --max-date 2022-09-01
```
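Note how the two calls partition one continuous period: training covers 2022-03-01 to 2022-08-01 and evaluation continues from 2022-08-01 to 2022-09-01. Treating `--max-date` as exclusive keeps the two sets disjoint — a minimal sketch of that partitioning, assuming half-open intervals (the actual boundary semantics are not documented here):

```python
from datetime import date

def split_period(min_date: date, cut: date, max_date: date) -> dict:
    """Split [min_date, max_date) into disjoint training/evaluation windows at `cut`."""
    assert min_date < cut < max_date
    return {"training": (min_date, cut), "evaluation": (cut, max_date)}

windows = split_period(date(2022, 3, 1), date(2022, 8, 1), date(2022, 9, 1))
```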
- Generate zones:

  ```bash
  python main.py zones 51 34 data/zones/ny --filter-region 71
  ```

- Use the zones to create the training dataset:

  ```bash
  python main.py subdataset "39,0,28,41,31" 5-zones data/dataset-zones citibike --name-suffix training --min-date 2022-03-01 --max-date 2022-08-01 --nodes-from-zones --zones-path data/zones/ny/zones.json --add-weather-data --weather-db weather-new_york --weather-collection observations
  ```

- Generate the evaluation dataset:

  ```bash
  python main.py subdataset "39,0,28,41,31" 5-zones data/dataset-zones citibike --name-suffix evaluation --min-date 2022-08-01 --max-date 2022-09-01 --nodes-from-zones --zones-path data/zones/ny/zones.json --add-weather-data --weather-db weather-new_york --weather-collection observations
  ```
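With `--nodes-from-zones`, stations are mapped to the zones stored in `zones.json`. The zone representation below is purely hypothetical — a set of bounding boxes — and serves only to illustrate point-in-zone assignment, not the file's real format:

```python
from dataclasses import dataclass

@dataclass
class Zone:
    zone_id: int
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        """Half-open bounds so adjacent zones do not overlap."""
        return self.min_lat <= lat < self.max_lat and self.min_lon <= lon < self.max_lon

def assign_zone(zones: list[Zone], lat: float, lon: float):
    """Return the id of the first zone containing the point, or None if outside all zones."""
    for z in zones:
        if z.contains(lat, lon):
            return z.zone_id
    return None

zones = [
    Zone(0, 40.70, 40.75, -74.02, -73.97),
    Zone(1, 40.75, 40.80, -74.02, -73.97),
]
```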
To access CDRC bike-sharing data, apply to the CDRC for access to the `meddin-bike-sharing-world-map-data` product, which covers the following cities:
- Dublin
- London
- Paris
- New York
Each city dataset consists of the following files:

- `<city>_bikelocations.csv` – Docking station details, including location changes over time.
- `<city>_ind_<year>.csv` – 10-minute interval observations.
- `<city>_ind_hist.csv` – Historical observations (same format as `<city>_ind_<year>.csv`).
- `<city>_sum.csv` – Aggregated system statistics (every 10 minutes).
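Because the CDRC files are 10-minute dock snapshots rather than trip logs, per-station activity has to be inferred from changes between consecutive observations. A hedged sketch of that idea, with invented values (CDRC's real columns differ):

```python
# Consecutive 10-minute snapshots of available bikes at one docking station
# (hypothetical values for illustration).
snapshots = [12, 10, 10, 13, 9]

# A drop between snapshots suggests departures; a rise suggests arrivals.
# Note this undercounts when a departure and an arrival cancel out within one interval.
departures = sum(max(a - b, 0) for a, b in zip(snapshots, snapshots[1:]))
arrivals = sum(max(b - a, 0) for a, b in zip(snapshots, snapshots[1:]))
```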
🔹 Identifiers: `ucl_id` and `tfl_id` are unique across files.
The CDRC processing pipeline consists of:

- `docking` – Extracts unique docking stations.
- `raw` – Extracts and stores raw trips.
- `zones` – Creates geographical zones.
- `subdataset` – Merges raw data, docking station info, and zones.
The entire pipeline can be executed with:
```bash
python main.py --cdrc all all <city> --dataset-path data/cdrc_data/datasets --nodes-from-zones --min-date <start_date> --max-date <end_date> --name-suffix <suffix>
```
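The `all` command runs the CDRC stages in order, and the evaluation examples below use `--skip docking,raw,zones` to rerun only the final stage. A rough sketch of how such stage selection could work — the function and stage handling here are placeholders, not the repository's actual internals:

```python
STAGES = ["docking", "raw", "zones", "subdataset"]  # order from the pipeline above

def run_pipeline(skip: str = "") -> list[str]:
    """Run all stages except those named in the comma-separated `skip` list."""
    skipped = {s.strip() for s in skip.split(",") if s.strip()}
    executed = []
    for stage in STAGES:
        if stage in skipped:
            continue
        executed.append(stage)  # placeholder for invoking the real stage
    return executed
```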
Training:

```bash
python main.py --cdrc all all dublin --dataset-path data/cdrc_data/datasets --nodes-from-zones --min-date 2021-09-01 --max-date 2022-08-01 --name-suffix training
```

Evaluation:

```bash
python main.py --cdrc all all dublin --dataset-path data/cdrc_data/datasets --nodes-from-zones --min-date 2022-08-01 --max-date 2022-09-01 --name-suffix evaluation --skip docking,raw,zones
```
Training:

```bash
python main.py --cdrc all all london --dataset-path data/cdrc_data/datasets --nodes-from-zones --min-date 2021-09-01 --max-date 2022-08-01 --name-suffix training
```

Evaluation:

```bash
python main.py --cdrc all all london --dataset-path data/cdrc_data/datasets --nodes-from-zones --min-date 2022-08-01 --max-date 2022-09-01 --name-suffix evaluation --skip docking,raw,zones
```
This repository streamlines bike-sharing dataset processing for research and analysis. It supports multiple cities, integrates weather data, and allows for structured dataset generation for training and evaluation.
For further details, run:
```bash
python main.py --help
```