# metaphacts ETL pipeline

The Extract-Transform-Load (ETL) pipeline provides a means to convert structured data to RDF, perform post-processing steps, and ingest it into a graph database.
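To illustrate the "convert structured data to RDF" step, here is a minimal, hedged sketch in plain Python: it turns CSV rows into N-Triples, mapping each non-identifier column to one triple. The function name, the `id` column convention, and the `http://example.org/resource/` base IRI are illustrative assumptions, not part of the pipeline; the actual pipeline uses declarative RML mappings rather than hand-written code.

```python
import csv
import io


def csv_to_ntriples(csv_text, base_iri="http://example.org/resource/", id_column="id"):
    """Convert CSV rows to N-Triples: one triple per non-id column.

    NOTE: toy illustration only; the real pipeline applies RML mappings.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The id column provides the subject IRI for the whole row.
        subject = f"<{base_iri}{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue
            # Each remaining column becomes a predicate with a literal object.
            predicate = f"<{base_iri}{column}>"
            triples.append(f'{subject} {predicate} "{value}" .')
    return "\n".join(triples)


print(csv_to_ntriples("id,name\n1,Alice\n2,Bob\n"))
```

With the two-row input above, this emits one `name` triple per row, e.g. a triple with subject `<http://example.org/resource/1>` and object `"Alice"`.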

The pipeline follows the principles described in Concepts and is based on an opinionated selection of components and tools.

## Features

The ETL pipeline has the following features:

- reads source files from an S3 bucket
- converts source files to RDF using RML mappings
- supports CSV, XML, JSON, and JSONL source formats, also in compressed (gzipped) form
- writes the RDF files to an S3 bucket, one RDF file per source file
- ingests the RDF files into a graph database using the GraphDB Preload tool
- treats files added to the source bucket after the initial ingestion as incremental updates
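The incremental-update behavior in the last bullet can be sketched as a simple set difference: only source files not seen in a previous run are selected for conversion and ingestion. The function name and the idea of tracking processed keys are assumptions for illustration; the pipeline's actual bookkeeping mechanism is not specified here.

```python
def plan_incremental_load(source_keys, processed_keys):
    """Return the source files that still need conversion and ingestion.

    source_keys:    file keys currently present in the source bucket
    processed_keys: file keys already handled by a previous pipeline run

    NOTE: hypothetical helper sketching the incremental-update idea.
    """
    return sorted(set(source_keys) - set(processed_keys))


# Initial run: every source file is new.
first = plan_incremental_load(["a.csv", "b.json"], [])
# After c.xml is added to the bucket, only the new file is picked up.
second = plan_incremental_load(["a.csv", "b.json", "c.xml"], ["a.csv", "b.json"])
```

Here `first` contains both files, while `second` contains only `c.xml`, mirroring how newly added files become incremental updates.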

## Setup and Operation

See ETL Pipeline Setup for how to set up and run the pipeline.

## Architecture

See Architecture for a diagram and a detailed description of the ETL pipeline's architecture.

## Copyright

All content in this repository is (c) 2023 by metaphacts.