diff --git a/jsoc/gsoc/tables.md b/jsoc/gsoc/tables.md
deleted file mode 100644
index 64f2d4ae81..0000000000
--- a/jsoc/gsoc/tables.md
+++ /dev/null
@@ -1,56 +0,0 @@
-
-# Tabular Data – Summer of Code
-
-## Parquet.jl enhancements
-
-**Difficulty**: Medium
-
-**Duration**: 175 hours
-
-[Apache Parquet](https://parquet.apache.org/) is a binary data format for tabular data. It has features for compression and memory-mapping of datasets on disk. A decent implementation of Parquet in Julia is likely to be highly performant. It will be useful as a standard format for distributing tabular data in a binary format. There exists a Parquet.jl package that has a Parquet reader and a writer. It currently conforms to the Julia Tabular file IO interface at a very basic level. It needs more work to add support for critical elements that would make Parquet.jl usable for fast large scale parallel data processing. Each of these goals can be targeted as a single, short duration (175 hrs) project.
-@@tight-list
-* Lazy loading and support for out-of-core processing, with Arrow.jl and Tables.jl integration. Improved usability and performance of Parquet reader and writer for large files.
-* Reading from and writing data on to cloud data stores, including support for partitioned data.
-* Support for missing data types and encodings making the Julia implementation fully featured.
-@@
-
-**Resources:**
-@@tight-list
-* The [Parquet](https://parquet.apache.org/documentation/latest/) file format (also are many articles and talks on the Parquet storage format on the internet)
-* [A tour of the data ecosystem in Julia](https://quinnj.home.blog/2019/07/21/a-tour-of-the-data-ecosystem-in-julia/)
-* [Tables.jl](https://github.com/JuliaData/Tables.jl)
-* [Arrow.jl](https://github.com/JuliaData/Arrow.jl)
-@@
-
-**Recommended skills:** Good knowledge of Julia language, Julia data stack and writing performant Julia code.
-
-**Expected Results:** Depends on the specific projects we would agree on.
-
-**Mentors:** [Tanmay Mohapatra](https://github.com/tanmaykm)
-
-## DataFrames.jl join enhancements
-
-**Difficulty**: Hard
-
-**Duration**: 175 hours
-
-[DataFrames.jl](https://github.com/JuliaData/DataFrames.jl) is one of the more popular implementations of tabular data type for Julia. One of the features it supports is data frame joining. However, more work is needed to improve this functionality. The specific targets for this project are (a final list of targets included in the scope of the project can be decided later).
-@@tight-list
-* fully implement multi-threading support by joins, reduce memory requirements of used join algorithms (which should additionally improve their performance), verify efficiency of alternative joining strategies in comparison to those currently used and implement them along with adaptive algorithm choosing the right joining strategy depending on the passed data;
-* implement join allowing for efficient matching on non-equal keys; special attention should be made to matching on keys that are date/time and spatial objects;
-* implement join allowing for an in-place update of columns of one data frame by values stored in another data frame based on matching key and condition specifying when an update should be performed;
-* implement an more flexible mechanizm than currently available allowing to define output data frame column names when performing a join.
-@@
-
-**Resources:**
-@@tight-list
-* [DataFrames.jl](https://github.com/JuliaData/DataFrames.jl)
-* [Tables.jl](https://github.com/JuliaData/Tables.jl)
-* [DataAPI.jl](https://github.com/JuliaData/DataAPI.jl)
-@@
-
-**Recommended skills:** Good knowledge of Julia language, Julia data stack and writing performant multi-threaded Julia code. Experience with benchmarking code and writing tests. Knowledge of join algorithms (as e.g. used in databases like [DuckDB](https://duckdb.org/) or other tabular data manipulation ecosystems e.g. [Polars](https://www.pola.rs/) or [data.table](https://github.com/Rdatatable/data.table)).
-
-**Expected Results:** Depends on the specific projects we would agree on.
-
-**Mentors:** [Bogumił Kamiński](https://github.com/bkamins)
diff --git a/jsoc/projects.md b/jsoc/projects.md
index 4ff296f798..b1e79a2cd4 100644
--- a/jsoc/projects.md
+++ b/jsoc/projects.md
@@ -30,7 +30,6 @@ We have our project ideas organized below roughly by domain but you can also see
 * [QuantumOptics](/jsoc/gsoc/quantumoptics) - Quantum dynamics and master equations
 * [Signal processing](/jsoc/gsoc/kalmanbucy/) - Continuous time Signal Processing
 * [Symbolic computation](/jsoc/gsoc/symbolics/) - User friendly symbolic programming
-* [Tabular Data](/jsoc/gsoc/tables/) - Working with data
 * [Taija](/jsoc/gsoc/taija/) - Trustworthy Artificial Intelligence in Julia
 * [Turing](/jsoc/gsoc/turing/) - for probabilistic modelling and probabilistic programming
 * [Topology optimisation](/jsoc/gsoc/topopt/) - improving topology optimisation tools in Julia.