Neo4j Implem Overview

Chase William edited this page Mar 14, 2023 · 1 revision

Data Storage Overview

Why We Need Storage

We want to store project, type, and member information so that webpages can be generated by querying it. This information comes from the project a user adds by submitting its GitHub project URL.

Why Neo4j, a Graph Database

Type systems are naturally graph-shaped (types reference other types, members, and assemblies), so a graph database is the best fit for representing them. From my research online, Neo4j is a rather large solution, having been called an 800 lb gorilla by an instructor at RIT, but it offers a lot of community support. This includes tools to assist with learning how to use the damn thing, which is paramount considering graph databases are new to me.

What a Repository should Store

We are looking to record the following information:

  1. Repository
    1. The name of the repository
    2. The url of the repository (where it was cloned from)
    3. The associated commit hash
    4. Date & time this repository was added to our network
    5. The projects that belong to this repository
  2. Projects
    1. The repository this project came from
    2. Projects this project depends on
    3. The produced assembly from this project
  3. Assemblies
    1. Whether this assembly is produced by a local project
    2. The name of the assembly
    3. The version of the assembly
    4. The types within this assembly
    5. Assembly Dependencies
  4. Information about Types
    1. The assembly this type was defined in
    2. The name of this type
    3. Types this type depends on (needs further elaboration)
    4. The accessibility modifiers this type has (e.g., private, public, internal...)
  5. Information about Members
    1. The type this member is defined within
    2. The kind of member this member is (e.g., Event, Property, Field, Method)
    3. The accessibility modifiers this member has (e.g., private, public, internal...)
    4. The data type of this member
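
The list above could be sketched as a set of C# records. This is only an illustration of the shape of the data, not DotDocs' actual classes: every type and property name here is hypothetical, and the `Name` fields on members and projects are an assumption, since the list does not mention them. Back-references such as "the assembly this type was defined in" are expressed through containment here; in Neo4j they would presumably become relationships between nodes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; this mirrors the list above only.
public enum Accessibility { Private, PrivateProtected, Protected, Internal, ProtectedInternal, Public }

public enum MemberKind { Event, Property, Field, Method }

// A member of a type. Name is an assumption (not in the list above).
public sealed record MemberNode(string Name, MemberKind Kind, Accessibility Access, string DataType);

// A type, with its members and the names of the types it depends on.
public sealed record TypeNode(
    string Name,
    Accessibility Access,
    IReadOnlyList<MemberNode> Members,
    IReadOnlyList<string> DependsOnTypes);

// An assembly, identified by name and version.
public sealed record AssemblyNode(
    string Name,
    Version AssemblyVersion,
    bool IsProducedByLocalProject,
    IReadOnlyList<TypeNode> Types,
    IReadOnlyList<string> DependsOnAssemblies);

// A project and the assembly it produces. Name is an assumption.
public sealed record ProjectNode(
    string Name,
    AssemblyNode ProducedAssembly,
    IReadOnlyList<string> DependsOnProjects);

// A repository as it was added to the network.
public sealed record RepositoryNode(
    string Name,
    string Url,
    string CommitHash,
    DateTimeOffset AddedAt,
    IReadOnlyList<ProjectNode> Projects);
```

Records need C# 9 or later, which fits the .NET 6.0 target mentioned elsewhere on this page.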

User Assemblies

We will store all information about a repository that could be used in either the external or the internal perspective. This prevents partial data, where the database lacks what is needed to view a repository from the internal perspective. I could allow a repo to be uploaded with only the information needed for the external perspective, but what if the owner wants to change that later? Yes, the server could pull the repo back down using the saved URL and commit hash and reprocess it. And yes, another repository referencing this one as a supporting repository would only have access to its public members, so reprocessing on demand could work. However, I am not interested in adding this extra complexity for seemingly little gain. All projects using this are open source, so all their information is visible anyway, which makes the concept pointless.

Partial data could also make version tracking and any clever duplication-reduction mechanics much more difficult to implement and troubleshoot.
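
One way to picture the "store everything, decide what to show later" approach is a filter applied at read time. The sketch below is an assumption about how the two perspectives might be derived, not a description of DotDocs' code: `Perspective`, `StoredMember`, and `PerspectiveFilter` are hypothetical names, and treating "external" as "public only" is a simplification (protected members of inheritable types are arguably external too).

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical: the two ways a repository can be viewed.
public enum Perspective { External, Internal }

// Hypothetical: a stored member reduced to the one flag this sketch needs.
public sealed record StoredMember(string Name, bool IsPublic);

public static class PerspectiveFilter
{
    // Everything is stored once; the perspective only decides what is returned.
    public static IReadOnlyList<StoredMember> Visible(
        IEnumerable<StoredMember> allMembers, Perspective perspective) =>
        perspective == Perspective.Internal
            ? allMembers.ToList()
            : allMembers.Where(m => m.IsPublic).ToList();
}
```

Because the full set is always stored, switching a repository between perspectives would not require re-cloning or reprocessing it.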

Supporting Assemblies

Supporting assemblies will have all information required for either external or internal perspectives saved.

What about locally referenced projects that are actually git submodules?

Questions & Concerns

  1. If supporting assemblies are being saved, what is the point of authenticating that a user owns a repository when they add it directly? Saving supporting assemblies undermines the entire purpose of authenticating.
  2. When handling different versions of the same assembly, should we store snapshots or diffs?
  3. Should the web server for the user website be separate from the server that processes the C# projects? For example, the web server for users could be a Node.js application, whereas the C# project processing server could be an ASP.NET server running .NET 6.0.

A developer attempts to add an assembly that has already been added indirectly as a supporting assembly to another repository. How will this work?

Since we store all information for both the external and internal perspectives for supporting assemblies, we can simply inform the user that this repository has already been documented.

How will version tracking work?

Version tracking will be done via an assembly's version.

More information about .NET's assembly versioning can be found here. The implementation below is how assembly information will be retrieved in DotDocs.

// `assembly` is a loaded System.Reflection.Assembly instance.
Version? version = assembly.GetName().Version; // <major version>.<minor version>.<build number>.<revision>

Assembly versions were selected for version tracking because they offer the most consistent and simplest approach. For example, suppose a user adds a repository that brings supporting assemblies into the database (i.e., assemblies owned by another party). If the owner of a supporting library later adds their own repository, we need to identify that their assemblies already exist in the database and, moreover, whether that exact version has already been added. Once this is determined, we can reuse the existing records and prevent overwriting or duplication.
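
The "is this exact version already stored?" check can be sketched with a name-plus-version key. This is a minimal illustration under stated assumptions: `AssemblyKey` and `AssemblyIndex` are hypothetical names, and the in-memory `HashSet` stands in for whatever lookup the Neo4j database would actually perform. Note that `System.Version` treats `1.0` and `1.0.0.0` as different values, so keys should be built from full four-part versions.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical identity for an assembly record: simple name plus exact version.
public readonly record struct AssemblyKey(string Name, Version AssemblyVersion)
{
    // Builds a key from reflection metadata; missing parts fall back to
    // an empty name and version 0.0.0.0 (an assumption of this sketch).
    public static AssemblyKey From(AssemblyName assemblyName) =>
        new(assemblyName.Name ?? string.Empty,
            assemblyName.Version ?? new System.Version(0, 0, 0, 0));
}

// Hypothetical stand-in for the database lookup.
public sealed class AssemblyIndex
{
    private readonly HashSet<AssemblyKey> _known = new();

    // Returns true if the key was new and has been recorded;
    // false if this exact name and version was already present.
    public bool TryAdd(AssemblyKey key) => _known.Add(key);

    public bool Contains(AssemblyKey key) => _known.Contains(key);
}
```

With this shape, an owner adding their repository after it was already stored as a supporting assembly would get `false` from `TryAdd` for the matching version, and the existing records could be reused. A different version of the same assembly produces a different key and is stored alongside the first.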

Check out the Time-bound and Versioning documentation for Neo4j.

How can we reduce duplicate instances of data between versions?

See Questions.

Should members be their own nodes or be a part of their owning type's node?

Generics?!? Constructed Generic Types?!?

Both generic types and constructed generic types will be saved as normal concrete types would.
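
For reference, .NET reflection distinguishes these cases through properties on `System.Type`. The sketch below only shows how the two kinds could be told apart when reading types out of an assembly; the `GenericKinds` helper is a hypothetical name, and nothing here implies how DotDocs actually labels them.

```csharp
using System;
using System.Collections.Generic;

public static class GenericKinds
{
    // Classifies a type using standard System.Type properties.
    public static string Describe(Type type)
    {
        if (type.IsGenericTypeDefinition) return "generic type definition";   // e.g. List<>
        if (type.IsConstructedGenericType) return "constructed generic type"; // e.g. List<int>
        return "non-generic type";                                            // e.g. string
    }

    // Returns the definition a constructed type was built from,
    // or the type itself when it is not a constructed generic.
    public static Type DefinitionOf(Type type) =>
        type.IsConstructedGenericType ? type.GetGenericTypeDefinition() : type;
}
```

Since `List<int>` and `List<string>` share the same definition, `DefinitionOf` could be one way to link constructed types back to a single stored definition if duplication between them ever becomes a concern.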