Merge pull request #975 from dondi/beta
v6.0.0
dondi authored Aug 31, 2022
2 parents 9dda904 + 7177aaa commit cea4225
Showing 31 changed files with 1,480 additions and 295 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -17,6 +17,9 @@ logs
results
/.idea

database/network-database/script-results
database/network-database/source-files

npm-debug.log
node_modules
package-lock.json
4 changes: 2 additions & 2 deletions README.md
@@ -1,8 +1,8 @@
GRNsight
========
[![DOI](https://zenodo.org/badge/16195791.svg)](https://zenodo.org/badge/latestdoi/16195791)
[![Build Status](https://app.travis-ci.com/dondi/GRNsight.svg?branch=master)](https://app.travis-ci.com/dondi/GRNsight)
[![Coverage Status](https://coveralls.io/repos/github/dondi/GRNsight/badge.svg?branch=master)](https://coveralls.io/github/dondi/GRNsight?branch=master)
[![Build Status](https://app.travis-ci.com/dondi/GRNsight.svg?branch=beta)](https://app.travis-ci.com/dondi/GRNsight)
[![Coverage Status](https://coveralls.io/repos/github/dondi/GRNsight/badge.svg?branch=beta)](https://coveralls.io/github/dondi/GRNsight?branch=beta)

http://dondi.github.io/GRNsight/

1 change: 1 addition & 0 deletions database/README.md
@@ -0,0 +1 @@
This directory contains the files pertaining to both the network and expression databases. See the README.md file in each folder for information about the schema you intend to use.
44 changes: 44 additions & 0 deletions database/network-database/README.md
@@ -0,0 +1,44 @@
# Network Database (Schema)

All files pertaining to the network database live within this directory.

## The basics

### Schema

All network data is stored within the spring2022_network schema on our Postgres database.

The schema lives at the top level of this directory in the file `schema.sql`, which defines the tables within the spring2022_network schema.

### Scripts

All scripts live within the subdirectory `scripts`, located at the top level of the network database directory.

Any source files required to run the scripts live within the subdirectory `source-files`, located at the top level of the network database directory. Because source files may be large, this directory is not checked in (see `.gitignore`); create it yourself and place any source files you need there.

All generated results of the scripts live in the subdirectory `script-results`, located at the top level of the network database directory. Currently, every script that generates output creates this directory if it does not already exist. When adding a new script that generates output, best practice is to create the `script-results` directory and any subdirectories if they do not exist, to prevent errors and snafus in freshly cloned repositories.

Within the scripts directory, there are the following files:

- `generate_network.py`
- `generate_sgd_network_from_yeastract_network.py`
- `loader.py`
- `filter_genes.py`

#### Network Generator (and data preprocessor)

This script (`generate_network.py`) is a two-for-one. It first uses the YeastMine service of the SGD database to query for all regulator genes relating to Saccharomyces cerevisiae, then gets all of the targets for each regulator gene. It then constructs two networks from these connections (a regulator-by-regulator matrix as well as a regulator-by-target matrix), along with the processed loader files, so that they are ready to load using `loader.py`.

The resulting network matrices are located in `script-results/networks`, and the resulting processed loader files are located within `script-results/processed-loader-files`.
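For illustration, a network matrix is a tab-separated file whose header row lists the regulators and whose first column lists the targets, with 1/0 marking the presence or absence of a regulatory connection (the gene names below are hypothetical examples, not actual script output):

```
cols regulators/rows targets	ACE2	SWI5
CTS1	1	1
EGT2	0	1
```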

Make sure to have all dependencies installed beforehand or you will receive errors (`pip3 install intermine tzlocal`, etc.; see the file for all imports).

Usage:
```
python3 generate_network.py
```
#### Generate an SGD network from a Yeastract network

This script takes a network (assumed to contain data from Yeastract, though it can be any network) and produces a network with data queried from YeastMine (SGD). It takes the regulators and targets from the given network file, queries YeastMine for the regulatory connections between those genes, and creates a new network from the data obtained.


24 changes: 24 additions & 0 deletions database/network-database/schema.sql
@@ -0,0 +1,24 @@
CREATE TABLE spring2022_network.source (
    time_stamp TIMESTAMP,
    source VARCHAR,
    PRIMARY KEY(time_stamp, source)
);

CREATE TABLE spring2022_network.gene (
    gene_id VARCHAR,            -- systematic-like name
    display_gene_id VARCHAR,    -- standard-like name
    species VARCHAR,
    taxon_id VARCHAR,
    regulator BOOLEAN,
    PRIMARY KEY(gene_id, taxon_id)
);

CREATE TABLE spring2022_network.network (
    regulator_gene_id VARCHAR,
    target_gene_id VARCHAR,
    taxon_id VARCHAR,
    time_stamp TIMESTAMP,
    source VARCHAR,
    FOREIGN KEY (regulator_gene_id, taxon_id) REFERENCES spring2022_network.gene(gene_id, taxon_id),
    FOREIGN KEY (target_gene_id, taxon_id) REFERENCES spring2022_network.gene(gene_id, taxon_id),
    FOREIGN KEY (time_stamp, source) REFERENCES spring2022_network.source(time_stamp, source)
);
76 changes: 76 additions & 0 deletions database/network-database/scripts/filter_genes.py
@@ -0,0 +1,76 @@
import psycopg2
import csv

PROCESSED_GENES = "../script-results/processed-loader-files/gene.csv"
MISSING_GENE_DESTINATION = '../script-results/processed-loader-files/missing-genes.csv'
UPDATE_GENE_DESTINATION = '../script-results/processed-loader-files/update-genes.csv'

connection = None
try:
    connection = psycopg2.connect(user="postgres",
                                  password="",
                                  host="grnsight2.cfimp3lu6uob.us-west-1.rds.amazonaws.com",
                                  port="5432",
                                  database="postgres")
    cursor = connection.cursor()
    postgreSQL_select_Query = "select * from spring2022_network.gene"

    cursor.execute(postgreSQL_select_Query)
    print("Selecting rows from gene table using cursor.fetchall")
    gene_records = cursor.fetchall()

    db_genes = {}
    missing_genes = {}
    genes_to_update = {}
    for gene in gene_records:
        # key = (gene_id, taxon_id)
        key = (gene[0], gene[3])
        value = {"display_gene_id": gene[1], "species": gene[2], "regulator": gene[4]}
        db_genes[key] = value

    print(f'Processing file {PROCESSED_GENES}')
    with open(PROCESSED_GENES, 'r', encoding="UTF-8") as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)  # skip the header row
        for row in reader:
            gene_id, display_gene_id, species, taxon_id, regulator = row[:5]
            key = (gene_id, taxon_id)
            value = {"display_gene_id": display_gene_id, "species": species, "regulator": regulator}
            if key not in db_genes:
                missing_genes[key] = value
            elif db_genes[key]["display_gene_id"] != display_gene_id:
                # the display gene id got updated, so update our db to account for that
                genes_to_update[key] = value

    headers = 'Gene ID\tDisplay Gene ID\tSpecies\tTaxon ID\tRegulator'

    print('Creating missing-genes.csv\n')
    with open(MISSING_GENE_DESTINATION, 'w') as gene_file:
        gene_file.write(f'{headers}\n')
        for gene in missing_genes:
            gene_file.write(f'{gene[0]}\t{missing_genes[gene]["display_gene_id"]}\t{missing_genes[gene]["species"]}\t{gene[1]}\t{missing_genes[gene]["regulator"]}\n')

    print('Creating update-genes.csv\n')
    with open(UPDATE_GENE_DESTINATION, 'w') as gene_file:
        gene_file.write(f'{headers}\n')
        for gene in genes_to_update:
            gene_file.write(f'{gene[0]}\t{genes_to_update[gene]["display_gene_id"]}\t{genes_to_update[gene]["species"]}\t{gene[1]}\t{genes_to_update[gene]["regulator"]}\n')

except (Exception, psycopg2.Error) as error:
    print("Error while fetching data from PostgreSQL", error)

finally:
    # closing database connection.
    if connection:
        cursor.close()
        connection.close()
        print("PostgreSQL connection is closed")
199 changes: 199 additions & 0 deletions database/network-database/scripts/generate_network.py
@@ -0,0 +1,199 @@
from __future__ import print_function

import csv
import re
import sys
import os
import datetime
import pytz
import tzlocal

from intermine.webservice import Service

service = Service("https://yeastmine.yeastgenome.org/yeastmine/service")

# Get Network Data from Yeastmine

query = service.new_query("Gene")

query.add_view(
    "primaryIdentifier", "secondaryIdentifier", "symbol", "name", "sgdAlias",
    "regulationSummary.summaryParagraph",
    "regulationSummary.publications.pubMedId",
    "regulationSummary.publications.citation"
)
query.outerjoin("regulationSummary.publications")

regulators = {}
all_genes = {}
print("COLLECTING REGULATORS\n")
for row in query.rows():
    systematic_name = row["secondaryIdentifier"]
    standard_name = row["symbol"]
    if standard_name is None:
        standard_name = systematic_name

    regulators[standard_name] = systematic_name
    all_genes[standard_name] = systematic_name

regulators_to_targets = {}
all_targets = {}


print("COLLECTING TARGETS\n")
for regulator in regulators:
    query = service.new_query("Gene")
    query.add_constraint("regulatoryRegions", "TFBindingSite")
    query.add_view(
        "regulatoryRegions.regulator.symbol",
        "regulatoryRegions.regulator.secondaryIdentifier", "symbol",
        "secondaryIdentifier", "regulatoryRegions.regEvidence.ontologyTerm.name",
        "regulatoryRegions.regEvidence.ontologyTerm.identifier",
        "regulatoryRegions.experimentCondition",
        "regulatoryRegions.strainBackground",
        "regulatoryRegions.regulationDirection",
        "regulatoryRegions.publications.pubMedId", "regulatoryRegions.datasource",
        "regulatoryRegions.annotationType"
    )
    query.add_sort_order("Gene.secondaryIdentifier", "ASC")
    query.add_constraint("regulatoryRegions.regulator", "LOOKUP", regulator, "S. cerevisiae", code="A")
    targets = {}

    for row in query.rows():
        target_systematic_name = row["secondaryIdentifier"]
        target_standard_name = row["symbol"]
        if target_standard_name is None:
            target_standard_name = target_systematic_name
        targets[target_standard_name] = target_systematic_name
        all_targets[target_standard_name] = target_systematic_name
        all_genes[target_standard_name] = target_systematic_name

    regulators_to_targets[regulator] = {"systematic_name": regulators[regulator], "targets": targets}


def create_regulator_to_target_row(target, all_regulators):
    # Build one tab-separated matrix row: the target name followed by
    # 1/0 for each regulator, marking whether it regulates this target.
    result = "" + target
    for regulator in all_regulators:
        if target in all_regulators[regulator]["targets"]:
            result += "\t" + "1"
        else:
            result += "\t" + "0"
    return result


# Create files

# Create folder paths
os.makedirs('../script-results', exist_ok=True)
os.makedirs('../script-results/networks', exist_ok=True)
os.makedirs('../script-results/processed-loader-files', exist_ok=True)


# Files to be generated

# Create Networks

REGULATORS_TO_TARGETS_MATRIX = '../script-results/networks/regulators_to_targets.csv'
REGULATORS_TO_REGULATORS_MATRIX = '../script-results/networks/regulators_to_regulators.csv'


targets = [target for target in all_targets if target is not None]
regulators_list = [regulator for regulator in regulators_to_targets if regulator is not None]

print('Creating REGULATORS TO TARGETS MATRIX\n')
with open(REGULATORS_TO_TARGETS_MATRIX, 'w') as regulator_to_target_file:
    headers = 'cols regulators/rows targets\t' + '\t'.join(regulators_list)
    regulator_to_target_file.write(f'{headers}\n')
    for target in targets:
        result = create_regulator_to_target_row(target, regulators_to_targets)
        regulator_to_target_file.write(f'{result}\n')

print('Creating REGULATORS TO REGULATORS MATRIX\n')
with open(REGULATORS_TO_REGULATORS_MATRIX, 'w') as regulator_to_regulator_file:
    headers = 'cols regulators/rows regulators\t' + '\t'.join(regulators_list)
    regulator_to_regulator_file.write(f'{headers}\n')
    # rows are regulators here, since this matrix is regulator-by-regulator
    for target in regulators_list:
        result = create_regulator_to_target_row(target, regulators_to_targets)
        regulator_to_regulator_file.write(f'{result}\n')


# Create loader-files

# Source Table

SOURCE_DESTINATION = '../script-results/processed-loader-files/source.csv'

timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
source = "YeastMine - Saccharomyces Genome Database"

with open(SOURCE_DESTINATION, 'w') as source_file:
    source_file.write(f'Timestamp\tSource\n{timestamp}\t{source}\n')

# Gene Table

GENE_DESTINATION = '../script-results/processed-loader-files/gene.csv'

species = "Saccharomyces cerevisiae"
taxon_id = "559292"

print('Creating gene.csv\n')
with open(GENE_DESTINATION, 'w') as gene_file:
    gene_file.write('Gene ID\tDisplay Gene ID\tSpecies\tTaxon ID\tRegulator\n')
    for gene in all_genes:
        regulator_flag = "true" if gene in regulators else "false"
        gene_file.write(f'{all_genes[gene]}\t{gene}\t{species}\t{taxon_id}\t{regulator_flag}\n')


# Network Table

NETWORK_DESTINATION = '../script-results/processed-loader-files/network.csv'

print('Creating network.csv\n')
with open(NETWORK_DESTINATION, 'w') as network_file:
    network_file.write('Regulator Gene ID\tTarget Gene ID\tTaxon ID\tTimestamp\tSource\n')
    for gene in regulators_to_targets:
        for target_gene in regulators_to_targets[gene]["targets"]:
            network_file.write(f'{regulators_to_targets[gene]["systematic_name"]}\t{regulators_to_targets[gene]["targets"][target_gene]}\t{taxon_id}\t{timestamp}\t{source}\n')
