Merge pull request #975 from dondi/beta
v6.0.0
dondi authored Aug 31, 2022
2 parents 9dda904 + 7177aaa commit cea4225
Showing 31 changed files with 1,480 additions and 295 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -17,6 +17,9 @@ logs
results
/.idea

database/network-database/script-results
database/network-database/source-files

npm-debug.log
node_modules
package-lock.json
4 changes: 2 additions & 2 deletions README.md
@@ -1,8 +1,8 @@
GRNsight
========
[![DOI](https://zenodo.org/badge/16195791.svg)](https://zenodo.org/badge/latestdoi/16195791)
[![Build Status](https://app.travis-ci.com/dondi/GRNsight.svg?branch=master)](https://app.travis-ci.com/dondi/GRNsight)
[![Coverage Status](https://coveralls.io/repos/github/dondi/GRNsight/badge.svg?branch=master)](https://coveralls.io/github/dondi/GRNsight?branch=master)
[![Build Status](https://app.travis-ci.com/dondi/GRNsight.svg?branch=beta)](https://app.travis-ci.com/dondi/GRNsight)
[![Coverage Status](https://coveralls.io/repos/github/dondi/GRNsight/badge.svg?branch=beta)](https://coveralls.io/github/dondi/GRNsight?branch=beta)

http://dondi.github.io/GRNsight/

1 change: 1 addition & 0 deletions database/README.md
@@ -0,0 +1 @@
This directory contains the files pertaining to both the network and expression databases. See the README.md file in each folder for information about the schema you intend to use.
44 changes: 44 additions & 0 deletions database/network-database/README.md
@@ -0,0 +1,44 @@
# Network Database (Schema)

All files pertaining to the network database live within this directory.

## The basics

### Schema

All network data is stored within the spring2022_network schema on our Postgres database.

The schema lives at the top level of this directory in the file `schema.sql`, which defines the tables within the spring2022_network schema.

### Scripts

All scripts live within the subdirectory `scripts`, located at the top level of the network database directory.

Any source files required to run the scripts live within the subdirectory `source-files`, located at the top level of the network database directory. Because source files may be large, this directory is not checked in (see `.gitignore`); create it yourself and place any source files you need there.

All generated results of the scripts live in the subdirectory `script-results`, located at the top level of the network database directory. Currently, every script that generates output creates this directory if it does not already exist. When adding a new script that generates output, best practice is to create the `script-results` directory and any subdirectories if they do not exist, to prevent errors and snafus in freshly cloned repositories.

Within the scripts directory, there are the following files:

- `generate_network.py`
- `generate_sgd_network_from_yeastract_network.py`
- `loader.py`
- `filter_genes.py`

#### Network Generator (and data preprocessor)

This script (`generate_network.py`) is a two-for-one. It first uses the YeastMine service of the SGD database to query for all regulator genes relating to Saccharomyces cerevisiae, then gets all of the targets for each regulator gene. It then constructs two networks from these connections (a regulator-by-regulator matrix as well as a regulator-by-target matrix), along with the processed loader files, so that they are ready to load using `loader.py`.

The resulting network matrices are located in `script-results/networks`, and the resulting processed loader files are located within `script-results/processed-loader-files`.
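For illustration, a network matrix is a tab-separated file whose header row lists the regulators and whose first column lists the targets, with 1/0 marking the presence or absence of a regulatory connection (the gene names below are hypothetical examples, not actual script output):

```
cols regulators/rows targets	ACE2	SWI5
CTS1	1	1
EGT2	0	1
```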

Make sure to have all dependencies installed beforehand or you will receive errors (`pip3 install intermine tzlocal`, etc.; see the file for all imports).

Usage:
```
python3 generate_network.py
```
#### Generate an SGD network from a Yeastract network

This script takes a network (assumed to contain data from Yeastract, though it can be any network) and produces a network with data queried from YeastMine (SGD). It takes the regulators and targets from the given network file, queries YeastMine for the regulatory connections between those genes, and creates a new network from the data obtained.


24 changes: 24 additions & 0 deletions database/network-database/schema.sql
@@ -0,0 +1,24 @@
CREATE TABLE spring2022_network.source (
    time_stamp TIMESTAMP,
    source VARCHAR,
    PRIMARY KEY(time_stamp, source)
);

CREATE TABLE spring2022_network.gene (
    gene_id VARCHAR,            -- systematic-like name
    display_gene_id VARCHAR,    -- standard-like name
    species VARCHAR,
    taxon_id VARCHAR,
    regulator BOOLEAN,
    PRIMARY KEY(gene_id, taxon_id)
);

CREATE TABLE spring2022_network.network (
    regulator_gene_id VARCHAR,
    target_gene_id VARCHAR,
    taxon_id VARCHAR,
    time_stamp TIMESTAMP,
    source VARCHAR,
    FOREIGN KEY (regulator_gene_id, taxon_id) REFERENCES spring2022_network.gene(gene_id, taxon_id),
    FOREIGN KEY (target_gene_id, taxon_id) REFERENCES spring2022_network.gene(gene_id, taxon_id),
    FOREIGN KEY (time_stamp, source) REFERENCES spring2022_network.source(time_stamp, source)
);
76 changes: 76 additions & 0 deletions database/network-database/scripts/filter_genes.py
@@ -0,0 +1,76 @@
import psycopg2
import csv

PROCESSED_GENES = "../script-results/processed-loader-files/gene.csv"
MISSING_GENE_DESTINATION = '../script-results/processed-loader-files/missing-genes.csv'
UPDATE_GENE_DESTINATION = '../script-results/processed-loader-files/update-genes.csv'

connection = None
try:
    connection = psycopg2.connect(user="postgres",
                                  password="",
                                  host="grnsight2.cfimp3lu6uob.us-west-1.rds.amazonaws.com",
                                  port="5432",
                                  database="postgres")
    cursor = connection.cursor()
    postgreSQL_select_Query = "select * from spring2022_network.gene"

    cursor.execute(postgreSQL_select_Query)
    print("Selecting rows from gene table using cursor.fetchall")
    gene_records = cursor.fetchall()

    db_genes = {}
    missing_genes = {}
    genes_to_update = {}
    for gene in gene_records:
        # key = (gene_id, taxon_id)
        key = (gene[0], gene[3])
        value = {"display_gene_id": gene[1], "species": gene[2], "regulator": gene[4]}
        db_genes[key] = value

    print(f'Processing file {PROCESSED_GENES}')
    with open(PROCESSED_GENES, 'r', encoding="UTF-8") as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)  # skip the header row
        for row in reader:
            gene_id, display_gene_id, species, taxon_id, regulator = row[:5]
            key = (gene_id, taxon_id)
            value = {"display_gene_id": display_gene_id, "species": species, "regulator": regulator}
            if key not in db_genes:
                missing_genes[key] = value
            elif db_genes[key]["display_gene_id"] != display_gene_id:
                # the display gene id got updated, so update our db to account for that
                genes_to_update[key] = value

    headers = 'Gene ID\tDisplay Gene ID\tSpecies\tTaxon ID\tRegulator'

    print('Creating missing-genes.csv\n')
    with open(MISSING_GENE_DESTINATION, 'w') as gene_file:
        gene_file.write(f'{headers}\n')
        for gene in missing_genes:
            gene_file.write(f'{gene[0]}\t{missing_genes[gene]["display_gene_id"]}\t{missing_genes[gene]["species"]}\t{gene[1]}\t{missing_genes[gene]["regulator"]}\n')

    print('Creating update-genes.csv\n')
    with open(UPDATE_GENE_DESTINATION, 'w') as gene_file:
        gene_file.write(f'{headers}\n')
        for gene in genes_to_update:
            gene_file.write(f'{gene[0]}\t{genes_to_update[gene]["display_gene_id"]}\t{genes_to_update[gene]["species"]}\t{gene[1]}\t{genes_to_update[gene]["regulator"]}\n')

except (Exception, psycopg2.Error) as error:
    print("Error while fetching data from PostgreSQL", error)

finally:
    # closing database connection.
    if connection:
        cursor.close()
        connection.close()
        print("PostgreSQL connection is closed")
199 changes: 199 additions & 0 deletions database/network-database/scripts/generate_network.py
@@ -0,0 +1,199 @@
from __future__ import print_function

import csv
import re
import sys
import os
import datetime
import pytz
import tzlocal

from intermine.webservice import Service

service = Service("https://yeastmine.yeastgenome.org/yeastmine/service")

# Get Network Data from Yeastmine

query = service.new_query("Gene")

query.add_view(
    "primaryIdentifier", "secondaryIdentifier", "symbol", "name", "sgdAlias",
    "regulationSummary.summaryParagraph",
    "regulationSummary.publications.pubMedId",
    "regulationSummary.publications.citation"
)
query.outerjoin("regulationSummary.publications")

regulators = {}
all_genes = {}
print("COLLECTING REGULATORS\n")
for row in query.rows():
    systematic_name = row["secondaryIdentifier"]
    standard_name = row["symbol"]
    if standard_name is None:
        standard_name = systematic_name

    regulators[standard_name] = systematic_name
    all_genes[standard_name] = systematic_name

regulators_to_targets = {}
all_targets = {}


print("COLLECTING TARGETS\n")
for regulator in regulators:
    query = service.new_query("Gene")
    query.add_constraint("regulatoryRegions", "TFBindingSite")
    query.add_view(
        "regulatoryRegions.regulator.symbol",
        "regulatoryRegions.regulator.secondaryIdentifier", "symbol",
        "secondaryIdentifier", "regulatoryRegions.regEvidence.ontologyTerm.name",
        "regulatoryRegions.regEvidence.ontologyTerm.identifier",
        "regulatoryRegions.experimentCondition",
        "regulatoryRegions.strainBackground",
        "regulatoryRegions.regulationDirection",
        "regulatoryRegions.publications.pubMedId", "regulatoryRegions.datasource",
        "regulatoryRegions.annotationType"
    )
    query.add_sort_order("Gene.secondaryIdentifier", "ASC")
    query.add_constraint("regulatoryRegions.regulator", "LOOKUP", regulator, "S. cerevisiae", code="A")
    targets = {}

    for row in query.rows():
        target_systematic_name = row["secondaryIdentifier"]
        target_standard_name = row["symbol"]
        if target_standard_name is None:
            target_standard_name = target_systematic_name
        targets[target_standard_name] = target_systematic_name
        all_targets[target_standard_name] = target_systematic_name
        all_genes[target_standard_name] = target_systematic_name

    regulators_to_targets[regulator] = {"systematic_name": regulators[regulator], "targets": targets}


def create_regulator_to_target_row(target, all_regulators):
    # Build one tab-separated matrix row: the target name followed by
    # 1/0 for each regulator, marking whether it regulates this target.
    result = "" + target
    for regulator in all_regulators:
        if target in all_regulators[regulator]["targets"]:
            result += "\t" + "1"
        else:
            result += "\t" + "0"
    return result


# Create files

# Create folder paths
os.makedirs('../script-results', exist_ok=True)
os.makedirs('../script-results/networks', exist_ok=True)
os.makedirs('../script-results/processed-loader-files', exist_ok=True)


# Files to be generated

# Create Networks

REGULATORS_TO_TARGETS_MATRIX = '../script-results/networks/regulators_to_targets.csv'
REGULATORS_TO_REGULATORS_MATRIX = '../script-results/networks/regulators_to_regulators.csv'


targets = [target for target in all_targets if target is not None]
regulators_list = [regulator for regulator in regulators_to_targets if regulator is not None]

print('Creating REGULATORS TO TARGETS MATRIX\n')
with open(REGULATORS_TO_TARGETS_MATRIX, 'w') as regulator_to_target_file:
    headers = 'cols regulators/rows targets\t' + '\t'.join(regulators_list)
    regulator_to_target_file.write(f'{headers}\n')
    for target in targets:
        result = create_regulator_to_target_row(target, regulators_to_targets)
        regulator_to_target_file.write(f'{result}\n')

print('Creating REGULATORS TO REGULATORS MATRIX\n')
with open(REGULATORS_TO_REGULATORS_MATRIX, 'w') as regulator_to_regulator_file:
    headers = 'cols regulators/rows regulators\t' + '\t'.join(regulators_list)
    regulator_to_regulator_file.write(f'{headers}\n')
    # rows are regulators here, since this matrix is regulator-by-regulator
    for target in regulators_list:
        result = create_regulator_to_target_row(target, regulators_to_targets)
        regulator_to_regulator_file.write(f'{result}\n')


# Create loader-files

# Source Table

SOURCE_DESTINATION = '../script-results/processed-loader-files/source.csv'

timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
source = "YeastMine - Saccharomyces Genome Database"

with open(SOURCE_DESTINATION, 'w') as source_file:
    source_file.write(f'Timestamp\tSource\n{timestamp}\t{source}\n')

# Gene Table

GENE_DESTINATION = '../script-results/processed-loader-files/gene.csv'

species = "Saccharomyces cerevisiae"
taxon_id = "559292"

print('Creating gene.csv\n')
with open(GENE_DESTINATION, 'w') as gene_file:
    gene_file.write('Gene ID\tDisplay Gene ID\tSpecies\tTaxon ID\tRegulator\n')
    for gene in all_genes:
        regulator_flag = "true" if gene in regulators else "false"
        gene_file.write(f'{all_genes[gene]}\t{gene}\t{species}\t{taxon_id}\t{regulator_flag}\n')


# Network Table

NETWORK_DESTINATION = '../script-results/processed-loader-files/network.csv'

print('Creating network.csv\n')
with open(NETWORK_DESTINATION, 'w') as network_file:
    network_file.write('Regulator Gene ID\tTarget Gene ID\tTaxon ID\tTimestamp\tSource\n')
    for gene in regulators_to_targets:
        for target_gene in regulators_to_targets[gene]["targets"]:
            network_file.write(f'{regulators_to_targets[gene]["systematic_name"]}\t{regulators_to_targets[gene]["targets"][target_gene]}\t{taxon_id}\t{timestamp}\t{source}\n')
