Merge pull request #975 from dondi/beta
v6.0.0
Showing 31 changed files with 1,480 additions and 295 deletions.
Here are the files pertaining to both the network and expression databases. Look within the README.md files of both folders for information pertinent to the schema you intend to use.
# Network Database (Schema)

All files pertaining to the network database live within this directory.

## The basics

### Schema

All network data is stored within the `spring2022_network` schema on our Postgres database.

The schema is located within this directory at the top level in the file `schema.sql`. It defines the tables located within the `spring2022_network` schema.

### Scripts

All scripts live within the subdirectory `scripts`, located in the top level of the network database directory.

Any source files required to run the scripts live within the subdirectory `source-files`, located in the top level of the network database directory. Because source files may be large, you must create this directory yourself and add any source files you need there.

All generated results of the scripts live in the subdirectory `script-results`, located in the top level of the network database directory. Currently, every script that generates output creates this directory if it does not already exist. When adding a new script that generates output, best practice is to create the `script-results` directory and any subdirectories if they do not exist, in order to prevent errors for freshly cloned repositories.
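The directory-creation practice described above can be sketched as follows; the function name is illustrative, but the paths mirror those used by the existing scripts:

```python
import os

def ensure_output_dirs(base="../script-results",
                       subdirs=("networks", "processed-loader-files")):
    """Create the script-results directory tree if it does not already exist."""
    for sub in subdirs:
        # exist_ok=True makes this safe to call repeatedly, including on a fresh clone
        os.makedirs(os.path.join(base, sub), exist_ok=True)
```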
Within the `scripts` directory, there are the following files:

- `generate_network.py`
- `generate_sgd_network_from_yeastract_network.py`
- `loader.py`
- `filter_genes.py`

#### Network Generator (and data preprocessor)

This script (`generate_network.py`) is a two-for-one. It first uses the YeastMine service from the SGD database to query for all regulator genes relating to Saccharomyces cerevisiae. From there it gets all of the targets for each regulator gene. We then construct two networks from these connections (a regulator-by-regulator matrix as well as a regulator-by-target matrix). We also construct the processed loader files, so that they are ready to load using `loader.py`.

The resulting network matrices are located in `script-results/networks`, and the resulting processed loader files are located within `script-results/processed-loader-files`.

Make sure to have all dependencies installed beforehand or you will receive errors (`pip3 install intermine tzlocal`, etc.; see the file for all imports).

Usage:
```
python3 generate_network.py
```
#### Generate an SGD network from a Yeastract network

This script takes a network (assumed to have data from Yeastract, but it can be any given network) and gives you a network with data queried from YeastMine (SGD). It takes the regulators and targets from a given network file, then queries YeastMine to get the regulatory connections between the genes. From there, it creates a new network using the data obtained from YeastMine.
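The core of "creates a new network" is turning a set of regulatory connections into the tab-delimited matrix layout used under `script-results/networks`. A minimal illustrative sketch of that step (not the actual script; names and example genes are hypothetical):

```python
def edges_to_matrix(edges, regulators, targets):
    """Build a tab-delimited regulator-by-target matrix from (regulator, target) pairs.

    `edges` is a set of (regulator, target) tuples; the returned string has a
    header row of regulator names and one row per target, with 1/0 flags.
    """
    lines = ["cols regulators/rows targets\t" + "\t".join(regulators)]
    for target in targets:
        flags = ["1" if (reg, target) in edges else "0" for reg in regulators]
        lines.append("\t".join([target] + flags))
    return "\n".join(lines)
```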
`schema.sql`:

```sql
CREATE TABLE spring2022_network.source (
    time_stamp TIMESTAMP,
    source VARCHAR,
    PRIMARY KEY(time_stamp, source)
);

CREATE TABLE spring2022_network.gene (
    gene_id VARCHAR,          -- systematic-like name
    display_gene_id VARCHAR,  -- standard-like name
    species VARCHAR,
    taxon_id VARCHAR,
    regulator BOOLEAN,
    PRIMARY KEY(gene_id, taxon_id)
);

CREATE TABLE spring2022_network.network (
    regulator_gene_id VARCHAR,
    target_gene_id VARCHAR,
    taxon_id VARCHAR,
    time_stamp TIMESTAMP,
    source VARCHAR,
    FOREIGN KEY (regulator_gene_id, taxon_id) REFERENCES spring2022_network.gene(gene_id, taxon_id),
    FOREIGN KEY (target_gene_id, taxon_id) REFERENCES spring2022_network.gene(gene_id, taxon_id),
    FOREIGN KEY (time_stamp, source) REFERENCES spring2022_network.source(time_stamp, source)
);
```
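The composite foreign keys imply a load order: `source` and `gene` rows must exist before the `network` rows that reference them. A minimal sketch of that constraint using Python's built-in `sqlite3` (schema names flattened, since SQLite lacks Postgres schemas; the gene rows are illustrative sample data, not part of the loader):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE source (time_stamp TEXT, source TEXT, PRIMARY KEY(time_stamp, source));
CREATE TABLE gene (gene_id TEXT, display_gene_id TEXT, species TEXT,
                   taxon_id TEXT, regulator INTEGER, PRIMARY KEY(gene_id, taxon_id));
CREATE TABLE network (
    regulator_gene_id TEXT, target_gene_id TEXT, taxon_id TEXT,
    time_stamp TEXT, source TEXT,
    FOREIGN KEY (regulator_gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id),
    FOREIGN KEY (target_gene_id, taxon_id) REFERENCES gene(gene_id, taxon_id),
    FOREIGN KEY (time_stamp, source) REFERENCES source(time_stamp, source));
""")
# Parents first, then the referencing row.
conn.execute("INSERT INTO source VALUES ('2022-03-01 12:00:00', 'YeastMine')")
conn.execute("INSERT INTO gene VALUES ('YEL009C', 'GCN4', 'Saccharomyces cerevisiae', '559292', 1)")
conn.execute("INSERT INTO gene VALUES ('YCL030C', 'HIS4', 'Saccharomyces cerevisiae', '559292', 0)")
conn.execute("INSERT INTO network VALUES ('YEL009C', 'YCL030C', '559292', '2022-03-01 12:00:00', 'YeastMine')")
conn.commit()
```

Inserting a `network` row whose regulator or source is not yet loaded fails with an integrity error, which is why the loader files are generated and loaded table by table.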
`filter_genes.py`:

```python
import psycopg2
import csv

PROCESSED_GENES = "../script-results/processed-loader-files/gene.csv"
MISSING_GENE_DESTINATION = '../script-results/processed-loader-files/missing-genes.csv'
UPDATE_GENE_DESTINATION = '../script-results/processed-loader-files/update-genes.csv'

connection = None
cursor = None
try:
    connection = psycopg2.connect(user="postgres",
                                  password="",
                                  host="grnsight2.cfimp3lu6uob.us-west-1.rds.amazonaws.com",
                                  port="5432",
                                  database="postgres")
    cursor = connection.cursor()
    select_query = "SELECT * FROM spring2022_network.gene"

    cursor.execute(select_query)
    print("Selecting rows from gene table using cursor.fetchall")
    gene_records = cursor.fetchall()

    db_genes = {}
    missing_genes = {}
    genes_to_update = {}
    for gene in gene_records:
        # key = (gene_id, taxon_id)
        key = (gene[0], gene[3])
        db_genes[key] = {"display_gene_id": gene[1], "species": gene[2], "regulator": gene[4]}

    print(f'Processing file {PROCESSED_GENES}')
    with open(PROCESSED_GENES, 'r', encoding="UTF-8") as f:
        reader = csv.reader(f, delimiter='\t')
        next(reader)  # skip the header row
        for row in reader:
            gene_id, display_gene_id, species, taxon_id, regulator = row[:5]
            key = (gene_id, taxon_id)
            value = {"display_gene_id": display_gene_id, "species": species, "regulator": regulator}
            if key not in db_genes:
                missing_genes[key] = value
            elif db_genes[key]["display_gene_id"] != display_gene_id:
                # The display gene id was updated, so record it for a db update
                genes_to_update[key] = value

    headers = 'Gene ID\tDisplay Gene ID\tSpecies\tTaxon ID\tRegulator'

    print('Creating missing-genes.csv\n')
    with open(MISSING_GENE_DESTINATION, 'w') as gene_file:
        gene_file.write(f'{headers}\n')
        for (gene_id, taxon_id), value in missing_genes.items():
            gene_file.write(f'{gene_id}\t{value["display_gene_id"]}\t{value["species"]}\t{taxon_id}\t{value["regulator"]}\n')

    print('Creating update-genes.csv\n')
    with open(UPDATE_GENE_DESTINATION, 'w') as gene_file:
        gene_file.write(f'{headers}\n')
        for (gene_id, taxon_id), value in genes_to_update.items():
            gene_file.write(f'{gene_id}\t{value["display_gene_id"]}\t{value["species"]}\t{taxon_id}\t{value["regulator"]}\n')

except (Exception, psycopg2.Error) as error:
    print("Error while fetching data from PostgreSQL", error)

finally:
    # Close the database connection (connection stays None if connect() failed).
    if connection:
        cursor.close()
        connection.close()
        print("PostgreSQL connection is closed")
```
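The comparison at the heart of the script reduces to a small pure function; a sketch under that reading (function name and sample genes are hypothetical, not part of the script):

```python
def diff_genes(db_genes, file_genes):
    """Split file rows into genes missing from the db and genes whose
    display_gene_id changed. Both arguments map (gene_id, taxon_id) ->
    {"display_gene_id": ...}."""
    missing, to_update = {}, {}
    for key, value in file_genes.items():
        if key not in db_genes:
            missing[key] = value
        elif db_genes[key]["display_gene_id"] != value["display_gene_id"]:
            to_update[key] = value
    return missing, to_update
```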
`generate_network.py`:

```python
from __future__ import print_function

from intermine.webservice import Service
service = Service("https://yeastmine.yeastgenome.org/yeastmine/service")

import csv
import os
import datetime

# Get network data from YeastMine

query = service.new_query("Gene")

query.add_view(
    "primaryIdentifier", "secondaryIdentifier", "symbol", "name", "sgdAlias",
    "regulationSummary.summaryParagraph",
    "regulationSummary.publications.pubMedId",
    "regulationSummary.publications.citation"
)
query.outerjoin("regulationSummary.publications")

regulators = {}
all_genes = {}
print("COLLECTING REGULATORS\n")
for row in query.rows():
    systematic_name = row["secondaryIdentifier"]
    standard_name = row["symbol"]
    if standard_name is None:
        standard_name = systematic_name

    regulators[standard_name] = systematic_name
    all_genes[standard_name] = systematic_name

regulators_to_targets = {}
all_targets = {}

print("COLLECTING TARGETS\n")
for regulator in regulators:
    query = service.new_query("Gene")
    query.add_constraint("regulatoryRegions", "TFBindingSite")
    query.add_view(
        "regulatoryRegions.regulator.symbol",
        "regulatoryRegions.regulator.secondaryIdentifier", "symbol",
        "secondaryIdentifier", "regulatoryRegions.regEvidence.ontologyTerm.name",
        "regulatoryRegions.regEvidence.ontologyTerm.identifier",
        "regulatoryRegions.experimentCondition",
        "regulatoryRegions.strainBackground",
        "regulatoryRegions.regulationDirection",
        "regulatoryRegions.publications.pubMedId", "regulatoryRegions.datasource",
        "regulatoryRegions.annotationType"
    )
    query.add_sort_order("Gene.secondaryIdentifier", "ASC")
    query.add_constraint("regulatoryRegions.regulator", "LOOKUP", regulator, "S. cerevisiae", code="A")
    targets = {}

    for row in query.rows():
        target_systematic_name = row["secondaryIdentifier"]
        target_standard_name = row["symbol"]
        if target_standard_name is None:
            target_standard_name = target_systematic_name
        targets[target_standard_name] = target_systematic_name
        all_targets[target_standard_name] = target_systematic_name
        all_genes[target_standard_name] = target_systematic_name

    regulators_to_targets[regulator] = {"systematic_name": regulators[regulator], "targets": targets}


def create_regulator_to_target_row(target, all_regulators):
    """One matrix row: the target name, then a 1/0 flag per regulator."""
    result = target
    for regulator in all_regulators:
        result += "\t1" if target in all_regulators[regulator]["targets"] else "\t0"
    return result


# Create folder paths
for path in ('../script-results', '../script-results/networks',
             '../script-results/processed-loader-files'):
    if not os.path.exists(path):
        os.makedirs(path)

# Create networks

REGULATORS_TO_TARGETS_MATRIX = '../script-results/networks/regulators_to_targets.csv'
REGULATORS_TO_REGULATORS_MATRIX = '../script-results/networks/regulators_to_regulators.csv'

targets = [target for target in all_targets if target is not None]
regulators_list = [regulator for regulator in regulators_to_targets if regulator is not None]

print('Creating REGULATORS TO TARGETS MATRIX\n')
with open(REGULATORS_TO_TARGETS_MATRIX, 'w') as regulator_to_target_file:
    headers = 'cols regulators/rows targets\t' + '\t'.join(regulators_list)
    regulator_to_target_file.write(f'{headers}\n')
    for target in targets:
        result = create_regulator_to_target_row(target, regulators_to_targets)
        regulator_to_target_file.write(f'{result}\n')

print('Creating REGULATORS TO REGULATORS MATRIX\n')
with open(REGULATORS_TO_REGULATORS_MATRIX, 'w') as regulator_to_regulator_file:
    headers = 'cols regulators/rows targets\t' + '\t'.join(regulators_list)
    regulator_to_regulator_file.write(f'{headers}\n')
    for regulator in regulators_list:  # rows restricted to regulators
        result = create_regulator_to_target_row(regulator, regulators_to_targets)
        regulator_to_regulator_file.write(f'{result}\n')

# Create loader files

# Source table

SOURCE_DESTINATION = '../script-results/processed-loader-files/source.csv'
timestamp = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')  # zero-padded
source = "YeastMine - Saccharomyces Genome Database"

with open(SOURCE_DESTINATION, 'w') as source_file:
    source_file.write(f'Timestamp\tSource\n{timestamp}\t{source}\n')

# Gene table

GENE_DESTINATION = '../script-results/processed-loader-files/gene.csv'

species = "Saccharomyces cerevisiae"
taxon_id = "559292"

print('Creating gene.csv\n')
with open(GENE_DESTINATION, 'w') as gene_file:
    gene_file.write('Gene ID\tDisplay Gene ID\tSpecies\tTaxon ID\tRegulator\n')
    for gene in all_genes:
        regulator_flag = "true" if gene in regulators else "false"
        gene_file.write(f'{all_genes[gene]}\t{gene}\t{species}\t{taxon_id}\t{regulator_flag}\n')

# Network table

NETWORK_DESTINATION = '../script-results/processed-loader-files/network.csv'

print('Creating network.csv\n')
with open(NETWORK_DESTINATION, 'w') as network_file:
    network_file.write('Regulator Gene ID\tTarget Gene ID\tTaxon ID\tTimestamp\tSource\n')
    for gene in regulators_to_targets:
        entry = regulators_to_targets[gene]
        for target_gene in entry["targets"]:
            network_file.write(f'{entry["systematic_name"]}\t{entry["targets"][target_gene]}\t'
                               f'{taxon_id}\t{timestamp}\t{source}\n')
```