feat: allele rle normalization + pin pydantic version #234

Merged on Aug 30, 2023 (16 commits). Showing changes from 1 commit.
6 changes: 5 additions & 1 deletion src/ga4gh/vrs/normalize.py
@@ -84,6 +84,10 @@ def _normalize_allele(input_allele, data_proxy, rle_seq_limit=50):
entire region of ambiguity, resulting in an unambiguous representation that may be
readily compared with other alleles.

+ This function assumes that IRIs are dereferenced, providing either the accession
+ (refseq:NC_000006.12, NC_000006.12) or ga4gh identifier for a sequence
+ (ga4gh:SQ.0iKlIQk2oZLoeOG9P1riRU6hvL5Ux8TV).

:param input_allele: Input VRS Allele object
:param data_proxy: SeqRepo dataproxy
:param rle_seq_limit: If RLE is set as the new state, set the limit for the length
@@ -99,7 +103,7 @@ def _normalize_allele(input_allele, data_proxy, rle_seq_limit=50):
if isinstance(allele.location.sequence, models.SequenceReference):
alias = f"ga4gh:{allele.location.sequence.refgetAccession}"
theferrit32 marked this conversation as resolved.
else:
- # IRI
+ # Dereferenced IRI
alias = allele.location.sequence.root

# Get reference sequence and interval
56 changes: 28 additions & 28 deletions tests/cassettes/test_normalize_allele.yaml
@@ -11,7 +11,7 @@ interactions:
User-Agent:
- python-requests/2.31.0
method: GET
uri: http://localhost:5000/seqrepo/1/metadata/refseq:NC_000006.12
uri: http://localhost:5000/seqrepo/1/metadata/NC_000006.12
response:
body:
string: "{\n \"added\": \"2016-08-27T21:22:36Z\",\n \"aliases\": [\n \"GRCh38:6\",\n
@@ -34,7 +34,7 @@ interactions:
Content-Type:
- application/json
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -52,7 +52,7 @@ interactions:
User-Agent:
- python-requests/2.31.0
method: GET
uri: http://localhost:5000/seqrepo/1/sequence/refseq:NC_000006.12?start=26090950&end=26090951
uri: http://localhost:5000/seqrepo/1/sequence/NC_000006.12?start=26090950&end=26090951
response:
body:
string: C
@@ -64,7 +64,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -105,7 +105,7 @@ interactions:
Content-Type:
- application/json
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -135,7 +135,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -176,7 +176,7 @@ interactions:
Content-Type:
- application/json
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -206,7 +206,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -236,7 +236,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -266,7 +266,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -296,7 +296,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -326,7 +326,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -356,7 +356,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -386,7 +386,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -416,7 +416,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -434,7 +434,7 @@ interactions:
User-Agent:
- python-requests/2.31.0
method: GET
uri: http://localhost:5000/seqrepo/1/metadata/refseq:NC_000023.11
uri: http://localhost:5000/seqrepo/1/metadata/ga4gh:SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP
response:
body:
string: "{\n \"added\": \"2016-08-27T23:57:18Z\",\n \"aliases\": [\n \"GRCh38:X\",\n
@@ -457,7 +457,7 @@ interactions:
Content-Type:
- application/json
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -498,7 +498,7 @@ interactions:
Content-Type:
- application/json
Date:
- Wed, 30 Aug 2023 13:26:45 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -528,7 +528,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -558,7 +558,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -588,7 +588,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -618,7 +618,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -648,7 +648,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -678,7 +678,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -708,7 +708,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -738,7 +738,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -768,7 +768,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
@@ -798,7 +798,7 @@ interactions:
Content-Type:
- text/plain; charset=utf-8
Date:
- Wed, 30 Aug 2023 13:26:46 GMT
- Wed, 30 Aug 2023 14:40:08 GMT
Server:
- Werkzeug/2.2.2 Python/3.10.4
status:
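The recorded request URIs in this cassette follow a simple pattern: the sequence identifier is placed in the path and the interval in the query string, so changing the identifier used in a test changes the URI that must be recorded. The sketch below reproduces that pattern. The helper name `seqrepo_sequence_url` is hypothetical and this is not necessarily how the library's data proxy builds its requests; only the base URL, path layout, and `start`/`end` parameters are taken from the cassette.

```python
from urllib.parse import quote, urlencode


def seqrepo_sequence_url(base_url: str, alias: str, start: int, end: int) -> str:
    """Build a sequence-fetch URL shaped like the ones recorded in the cassette.

    Hypothetical helper for illustration; not part of ga4gh.vrs.
    """
    # Keep ":" unescaped so namespaced aliases such as "ga4gh:SQ..." stay readable.
    path_alias = quote(alias, safe=":")
    query = urlencode({"start": start, "end": end})
    return f"{base_url}/sequence/{path_alias}?{query}"


url = seqrepo_sequence_url(
    "http://localhost:5000/seqrepo/1", "NC_000006.12", 26090950, 26090951
)
print(url)
```

Because the cassette matches on the recorded URI, a test that switches from `refseq:NC_000006.12` to `NC_000006.12` needs its cassette re-recorded, which is what the changed `uri:` and `Date:` lines reflect.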
4 changes: 2 additions & 2 deletions tests/test_vrs_normalize.py
@@ -11,7 +11,7 @@
"location": {
"end": 26090951,
"start": 26090950,
- "sequence": "refseq:NC_000006.12",
+ "sequence": "NC_000006.12",
"type": "SequenceLocation"
},
"state": {
@@ -75,7 +75,7 @@
"type": "Allele",
"location": {
"type": "SequenceLocation",
- "sequence": "refseq:NC_000023.11",
+ "sequence": "ga4gh:SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP",
@ahwagner (Member), Aug 30, 2023:

An IRI is a reference to another object. It can be of any form under the IETF specification. When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object. This is true for every property in VRS where we allow for an IRI or object.

I think it is fair for us to assume this property (and every property) is dereferenced / has full object representation for normalization. We SHOULD NOT assume that an IRI takes a specific form (e.g. a refseq or ga4gh identifier) as we do here. I also believe that IRIs that contain a colon before an IRI fragment identifier (#; again, as seen here) are not valid IRIs.
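As a rough illustration of the point above, the sketch below treats `location.sequence` as either a full object or an opaque reference, and builds a lookup alias only from the full object. The `SequenceReference` and `IRI` dataclasses and the `sequence_alias` helper are hypothetical stand-ins written for this example, not the `ga4gh.vrs` models; only the `refgetAccession` and `root` attribute names and the `ga4gh:` prefix come from the diff in this PR.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class SequenceReference:
    """Stand-in for the VRS SequenceReference model (illustrative only)."""

    refgetAccession: str  # e.g. "SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP"


@dataclass
class IRI:
    """Stand-in for the VRS IRI model: an opaque reference to another object."""

    root: str


def sequence_alias(sequence: Union[SequenceReference, IRI]) -> str:
    """Return the alias to look up, requiring a dereferenced sequence."""
    if isinstance(sequence, SequenceReference):
        return f"ga4gh:{sequence.refgetAccession}"
    # No attempt is made to parse the IRI: its form is not assumed.
    raise TypeError(
        "location.sequence must be dereferenced to a SequenceReference "
        "before normalization"
    )
```

Under this assumption the function never inspects the text of an IRI, so it makes no claim about whether the IRI looks like a refseq or ga4gh identifier.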

Contributor Author:

We should add a regex pattern on the IRI class. I can open a new issue for this.

Contributor Author:

> An IRI is a reference to another object. It can be of any form. When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object. This is true for every property in VRS where we allow for an IRI or object.

Okay, I will update the code and tests to always assume a SequenceReference.

> We SHOULD NOT assume that an IRI takes a specific form (e.g. a refseq or ga4gh identifier) as we do here.

These were just examples for the tests. The SequenceProxy class will take the input (whether refseq, ga4gh, ensembl, etc.) and get the corresponding sequence.

Contributor:

> When we say the sequence slot is dereferenced, it means that instead of an IRI, we have a SequenceReference object.

@ahwagner thanks for this clarification. The Translator class will need to be updated to work like this (doesn't need to be in this PR). Currently it sets the sequence id (ga4gh:SQ, not ga4gh:SQR) as the location.sequence value.
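A minimal sketch of what such an update might involve, assuming the Translator starts from a plain `ga4gh:SQ.` sequence id string and needs to emit an object instead. The function name `to_sequence_reference` is hypothetical, and the dictionary shape (a `type` of `SequenceReference` plus a `refgetAccession` without the `ga4gh:` prefix) is an assumption inferred from the `f"ga4gh:{...refgetAccession}"` line in `normalize.py`, not a confirmed Translator API.

```python
def to_sequence_reference(sequence_id: str) -> dict:
    """Wrap a 'ga4gh:SQ.' sequence id in a SequenceReference-shaped dict.

    Hypothetical helper for illustration; not part of ga4gh.vrs.
    """
    prefix = "ga4gh:"
    if not sequence_id.startswith(prefix + "SQ."):
        # Other identifier forms would first need resolving to a ga4gh sequence id.
        raise ValueError(f"expected a ga4gh:SQ. identifier, got {sequence_id!r}")
    return {
        "type": "SequenceReference",
        # Assumed: refgetAccession holds the id without the "ga4gh:" namespace.
        "refgetAccession": sequence_id[len(prefix):],
    }


location = {
    "type": "SequenceLocation",
    "sequence": to_sequence_reference("ga4gh:SQ.w0WZEvgJF0zf_P4yyTzjjv9oW1z61HHP"),
}
```

With the sequence stored this way, normalization can rebuild the lookup alias from `refgetAccession` without interpreting an IRI string.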

"start": [155980374, 155980375],
"end": [155980377, 155980378]
},