Create a VTT SMA data class for clarity across scripts #26

Open
davidverweij opened this issue Feb 2, 2021 · 2 comments
Labels
data-transfer Data Transfer Protocol VTT FS Device

Comments

@davidverweij
Member

I've messed around with this for too long and think it overcomplicates things, especially for now. I'll add my current thoughts below and move them into a separate issue for future work.

schemas > vttsma_record.py

from dataclasses import dataclass
from typing import List


@dataclass
class VTT_patient:
    """
    A patient-based object constructed from VTT S3 bucket metadata
    """
    vttsma_hash: str
    export_dates: List[str]

    # A hand-written __init__ takes precedence over the one @dataclass would generate
    def __init__(self, s3_bucket_entry: str):
        """
        Extract values from s3_bucket_entry.split('/'), which follows
        dump_date, raw/files, patienthash, patienthash.nfo/.zip/.audio?
        """
        values = s3_bucket_entry.split('/')
        self.vttsma_hash = values[2]
        self.export_dates = [values[0]]

    def __eq__(self, other):
        return (
            isinstance(other, self.__class__) and
            getattr(other, 'vttsma_hash', None) == self.vttsma_hash
        )

lib > vttsma.py > get_list()

records = [VTT_patient(r.key) for r in objects if 'users.txt' not in r.key]
# TODO: create a set which removes duplicates, but merges the dump_dates
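
One way the TODO above could be approached, as a minimal sketch only: key the records by patient hash in a dict and add each unseen dump date to the existing entry. This sidesteps putting the objects in a set, which would fail because a class that defines `__eq__` without `__hash__` is unhashable. `PatientRecord` and `merge_records` are hypothetical names, and the sample keys are invented to follow the `dump_date/raw/patienthash/filename` layout described in the docstring.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class PatientRecord:
    """Hypothetical stand-in for the patient class sketched above."""
    vttsma_hash: str
    export_dates: List[str] = field(default_factory=list)


def merge_records(keys: Iterable[str]) -> List[PatientRecord]:
    """Return one record per patient hash, with all dump dates merged."""
    merged: Dict[str, PatientRecord] = {}
    for key in keys:
        if 'users.txt' in key:
            continue
        parts = key.split('/')
        dump_date, patient_hash = parts[0], parts[2]
        # Reuse the existing record for this hash, or create a new one
        record = merged.setdefault(patient_hash, PatientRecord(patient_hash))
        if dump_date not in record.export_dates:
            record.export_dates.append(dump_date)
    return list(merged.values())


# Invented keys, for illustration only
sample_keys = [
    '2021-01-25/raw/abc123/abc123.nfo',
    '2021-01-25/raw/abc123/abc123.zip',
    '2021-02-01/raw/abc123/abc123.nfo',
    '2021-02-01/raw/def456/def456.nfo',
    '2021-02-01/users.txt',
]
records = merge_records(sample_keys)
```

Because dicts preserve insertion order, the records come back in the order each hash was first seen.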

Originally posted by @davidverweij in #21 (comment)

@davidverweij
Member Author

Since we know the format, I wonder if it would be more readable to create a dataclass such that we can do:

split_paths = [SMA_Record(*p.split('/')) for p in object_paths if p.find('users.txt') == -1]

This way you could write, e.g., split_paths[0].dump_date rather than trying to remember what is in each index.

Also, it is possible to use not in p since p is a string, e.g.,

[p.split('/') for p in object_paths if 'users.txt' not in p]

Originally posted by @jawrainey in #21 (comment)

@jawrainey
Member

jawrainey commented Feb 2, 2021

This looks good and would be most suitable when we process the VTT data weekly. I noted in the prior PR that having an SMA_Record or similar would help with accessing attributes in both get_list and download_metadata; this is purely for readability. Having said that, this structure needs to be reconsidered if we plan to store multiple dates per record.

At the moment, get_list returns a list of dicts, whereas it could either return a dict where the key is the ID and the value is the set of export dates (weeks) in which that patient appears:

from collections import defaultdict


def get_list(bucket: Bucket) -> dict:
    """
    GET all records (metadata) from the AWS S3 bucket

    NOTE: S3 folder structure is symbolic. The 'key' (str) for each file object
    represents the path. See also `download_metadata()` in devices > vttsma.py
    """
    results = defaultdict(set)

    # Filter on the key string (obj.key), not on the object summary itself
    paths = [obj.key for obj in bucket.objects.all() if 'users.txt' not in obj.key]
    for path in paths:
        # Key layout: export_date / folder / hash / filename
        export_date, _, _hash, __ = path.split('/')
        results[_hash].add(export_date)
    return results

or similar to above but with use of dataclass, e.g.,

from dataclasses import dataclass
from typing import List


@dataclass
class SMA_Record:
    export_date: str
    # raw or audio
    folder_name: str
    hash_id: str
    # patienthash.nfo/.zip/.audio?
    files: str


def get_list(bucket: Bucket) -> List[SMA_Record]:
    """
    GET all records (metadata) from the AWS S3 bucket, one SMA_Record per key
    """
    results = []
    # Filter on the key string (obj.key), not on the object summary itself
    paths = [obj.key for obj in bucket.objects.all() if 'users.txt' not in obj.key]
    for path in paths:
        results.append(SMA_Record(*path.split('/')))
    return results

Alternatively, as we're building a list via append:

def get_list(bucket: Bucket) -> List[SMA_Record]:
    paths = [obj.key for obj in bucket.objects.all() if 'users.txt' not in obj.key]
    return [SMA_Record(*path.split('/')) for path in paths]
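
Both versions above assume every key has exactly four segments; `SMA_Record(*path.split('/'))` raises a `TypeError` otherwise. As a rough sketch of how that could be made explicit, and of how a flat list of records could still yield multiple dates per patient, the block below adds a validating constructor and a grouping helper. `from_key` and `export_dates_by_hash` are hypothetical names not found in the codebase, and the keys are invented.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass
class SMA_Record:
    export_date: str
    folder_name: str
    hash_id: str
    files: str

    @classmethod
    def from_key(cls, key: str) -> 'SMA_Record':
        """Hypothetical constructor that checks the key layout first."""
        parts = key.split('/')
        if len(parts) != 4:
            raise ValueError(f'expected 4 path segments, got {len(parts)}: {key!r}')
        return cls(*parts)


def export_dates_by_hash(records: Iterable[SMA_Record]) -> Dict[str, Set[str]]:
    """Collapse a flat list of records into hash_id -> set of export dates."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        grouped[record.hash_id].add(record.export_date)
    return dict(grouped)


# Invented keys, for illustration only
keys = [
    '2021-01-25/raw/abc123/abc123.zip',
    '2021-02-01/raw/abc123/abc123.zip',
    '2021-02-01/raw/def456/def456.nfo',
]
records = [SMA_Record.from_key(k) for k in keys]
by_hash = export_dates_by_hash(records)
```

Keeping the list of records as the primary return value and deriving the per-patient view on demand would let get_list and download_metadata share one structure, though whether that suits the weekly processing is a design call for the maintainers.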
