Create a VTT SMA data class for clarity across scripts #26

davidverweij · 2021-02-02T15:43:55Z

I've spent too long on this and think it overcomplicates things, especially for now. I'll add my current thoughts below and move them into a separate issue for future work.

schemas > vttsma_record.py

from dataclasses import dataclass
from typing import List


@dataclass
class VTT_patient:
    """
    A patient-based object constructed from VTT S3 bucket metadata
    """
    vttsma_hash: str
    export_dates: List[str]

    def __init__(self, s3_bucket_entry: str):
        """
        extracting values from s3_bucket_entry.split following
        dump_date, raw/files, patienthash, patienthash.nfo/.zip/.audio? 
        """
        values = s3_bucket_entry.split('/')
        self.vttsma_hash = values[2]
        self.export_dates = [values[0]]

    def __eq__(self, other):
        return (
            isinstance(other, self.__class__) and
            getattr(other, 'vttsma_hash', None) == self.vttsma_hash
        )

lib > vttsma.py > get_list()

records = [SMA_Record(r.key) for r in objects if 'users.txt' not in r.key]
# TODO: create a set which removes duplicates, but merges the dump_dates
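
One way the TODO above could be approached is to key records by patient hash and merge the dump dates as duplicates appear. This is only a minimal sketch, assuming keys follow the dump_date/folder/patienthash/filename layout described in the docstring; `PatientRecord`, `merge_records` and the example keys are hypothetical and not part of the repository:

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class PatientRecord:
    """Hypothetical stand-in for the VTT_patient class sketched above."""
    vttsma_hash: str
    export_dates: Set[str] = field(default_factory=set)


def merge_records(keys: Iterable[str]) -> List[PatientRecord]:
    """Return one record per patient hash, merging all export dates seen."""
    merged: Dict[str, PatientRecord] = {}
    for key in keys:
        if 'users.txt' in key:
            continue
        # Assumed layout: dump_date/folder/patienthash/filename
        dump_date, _folder, patient_hash, _filename = key.split('/')
        record = merged.setdefault(patient_hash, PatientRecord(patient_hash))
        record.export_dates.add(dump_date)
    return list(merged.values())


# Invented keys, for illustration only
example_keys = [
    '2021-01-25/raw/abc123/abc123.zip',
    '2021-02-01/raw/abc123/abc123.zip',
    '2021-02-01/raw/def456/def456.nfo',
    '2021-02-01/users.txt',
]
records = merge_records(example_keys)
```

Using a dict keyed by hash avoids relying on a custom `__eq__` for de-duplication, and storing the dates in a set means a repeated dump date is only kept once.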

Originally posted by @davidverweij in #21 (comment)

davidverweij · 2021-02-02T15:44:40Z

Since we know the format, I wonder if it would be more readable to create a dataclass such that we can do:

split_paths = [SMA_Record(*p.split('/')) for p in object_paths if p.find('users.txt') == -1]

This way you could write, e.g., split_paths[0].dump_date rather than trying to remember what is in each index.

Also, since p is a string, the filter can use `not in` instead of find(), e.g.,

[p.split('/') for p in object_paths if 'users.txt' not in p]

Originally posted by @jawrainey in #21 (comment)

jawrainey · 2021-02-02T17:06:16Z

This looks good and would be most suitable when we process the VTT data weekly. I noted in the prior PR that having an SMA_Record or similar would help with accessing attributes in both get_list and download_metadata; this is purely for readability. Having said that, this structure needs to be reconsidered if we plan to store multiple dates per record.

At the moment, get_list returns a list of dicts. It could instead return a dict where the key is the patient ID and the value is the set of weeks in which that patient exists:

from collections import defaultdict

def get_list(bucket: Bucket) -> dict:
    """
    GET all records (metadata) from the AWS S3 bucket

    NOTE: S3 folder structure is symbolic. The 'key' (str) for each file object
    represents the path. See also `download_metadata()` in devices > vttsma.py
    """
    results = defaultdict(set)

    # obj is an S3 object summary, so the filename check must use its key (a str)
    paths = [obj.key for obj in bucket.objects.all() if 'users.txt' not in obj.key]
    for path in paths:
        export_date, _, _hash, __ = path.split('/')
        results[_hash].add(export_date)
    return results
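
To try the dict-based version without AWS access, the bucket can be swapped for a small stand-in. This is a sketch that assumes only that `bucket.objects.all()` yields objects with a `key` attribute, as boto3's object summaries do; `FakeBucket` and the keys are invented for illustration:

```python
from collections import defaultdict
from types import SimpleNamespace


class FakeBucket:
    """Hypothetical stand-in exposing only bucket.objects.all()."""

    def __init__(self, keys):
        summaries = [SimpleNamespace(key=k) for k in keys]
        self.objects = SimpleNamespace(all=lambda: summaries)


def get_list(bucket) -> dict:
    """Map each patient hash to the set of export dates it appears in."""
    results = defaultdict(set)
    for obj in bucket.objects.all():
        if 'users.txt' in obj.key:
            continue
        export_date, _, _hash, __ = obj.key.split('/')
        results[_hash].add(export_date)
    return results


bucket = FakeBucket([
    '2021-01-25/raw/abc123/abc123.zip',
    '2021-02-01/raw/abc123/abc123.zip',
    '2021-02-01/raw/def456/def456.nfo',
    '2021-02-01/users.txt',
])
weeks_by_hash = get_list(bucket)
```

Because a defaultdict creates an entry when a missing hash is looked up, callers may prefer `weeks_by_hash.get(...)` or converting the result with `dict(results)` before returning it.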

Alternatively, get_list could return a list of dataclass instances, e.g.,

from dataclasses import dataclass

@dataclass
class SMA_Record:
    export_date: str
    # raw or audio
    folder_name: str
    hash_id: str
    # patienthash.nfo/.zip/.audio?
    files: str

def get_list(bucket: Bucket) -> [SMA_Record]:
    """
    -
    """
    results = []
    paths = [obj.key for obj in bucket.objects.all() if 'users.txt' not in obj.key]
    for path in paths:
        results.append(SMA_Record(*path.split('/')))
    return results

Alternatively, as we're only building a list via append, a list comprehension would do:

def get_list(bucket: Bucket) -> [SMA_Record]:
    paths = [obj.key for obj in bucket.objects.all() if 'users.txt' not in obj.key]
    return [SMA_Record(*path.split('/')) for path in paths]
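
Note that positional unpacking with `*path.split('/')` raises a TypeError whenever a key does not have exactly four segments. A stricter constructor could fail with a clearer message; the following is a hedged sketch assuming the four-part layout above, where `from_key` is a hypothetical helper rather than existing code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMA_Record:
    export_date: str
    folder_name: str  # raw or audio
    hash_id: str
    files: str  # e.g. patienthash.nfo or patienthash.zip

    @classmethod
    def from_key(cls, key: str) -> 'SMA_Record':
        """Build a record from an S3 key, failing clearly on other layouts."""
        parts = key.split('/')
        if len(parts) != 4:
            raise ValueError(
                f'expected 4 path segments, got {len(parts)}: {key!r}'
            )
        return cls(*parts)


# Invented key, for illustration only
record = SMA_Record.from_key('2021-02-01/raw/abc123/abc123.zip')
```

With `frozen=True` the generated `__hash__` makes records hashable, so they could also be collected in a set if duplicates need removing later.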

davidverweij added data-transfer Data Transfer Protocol VTT FS Device labels Feb 2, 2021

davidverweij mentioned this issue Feb 2, 2021

Implement VTT DAG process, resolves #20 #21

Merged
