Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add query to remove user ids from filepaths #1259

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pahatz
Copy link
Contributor

@pahatz pahatz commented Jan 13, 2025

Related issue(s) and PR(s)

This PR closes #1123.

Description

Previous work from #1097 removed the user ids from the filepaths. This sql query now should take care of the current databases and update to remove the user ids from the filepaths there as well.

How to test

It has been tested manually with both test environment and production data.

bring up the test env

make-sda-s3-up

fill 4 test entries in the database

query=$(cat <<-END
    INSERT INTO sda.files (submission_user, submission_file_path)
VALUES
	('[email protected]', 'test_lifescience-ri.eu/dataset_test899898/dataset/IMAGES/IMAGE_test/1.2.826.0.1.00000.8.498.000.dcm.c4gh'),
	('testtesttest', 'testtesttest/dataset_krtkl45/picostorage/default/output/encrypted/Image_klfmgrY/1.2.000.0.1.3680000.8.111.222.dcm.c4gh'),
	('[email protected]', '[email protected]/dataset_dd33rr/media/lala/default/output/encrypted/lala_lolo/1.2.826.0.1.3333.8.498.1111.dcm.c4gh'),
	('[email protected]', '[email protected]/wrongname/picostorage/default/output/encrypted/lala/1.2.826.0.1.11.22.dcm.c4gh');
END

)

docker exec -it postgres psql -U postgres -d sda -c "$query"

run the script

docker cp postgresql/migratedb.d/17.sql postgres:/17.sql
docker exec -it postgres psql -U postgres -d sda -f /17.sql

check data is right

docker exec -it postgres psql -U postgres -d sda -c "select submission_user, submission_file_path from sda.files"

the user ids should be gone from the submission_file_path

@pahatz pahatz self-assigned this Jan 13, 2025
@pahatz pahatz marked this pull request as ready for review February 5, 2025 10:21
INSERT INTO sda.dbschema_version VALUES(sourcever+1, now(), changes);

UPDATE sda.files
SET submission_file_path = regexp_replace(submission_file_path, '^[^/]*/', '');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution is a bit harsh since it will remove the first part of the path even if is not a username.

Suggested change
SET submission_file_path = regexp_replace(submission_file_path, '^[^/]*/', '');
SET submission_file_path = REPLACE(submission_file_path, CONCAT(REPLACE(submission_user, '@', '_'),'/'), '');;

Copy link
Contributor Author

@pahatz pahatz Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I went this way is that we have some old entries in BP that are missing the @ domain suffix in users and the corresponding _domain suffix in file path.
But the following seems to work for those cases too:

UPDATE sda.files
SET submission_file_path = CASE
	WHEN LENGTH(submission_file_path) = 40 THEN regexp_replace(submission_file_path, '[@_]/^[^/]', '')
	ELSE REPLACE(submission_file_path, CONCAT(REPLACE(submission_user, '@', '_'),'/'), '')
END;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this maybe sholdn't even be a .sql file by´ut rather a readme so the DB admin knows that it needs to be done. Because this is not something that should be allowed to happen automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, let's discuss this on a daily.

@MalinAhlberg
Copy link
Contributor

Just a note, #1161 was merged into feature/do-not-leak-user-ids, so that all related work could be grouped there before merged to main. Maybe this one should go there to? Or should feature/do-not-leak-user-ids be merged after this one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create and test SQL-statements for updating the database to remove user-ids from the filepaths
3 participants