Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update and expansion of the BfG spectral data #275

Merged
merged 5 commits into from
Jan 20, 2025
Merged

Conversation

olessmn
Copy link

@olessmn olessmn commented Jan 7, 2025

Dear MassBank Team,

we have made several updates and additions to our spectral data. This pull request includes the following changes:

  • Corrections to existing spectral data (e.g. added missing CAS numbers, InChIKeys, and InChI values), with many contributions from @tsufz
  • Added 646 new spectra

All txt-files were re-generated and replaced the previous files.

Please let us know if you have any feedback or require further adjustments.

Thank you!
Best regards,
@olessmn & @ksjewell

@tsufz
Copy link
Member

tsufz commented Jan 9, 2025

@olessmn and @ksjewell. Thanks for the updates of the BAFG records. I just run the validation workflow and there are some encoding issues by German umlaute and obviously the grade symbol.

image

Hence, I would like to ask you to:

  1. Fix the encoding issues in the record files (@meier-rene, I guess you may give a hint on this issue?)
  2. Push the fixed files back to your repository olessmn:dev

There is no need to close this pull request. As soon you push the corrected data, the PR will update automatically.

Best,
Tobias

@olessmn
Copy link
Author

olessmn commented Jan 10, 2025

Thank you @tsufz for the quick feedback!
I changed 'ö' to 'oe' and '° C' to 'deg C' in all documents.
If there is any other conflict, let me know.

Best wishes,
Ole

@tsufz
Copy link
Member

tsufz commented Jan 10, 2025

@olessmn,
Thanks so much for fixing this. The next issue is the missing linefeed at the end of some records after the termination //. e.g. https://github.com/olessmn/MassBank-data/blob/dev/BAFG/MSBNK-BAFG-CSL2501109503.txt.

Txs,
Tobias

@meier-rene
Copy link
Collaborator

meier-rene commented Jan 10, 2025

Thank you for your contribution. Im pretty sure we get this data in, but we need to clarify a few things and then modify the submission according to your information. I will assist you in this process. You wrote that you corrected a number of datasets and added 646 new spectra. The correction is in the area of the metadata, like cas or InChI.

I checked manually a few files and I will now guess what you have done. You removed all files from your last submission and replaced them with new files from your new submission. For your ACCESSION you use a code put together from a few letters, a date and a running number. Because you created a new batch of files they have now a new date resulting in a new accession, eg MSBNK-BAFG-CSL2311091 -> MSBNK-BAFG-CSL2501101 and MSBNK-BAFG-CSL2311092 -> MSBNK-BAFG-CSL2501102. The content of the dataset stays the same, especially the measurement is the same. This of course only applies to the spectra from your last contribution.

If this is the case, we dont want to change the ACCESSION, but just modify the content. If you actually would want to remove your old data we would add a line at the top of each file which says DEPRECATED and give a reason for the deprecation. Here is an example. The dataset would stay there as a tombstone to prevent re-usage of that id. The reason for this is easy to explain. After the release of the data, users could refer to the data by ACCESSION and if we shift around our ACCESSION and/or data this will give a mess.

If you want I can assist you in fixing this. I would suggest, that we move the data which was in your last contribution back to its old name and release the 646 new spectra with the new name. If you agree I could do that for you and in this process also make the small changes Tobias has mentioned.

BTW we dont want dec C instead of °C but sometimes characters get messed up if we switch between architectures and codepages as you can see in Tobias screenshot. There is not much we can do to prevent this, just check for it and fix it. If there are any non latin characters, like language specific characters in a name, they are of course allowed. So if the authors name is actually Björn Ehlig, then we will use Björn here. Is it Björn?

So please tell me if you agree that I tear apart your contribution and rearrange it as described. The contribution of course belongs to you guys, but it might happen that some of your commits will vanish from the GIT log and reappear with my username. I hope thats not an issue for you.

@olessmn
Copy link
Author

olessmn commented Jan 14, 2025

Hi @meier-rene, hi @tsufz,

Thank you for your response and support!

Sorry, I should have clarified this better in the beginning: Yes, in the initial (and the following submissions) I removed all files and replaced them with the updated files (re-generating all files upon changes is easier for me). However, I missed that the ACCESSION shouldn't be changed once a file is in the MassBank repository. I changed my code accordingly, so ACCESSION strings of existing MassBank files won't be updated when I export the files (see explanation below).

I made the following changes and reuploaded all files:

  • The existing 19,783 files keep their original ACCESSION (filename and inside the file), with only the file content updated.
  • The 646 new files received a new ACCESSION.
  • Reverted oe to ö and deg to ° (Yes exactly, the author is supposed to be Björn Ehlig)
  • Made sure encoding is UTF-8
  • Added a newline at the end of the file content after //

For the ACCESSION we use the format MSBNK-BAFG-CSL<date><experiment_id>.
date: In the format YYMMDD, representing the date when the file was exported (unless the file exists in the current MassBank repository).
experiment_id: Unique to the data entry in our database and never changes. Therefore, gaps may exist (not a running number) if data has been deleted.

Example 1:
MSBNK-BAFG-CSL2311091.txt retained its file name, but the file content was updated.
Inside the file:

ACCESSION: MSBNK-BAFG-CSL2311091
RECORD_TITLE: (2-dodecanoylamino-ethyl)-dimethyl-tetradecyl-ammonium; LC-ESI-QTOF; MS2; 110 V
DATE: 2025.01.13

The ACCESSION remained the original one, and DATE shows the date of the last modification (yesterday).

Example 2:
MSBNK-BAFG-CSL25011334487.txt is one of the newly added files, with the new ACCESSION.
Inside the file:

ACCESSION: MSBNK-BAFG-CSL25011334487
RECORD_TITLE: Disperse brown 1; LC-ESI-QTOF; MS2; 40 V
DATE: 2025.01.13

Please let me know if there are any other issues, and thank you for your patience.

Best wishes,
Ole

@meier-rene
Copy link
Collaborator

Thank you for your excellent description. This all sounds very good.

There is a report with the validation of the pull request. There are a few issues left. You can check them at the 'Details' button next to the failed report.
grafik
You will find that there are a few records where the calculated SPLASH does not match with the given SPLASH. There is maybe something with the peaklist. One would have to check that and fix the SPLASH of course.
Other records fail because the given InChI and SMILES do not match. Thats in most cases a mismatch in stereochemistry. In principle it looks like pretty easy fixes. You can do that or I will do that at the beginning of next week. Unfortunately I'm occupied with other tasks this week.

@meier-rene meier-rene merged commit 54af9d3 into MassBank:dev Jan 20, 2025
1 check failed
@meier-rene
Copy link
Collaborator

Excellent contribution. In the last Validation report there were just 4 minor issues left with the molecular formula of MSBNK-BAFG-CSL25011734709-11. It was given as negatively charged formula while it should be neutral. I will make a manual quick fix.

@meier-rene
Copy link
Collaborator

You mentioned in your records, there were created with "Export with pycsl 0.1.0.dev1 and CSL 25.0". Can you point me to that resource or is it an in-house solution? I couldn't find anything.

@olessmn
Copy link
Author

olessmn commented Jan 20, 2025

Hi @meier-rene,
in the last commit we made the following changes and reuploaded the data:

  • Fixed the SPLASH-generation issue
  • Corrected SMILES/ InChI of the spectra that generated different InChIKeys
  • Another 229 new spectra were added (accumulated in the meantime)

Regarding the export information:
CSL 25.0 is our "Collective Spectral Library," where we compile spectral data from us and other institutions.

pycsl 0.1.0.dev1 is the Python program we are currently developing to perform various operations on the CSL (e.g., exporting all BfG spectra, as in this case). It is currently for in-house use only, but we plan to release it on GitHub later this year.

We included these two information in the MS$DATA_PROCESSING comment for future reference. Is this okay?

And thank you for fixing the last issues!

Ole

@meier-rene
Copy link
Collaborator

We are very happy with your contribution. I was just curious about the software, because I never heard about it and I couldn't find anything. If you release this software later for public use I would be happy if you could ping me again and then we would investigate if it will be useful for others.

@olessmn olessmn deleted the dev branch January 22, 2025 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants