Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW]: cifkit: A Python package for coordination geometry and atomic site analysis #7205

Closed
editorialbot opened this issue Sep 9, 2024 · 91 comments
Assignees
Labels
accepted published Papers published in JOSS Python recommend-accept Papers recommended for acceptance in JOSS. review TeX Track: 2 (BCM) Biomedical Engineering, Biosciences, Chemistry, and Materials

Comments

@editorialbot
Copy link
Collaborator

editorialbot commented Sep 9, 2024

Submitting author: @bobleesj (Sangjoon Lee)
Repository: https://github.com/bobleesj/cifkit
Branch with paper.md (empty if default branch): main
Version: 1.0.6
Editor: @rkurchin
Reviewers: @espottesmith, @ml-evs, @lancekavalsky
Archive: 10.5281/zenodo.14086943

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/9016ae27b8c6fddffaae5aeb8be18d19"><img src="https://joss.theoj.org/papers/9016ae27b8c6fddffaae5aeb8be18d19/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/9016ae27b8c6fddffaae5aeb8be18d19/status.svg)](https://joss.theoj.org/papers/9016ae27b8c6fddffaae5aeb8be18d19)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@espottesmith, @ml-evs, and @lancekavalsky, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review.
First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @rkurchin know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Checklists

📝 Checklist for @espottesmith

📝 Checklist for @ml-evs

📝 Checklist for @lancekavalsky

@editorialbot
Copy link
Collaborator Author

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf

@editorialbot
Copy link
Collaborator Author

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

✅ OK DOIs

- 10.1016/j.commatsci.2012.10.028 is OK
- 10.1107/S010876739101067X is OK
- 10.1088/1361-648X/aa680e is OK
- 10.1021/acs.chemmater.4c01696 is OK
- 10.1016/j.dib.2024.110178 is OK
- 10.1016/j.jallcom.2023.173241 is OK

🟡 SKIP DOIs

- None

❌ MISSING DOIs

- None

❌ INVALID DOIs

- None

@editorialbot
Copy link
Collaborator Author

Software report:

github.com/AlDanial/cloc v 1.90  T=0.10 s (963.6 files/s, 108541.7 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          75           1346            911           5698
Markdown                         6            153              0            460
Jupyter Notebook                 3              0           1416            172
YAML                             4             17              2            126
TeX                              1              6              0            103
INI                              3              5              0             33
TOML                             1              4              0             24
-------------------------------------------------------------------------------
SUM:                            93           1531           2329           6616
-------------------------------------------------------------------------------

Commit count by author:

   139	Sangjoon Bob Lee

@editorialbot
Copy link
Collaborator Author

Paper file info:

📄 Wordcount for paper.md is 752

✅ The paper includes a Statement of need section

@editorialbot
Copy link
Collaborator Author

License info:

✅ License found: MIT License (Valid open source OSI approved license)

@espottesmith
Copy link

espottesmith commented Sep 9, 2024

Review checklist for @espottesmith

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the https://github.com/bobleesj/cifkit?
  • License: Does the repository contain a plain-text LICENSE or COPYING file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@bobleesj) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
  • Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
  • Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
  • Human and animal research: If the paper contains original data research on humans subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

@editorialbot
Copy link
Collaborator Author

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@ml-evs
Copy link

ml-evs commented Sep 9, 2024

Review checklist for @ml-evs

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the https://github.com/bobleesj/cifkit?
  • License: Does the repository contain a plain-text LICENSE or COPYING file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@bobleesj) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
  • Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
  • Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
  • Human and animal research: If the paper contains original data research on humans subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

@espottesmith
Copy link

Paper comments:

  • The summary is generally appropriate. From context, a non-expert might realize what *.cif files are (since crystal structures are mentioned in the second sentence), but this could be more explicitly stated in the first sentence.
  • I come from a computational background, so I'm not the target audience, but I am unconvinced that the basic functionality (i.e., parsing CIF files and extracting information like structures and formulas) is easier in cifkit than in pymatgen/ASE. Similar numbers of lines of code would be required, at least comparing pymatgen and the example given in the cifkit paper (see https://pymatgen.org/usage.html#reading-and-writing-structuresmolecules). Considering higher-level features of cifkit, including filtering groups of CIFs and visualization, I think that cifkit distinguishes itself from other codes. In this area, the authors' argument is more convincing. Perhaps a slight reframing of the Statement of Need is appropriate?
  • Otherwise, the paper looks good!

Code comments:

  • The folder "occupacny" should probably be "occupancy"
  • I initially tried installing with Python 3.9 (as far as I can tell, the "pyproject.toml" and the docs don't specify a version). This caused problems, as cifkit is using type annotation syntax that is not supported in Python 3.9. I now see on PYPI that cifkit only supports Python 3.10, 3.11, and 3.12. Please make this more clear for users.
  • In terms of code style, more extensive docstrings (for instance, explaining what the inputs and outputs should be) would be appreciated. It seems there are already some "TODO"s to this effect (see e.g., models/cif.py), so perhaps the authors are planning on making these changes soon. Function bodies are typically well commented (note that I have not read every file)

Documentation comments:

  • Examples in the docs aren't the most consistent. For instance, some example lines of code use "ensemble" as the variable, and some use "ensemble_test". It may also be better if the commented outputs reflected the example folder ("Example.ErCoIn_big_folder_path") so users can more easily verify correctness. Finally, incorporating the visualization tools in the docs (currently they're demonstrated only in the README) would be appreciated.
  • The statement of need could be made more clear in the docs
  • Community guidelines basically amount to "submit a PR or talk to the lead author", but for a project at this scale, perhaps that's appropriate.

Overall comments:

  • In terms of "substantial scholarly effort", this is a reasonably small, simple, and young project. That said, it is clearly already being used for scholarly research, and it is somewhat unique in terms of the features that it offers. As such, I think it should be considered substantial.
  • The paper does not contain original data or results
  • The authors do not make significant performance claims
  • With some relatively small tweaks, I think this will make a good addition to JOSS.

@ml-evs
Copy link

ml-evs commented Sep 9, 2024

This package exposes some convenience and utility functions on top of the gemmi CIF parser, for handling coordination and bonding analysis, polyhedron visualisation and data cleaning for primarily inorganic crystal structures represented as CIFs. The headline is that I am not sure this package is sufficiently robust to the various intricacies of CIF (or at least, is not well-documented enough to cover them). I've raised some specific technical issues at bobleesj/cifkit#27.

I'll try not to repeat @espottesmith's points (which I broadly agree with), so here's my first round of comments.

  • While this package clearly has some useful tools (particularly the nice polyhedron visualisation), I fail to see how this tool is easier to use than pymatgen/ASE as a selling point. If it is targeted at people with limited programming/Python experience, then much more care must be taken in guiding them through the installation process (with virtual environments etc).

  • Whilst "high throughput" is mentioned throughout the docs and paper, I see little in the way of e.g., assistance with parallelisation or batch processing (e.g., if I have 1m structures can I stream data without going via a file all the time?), nor any advanced error handling that could allow this package to be used in a highly automated way -- perhaps the authors could clarify what they mean with regards to high throughput in this case?

  • I do not think it is appropriate to mention test coverage explicitly in the paper; although the package does appear to be well-tested with good CI, pinned dependencies and lots of test cases, this metric can be a red herring.

  • There are a few glaring omissions in the references that could help provide useful background to readers:

  • Other potential but not mandatory references:

  • I feel this submission is borderline on the "Substantial scholarly effort" front, given the guidelines at https://joss.readthedocs.io/en/latest/submitting.html#substantial-scholarly-effort I am not convinced by the text that the library was crucial in the three examples listed by the authors in the applications section; perhaps this could be expanded? Perhaps even the functionality from the other related packages by the author listed in the paper (SAF and CBA) could be exposed in the cifcheck namespace as (effectively) one package?

@bobleesj
Copy link

bobleesj commented Sep 9, 2024

@espottesmith @ml-evs, thank you for testing the package and for the valuable feedback. I will address each point after internalizing the feedback and reflecting on it in the next iteration.

@rkurchin
Copy link

@editorialbot add @lancekavalsky as reviewer

Thanks for agreeing to review, Lance! Adding you over here so you can get started.

@editorialbot
Copy link
Collaborator Author

@lancekavalsky added to the reviewers list!

@lancekavalsky
Copy link

lancekavalsky commented Sep 13, 2024

Review checklist for @lancekavalsky

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the https://github.com/bobleesj/cifkit?
  • License: Does the repository contain a plain-text LICENSE or COPYING file with the contents of an OSI approved software license?
  • Contribution and authorship: Has the submitting author (@bobleesj) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines
  • Data sharing: If the paper contains original data, data are accessible to the reviewers. If the paper contains no original data, please check this item.
  • Reproducibility: If the paper contains original results, results are entirely reproducible by reviewers. If the paper contains no original results, please check this item.
  • Human and animal research: If the paper contains original data research on humans subjects or animals, does it comply with JOSS's human participants research policy and/or animal research policy? If the paper contains no such data, please check this item.

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems).
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of need' that clearly states what problems the software is designed to solve, who the target audience is, and its relation to other work?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

@lancekavalsky
Copy link

In this work, the cifkit package is presented to streamline the manipulation of cif structure files en masse through various utility functions including coordination analysis, visualization, and filtering by property. Overall, the package is cleanly written with a largely intuitive API from a user standpoint, with some useful features for easy visualization and error-handling upon ingestion.

Code comments:

  • For io related methods, e.g. the ones that generate histograms, more clarity is needed regarding where they write things by default. Running the histogram tutorial from the documentation, it wrote the histograms to a folder deep in my conda environment site-packages, which would likely not be intuitive to many users (particularly as this package is presented as being catered to users with less coding experience) and may cause issues on shared resources.

  • I will echo the previous comment that more docstrings would be invaluable in helping code clarity. In particular, I would urge adopting a standardized approach to this, such as Numpy, to be more in line with community standards

  • I have opened a technical issue I encountered here

Documentation comments:

  • As the core classes for this package are Cif and CifEnsemble, more explicit explanations as to the inputs and parameters would be helpful -- especially for CifEnsemble. For example, unless I'm missing it, a comprehensive list of everything that thepreprocess parameter triggers is not mentioned in the documentation.

  • In the documentation there are a couple instances of general clean-up required. One example is the first box on Getting Started uses a CIF method ensemble.cif_folder_path which gives an error when run. Another example is under the CIF specific documentation which refers to a README.md for complete documentation, but it is unclear where this file is located (since that info doesn't appear to be the main README in the repo?).

There are options in the Cif class to use either the by_d_min_method or by_best_methods. Please refer to the README.md for complete documentation.

  • This is relatively minor, but on the documentation website it wasn't immediately clear to me what would be contained under the Notebooks tab at the top. Given this outlines several of the core functionalities of cifkit, I would consider renaming it to something more descriptive.

  • The contributing guidelines at this stage are somewhat vague. For example, is a minimum code coverage required for added features?

Paper comments:

  • In the summary it mentions that this package is designed to process datasets on "the order of tens of thousands". It is not clear to me where this is coming from and what exactly causes the bottleneck for going beyond this estimate. Details regarding what determined this limitation would be helpful to judge high-throughput performance

  • The examples in the paper are presented with limited explanation as to what they are showing. While the comments help in the second example, and some of the methods are self-explanatory by their naming, more comments would help cement the clarity here.

Overall, this work would make a great addition to JOSS pending the minor revisions described above.

@bobleesj
Copy link

@lancekavalsky Thank you for your testing and all of the comments to improve the package and the manuscript! Please allow me to address each point when a new update is ready.

@rkurchin
Copy link

rkurchin commented Oct 7, 2024

Hi @bobleesj, just wanted to check in on how the updates are going!

@bobleesj
Copy link

bobleesj commented Oct 7, 2024

@rkurchin Thanks for checking in with me. I have made progress and addressed about halfway through. Would you please allow me to respond by the third week of October?

@rkurchin
Copy link

Hi there @bobleesj, checking in on this again!

@bobleesj
Copy link

@rkurchin Yes, the features addressed by the reviewers have been implemented. Please look out for our responses by today or, at the latest, tomorrow!

@bobleesj
Copy link

Response to @espottesmith

Paper comments:

The summary is generally appropriate. From context, a non-expert might realize what *.cif files are (since crystal structures are mentioned in the second sentence), but this could be more explicitly stated in the first sentence.

Thank you for your suggestion. The summary now incorporates what *.cif files are:

`cifkit` provides higher-level functions and properties for coordination
geometry and atomic site analysis from .cif files, which are standard file
formats for storing crystallographic data such as atomic fractional coordinates,
symmetry operations, and unit cell dimensions.

I come from a computational background, so I'm not the target audience, but I am unconvinced that the basic functionality (i.e., parsing CIF files and extracting information like structures and formulas) is easier in cifkit than in pymatgen/ASE. Similar numbers of lines of code would be required, at least comparing pymatgen and the example given in the cifkit paper (see https://pymatgen.org/usage.html#reading-and-writing-structuresmolecules). Considering higher-level features of cifkit, including filtering groups of CIFs and visualization, I think that cifkit distinguishes itself from other codes. In this area, the authors' argument is more convincing. Perhaps a slight reframing of the Statement of Need is appropriate?

This is a valid point. I have modified the "Statement of Need" and emphasized the utility of higher-level functions by modifying the section as follow:

`cifkit` distinguishes itself from existing libraries by offering higher-level
functions and variables that allow solid-state synthesists to obtain intuitive and
measurable properties impactful properties. It facilitates the visualization of
coordination geometry from each site using four coordination determination
methods and extracts physics-based features like volume and packing
efficiency, which are crucial for structural analysis in machine learning tasks...

Otherwise, the paper looks good!

Thank you!

Code comments:

The folder "occupacny" should probably be "occupancy"

Modified. Thank you.

I initially tried installing with Python 3.9 (as far as I can tell, the "pyproject.toml" and the docs don't specify a version). This caused problems, as cifkit is using type annotation syntax that is not supported in Python 3.9. I now see on PYPI that cifkit only supports Python 3.10, 3.11, and 3.12. Please make this more clear for users.

In pyproject.toml, I have added the following requires-python = '>=3.10, <3.13 to support 3.10, 3.11, 3.12. Per your suggestion, I have included the supported Python badges in the documentation under the Getting Started section.

In terms of code style, more extensive docstrings (for instance, explaining what the inputs and outputs should be) would be appreciated. It seems there are already some "TODO"s to this effect (see e.g., models/cif.py), so perhaps the authors are planning on making these changes soon. Function bodies are typically well commented (note that I have not read every file)

I have added numpy-style docstrings for the main classes Cif and CifEnsemble here: bobleesj/cifkit#49. I have also identified some code methods that could be refactored, as noted here: bobleesj/cifkit#53. My next step is to further refine these docstrings by consolidating them into reusable methods.

Documentation comments:

Examples in the docs aren't the most consistent. For instance, some example lines of code use "ensemble" as the variable, and some use "ensemble_test". It may also be better if the commented outputs reflected the example folder ("Example.ErCoIn_big_folder_path") so users can more easily verify correctness. Finally, incorporating the visualization tools in the docs (currently they're demonstrated only in the README) would be appreciated.

Thank you for your suggestion. I have now maintained the instance name ensemble consistently. The actual output using the Jupyter notebook has been used for all examples.

The statement of need could be made more clear in the docs

I have added the statement of need in the beginning of the docs: https://bobleesj.github.io/cifkit/.

Community guidelines basically amount to "submit a PR or talk to the lead author", but for a project at this scale, perhaps that's appropriate.

Thank you. I have created a PR template https://github.com/bobleesj/cifkit/blob/main/.github/pull_request_template.md and instructions on how to contribute here https://github.com/bobleesj/cifkit/blob/main/CONTRIBUTING.md.

Overall comments:

In terms of "substantial scholarly effort", this is a reasonably small, simple, and young project. That said, it is clearly already being used for scholarly research, and it is somewhat unique in terms of the features that it offers. As such, I think it should be considered substantial.

Thank you. For one of the projects mentioned, SAF (Structure/Composition Analyzer), which uses approximately 100 features generated using cifkit, we have a pre-print version available: https://chemrxiv.org/engage/chemrxiv/article-details/670aa269cec5d6c142f3b11a. I will continue to maintain the software for on-going research efforts.

The paper does not contain original data or results

Yes, .cif files must be provided for us to conduct analysis.

The authors do not make significant performance claims

I have added information to the documentation stating that processing approximately 10,000 .cif files on a standard laptop (iMac with M1 chip) takes about 30 to 60 minutes. At this rate, we can process nearly all .cif files within 1–2 days. However, I have decided not to include performance metrics in the manuscript. Our code may evolve further—for instance, by using matrix multiplications to compute distances, as in ASE—and performance hasn't been a major bottleneck for our research.

With some relatively small tweaks, I think this will make a good addition to JOSS.

Thank you for your review!

@bobleesj
Copy link

Response to @lancekavalsky

Code comments:

For io related methods, e.g. the ones that generate histograms, more clarity is needed regarding where they write things by default. Running the histogram tutorial from the documentation, it wrote the histograms to a folder deep in my conda environment site-packages, which would likely not be intuitive to many users (particularly as this package is presented as being catered to users with less coding experience) and may cause issues on shared resources.

Thank you for the details. Yes, the histogram images were saved to the Anaconda environment, as that was where the .cif files were provided by default in the installed package. In the docs, I have included information about options to set the output path for users for clarity:

# Optional: Specify the output directory where the .png file will be saved.
ensemble.generate_site_mixing_type_histogram(output_dir="path/to/directory")

# Optional: Call plt.show() to display the histogram on screen.
ensemble.generate_site_mixing_type_histogram(display=False)

For API doc users, I have included docstrings:

def generate_supercell_size_histogram(
    self, display=False, output_dir=None
):
    """Generate a histogram of the 'supercell_count' property from CIF files.

    This method creates a histogram based on the 'supercell_count' statistics of
    the CIF files. If 'output_dir' is specified, the histogram image (.png) will be
    saved to that directory. If 'output_dir' is not specified, the image will be saved
    to the directory specified by 'self.dir_path'.

    Parameters
    ----------
    display : bool, optional
        If True, the plot is displayed using plt.show(). Default is False.
    output_dir : str, optional
        The directory path where the histogram should be saved. If None,
        the histogram is saved in the directory defined by 'self.dir_path'.
    """

I will echo the previous comment that more docstrings would be invaluable in helping code clarity. In particular, I would urge adopting a standardized approach to this, such as Numpy, to be more in line with community standards

Per your suggestion, I have added NumPy-style docstrings to the core Cif and CifEnsemble classes. I will continue updating the documentation after modularizing some functions, such as combining the histogram generation functions into a single one to reduce verbosity.

I have opened a technical issue I encountered here

Thank you. I have addressed the issue via this PR: bobleesj/cifkit#47. To ensure compatibility, I created a test function to ensure that raw .cif files sourced from ICSD, COD can be parsed and the supercell can be generated:

@pytest.mark.parametrize(
    "cif_folder_path, expected_file_count, expected_supercell_stats",
    [
        ("tests/data/cif/sources/ICSD", 4, {216: 2, 307: 1, 320: 1}),
        ("tests/data/cif/sources/COD", 2, {519: 1, 1383: 1}),
        ("tests/data/cif/sources/MP", 2, {108: 1, 594: 1}),
        ("tests/data/cif/sources/PCD", 1, {364: 1}),
        ("tests/data/cif/sources/MS", 1, {2988: 1}),
        ("tests/data/cif/sources/CCDC", 1, {3844: 1}),
    ],
)

Documentation comments:

As the core classes for this package are Cif and CifEnsemble, more explicit explanations as to the inputs and parameters would be helpful -- especially for CifEnsemble. For example, unless I'm missing it, a comprehensive list of everything that the preprocess parameter triggers is not mentioned in the documentation.

Thank you. I have included docstrings for the Cif and CifEnsemble classes, providing explicit parameters and preprocessing triggers as shown below. While the documentation can be further refined, I believe it serves its purpose for now.

class CifEnsemble:
    def __init__(
        self,
        cif_dir_path: str,
        add_nested_files=False,
        preprocess=True,
        logging_enabled=False,
    ) -> None:
        """Initialize a CifEnsemble object, containing a collection of Cif
        objects.

        Parameters
        ----------
        cif_dir_path : str
            Path to the folder path containing .cif file(s).
        add_nested_files : bool, optional
            Option to include .cif files contained in sub-directories within cif_dir_path
            , by default False
        preprocess : bool, optional
            Option to edit .cif files before initializing each into a Cif object,
            by default True. Preprocess modifies atomic site labels in
            atom_site_label. Some site labels may contain a comma or a symbol like M
            due to atomic mixing. It reformats each atom_site_label so it can be
            parsed into an element type matching atom_site_type_symbol. For PCD
            databases, addresses in publ_author_address often have an incorrect
            format requiring manual modifications. It also relocates any ill-formatted
            files, such as those with duplicate labels in atom_site_label, missing
            fractional coordinates, or files requiring supercell generation.

        logging_enabled : bool, optional
            Option to log while pre-processing Cif objects, by default False

        Attributes
        ----------
        dir_path: str
            Path to the folder containing .cif files
        file_paths: list[str]
            List of file paths to .cif files
        cifs: list[Cif]
            List of Cif objects
        file_count: int
            Number of .cif files in the folder
        logging_enabled: bool
            Option to log while pre-processing Cif objects
        """

In the documentation there are a couple instances of general clean-up required. One example is the first box on Getting Started uses a CIF method ensemble.cif_folder_path which gives an error when run. Another example is under the CIF specific documentation which refers to a README.md for complete documentation, but it is unclear where this file is located (since that info doesn't appear to be the main README in the repo?).

I have revised the documentation to enhance clarity and personally tested each example to ensure accuracy. Additionally, I have included comments indicating the location of default Example files provided in the package for first-time users:

# In `cifkit` we provide .cif files that can be accessed through `from cifkit import Example` as shown below. For advancuser, these example .cif files are located under `src/cifkit/data` in the package.

from cifkit import Example
from cifkit import Cif

# Initialize with the example file provided
cif = Cif(Example.Er10Co9In20_file_path)
...

There are options in the Cif class to use either the by_d_min_method or by_best_methods. Please refer to the README.md for complete documentation.

I have included detailed documentations in the API. https://github.com/bobleesj/cifkit/blob/dbaf32400b70f323ba5965526193704b2613ea7b/src/cifkit/models/cif.py#L619 and https://github.com/bobleesj/cifkit/blob/dbaf32400b70f323ba5965526193704b2613ea7b/src/cifkit/models/cif.py#L682.

This is relatively minor, but on the documentation website it wasn't immediately clear to me what would be contained under the Notebooks tab at the top. Given this outlines several of the core functionalities of cifkit, I would consider renaming it to something more descriptive.

Thank you for your feedback. I have replaced the name "Notebooks" but into 4 sections: Getting started Cif CifEnsemble and API References.

The contributing guidelines at this stage are somewhat vague. For example, is a minimum code coverage required for added features?

I have included a PR request template (https://github.com/bobleesj/cifkit/blob/main/.github/pull_request_template.md) as well as the CONTRIBUTING.md for how to fork, clone, commit, etc.

Paper comments:

In the summary it mentions that this package is designed to process datasets on "the order of tens of thousands". It is not clear to me where this is coming from and what exactly causes the bottleneck for going beyond this estimate. Details regarding what determined this limitation would be helpful to judge high-throughput performance

Please kindly read (longer) comment below in response to @ml-evs point

The examples in the paper are presented with limited explanation as to what they are showing. While the comments help in the second example, and some of the methods are self-explanatory by their naming, more comments would help cement the clarity here.

Thank you for this feedback. I have added higher-level functions that are considered novel in our package and included more explanatory comments.

Overall, this work would make a great addition to JOSS pending the minor revisions described above.

Thank you for your review and recommendation!

@bobleesj
Copy link

Response to @ml-evs

While this package clearly has some useful tools (particularly the nice polyhedron visualisation), I fail to see how this tool is easier to use than pymatgen/ASE as a selling point. If it is targeted at people with limited programming/Python experience, then much more care must be taken in guiding them through the installation process (with virtual environments etc).

This is a valid comment considering most users end up using the packages that are built on top of cifkit rather than building. I've also provided conda install cifkit_env -n cifkit option in the package until conda-forge began to support Python 3.13. I will include this back when cifkit can support Python 3.13. This installation and the Getting Started sections were tested by undergraduate and PhD students of Dr. Anton Oliynyk.

Whilst "high throughput" is mentioned throughout the docs and paper, I see little in the way of e.g., assistance with parallelisation or batch processing (e.g., if I have 1m structures can I stream data without going via a file all the time?), nor any advanced error handling that could allow this package to be used in a highly automated way -- perhaps the authors could clarify what they mean with regards to high throughput in this case?

@lancekavalsky

Based on reviewer feedback and internal discussions, we concluded that our package's strength lies not in high-throughput processing (ASE and pymatgen excel at this), but in its specific features for "coordination geometry" and "atomic site analysis." These features are crucial for experimental research and are often tediously acquired from GUI-based software. Consequently, I've modified the manuscript title to "cifkit: A Python package for coordination geometry and atomic site analysis" and removed "high-throughput."

To clarify, our current implementation doesn't include batch processing. We process each .cif file individually. With approximately 1,000 atoms in the supercell, processing takes about 0.3 seconds for one file and 3,000 atoms—less than 3 seconds total. I've added to the documentation that processing roughly 10,000 .cif files on a standard laptop (iMac with M1 chip) takes about 30 to 60 minutes. At this rate, we can process nearly all .cif files within 1–2 days. However, we've decided not to include performance metrics in the manuscript. Our code may evolve further—perhaps using matrix multiplications to compute distances, as in ASE—and performance hasn't been a major bottleneck for our research.

Regarding error handling, we've processed tens of thousands of binary and ternary .cif files, primarily from PCD and thousands from ICSD. We've aimed to identify the most common errors to ensure we can use as many .cif files from the database as possible. Of course, there are some cases where .cif files lack space group operations or fractional coordinates. In such instances, we move those files into separate folders before initializing them into objects. We've also tested with COD, CCSD, and other datasets to encounter more diverse errors, which users can report later for me to address.

I do not think it is appropriate to mention test coverage explicitly in the paper; although the package does appear to be well-tested with good CI, pinned dependencies and lots of test cases, this metric can be a red herring.

Thank you for your suggestion. All dependency pinnings have been removed, and CI now supports Python 3.10, 3.11, and 3.12. Additionally, the mention of Codecov has been removed.

There are a few glaring omissions in the references that could help provide useful background to readers:

Based on your feedback, cifkit now supports CIF files directly downloaded from ICSD, COD, PCD, CCDC, and those created with Materials Studio. I have retained the hall_crystallographic_1991 citation as it introduces the original CIF format. On demand, I will modify our code to support the new version if databases begin to export .cif in v2.

The CIF framework itself is not appropriately cited (see https://www.iucr.org/resources/cif/cif2) -- the paper/documentation should also discuss which versions of the CIF standard are supported (it is a growing standard) with changes between v1 and v2. Gemmi, the library used for the underlying CIF parsing, published in JOSS already https://joss.theoj.org/papers/10.21105/joss.04200 Other potential but not mandatory references:

This is greatly appreciated. The Gemmi has been cited, along with PyVista, another JOSS paper. Additionally, citations for NumPy, SciPy, and Matplotlib have been included.

chemenv, used under the hood by pymatgen for coordination analysis https://journals.iucr.org/b/issues/2020/04/00/lo5066 (and subsequently much of the pre-existing literature for determining coordination environments).

Thank you. Added in the manuscript.

Other dedicated CIF parsing projects supported by IUCr: PyCIFRw and COD's CIF parser (with pycodcif Python package) https://journals.iucr.org/j/issues/2016/01/00/po5052/index.html There are many others listed at https://www.iucr.org/resources/cif/software

Since Gemmi provided sufficient parsing capabilities for our tasks with minor fixes done by cifkit such as such as adding “#” to the first line to ICSD .cif files, we have not incorporated additional CIF parsers.

I feel this submission is borderline on the "Substantial scholarly effort" front, given the guidelines at https://joss.readthedocs.io/en/latest/submitting.html#substantial-scholarly-effort I am not convinced by the text that the library was crucial in the three examples listed by the authors in the applications section; perhaps this could be expanded? Perhaps even the functionality from the other related packages by the author listed in the paper (SAF and CBA) could be exposed in the cifcheck namespace as (effectively) one package?

Yes, cikfit and its applications are actively used by approximately 4-5 research groups. The SAF and CBA modules each have their own descriptions in respective journal publications, making it challenging to consolidate them under a broad `cifkit' umbrella. We would prefer to serve as an 'engine' that provides advanced functionality. Therefore, we do not want to combine packages into one. However, features will continue to be developed on demand, primarily for experimental groups led by Prof. Oliynyk, both within his group and through his active collaborations.

@bobleesj
Copy link

@rkurchin How can I re-render the manuscript here? It's in the main branch of https://github.com/bobleesj/cifkit/. Thanks!

@bobleesj
Copy link

bobleesj commented Nov 12, 2024

Screenshot 2024-11-11 at 9 53 40 PM

Fixed! 10.5281/zenodo.14086943 is the doi. @rkurchin

https://zenodo.org/records/14086943

@rkurchin
Copy link

@editorialbot set 10.5281/zenodo.14086943 as archive

@editorialbot
Copy link
Collaborator Author

Done! archive is now 10.5281/zenodo.14086943

@rkurchin
Copy link

@bobleesj can you clarify what the contributions of Anton Oliynyk were in that case?

We usually do want the authors of the archive and the paper to match if possible, though I believe there can be exceptions made; @openjournals/bcm-eics can perhaps jump in in that regard.

@bobleesj
Copy link

Anton Oliynyk's contributions include not limited to conceptually designing the specific methods embedded in the applications mentioned in the manuscript. He also validated these functions using GUI-based software and also provided feedback when new bugs are encountered and how to navigate the problem together.

@bobleesj
Copy link

@bobleesj can you clarify what the contributions of Anton Oliynyk were in that case?

We usually do want the authors of the archive and the paper to match if possible, though I believe there can be exceptions made; @openjournals/bcm-eics can perhaps jump in in that regard.

Not sure if @openjournals/bcm-eics was tagged here. Since we've already made the new archive with the authors listed, we are all good!

@rkurchin
Copy link

@editorialbot recommend-accept

@editorialbot
Copy link
Collaborator Author

Attempting dry run of processing paper acceptance...

@editorialbot
Copy link
Collaborator Author

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

✅ OK DOIs

- 10.1016/j.jallcom.2023.173241 is OK
- 10.1016/j.dib.2024.110178 is OK
- 10.1021/acs.chemmater.4c01696 is OK
- 10.1088/1361-648X/aa680e is OK
- 10.1107/S010876739101067X is OK
- 10.1016/j.commatsci.2012.10.028 is OK
- 10.1107/S2052520620007994 is OK
- 10.21105/joss.01450 is OK
- 10.21105/joss.04200 is OK
- 10.1038/s41586-020-2649-2 is OK
- 10.1038/s41592-019-0686-2 is OK
- 10.1109/MCSE.2007.55 is OK
- 10.26434/chemrxiv-2024-rrbhc is OK

🟡 SKIP DOIs

- None

❌ MISSING DOIs

- None

❌ INVALID DOIs

- None

@editorialbot
Copy link
Collaborator Author

👋 @openjournals/bcm-eics, this paper is ready to be accepted and published.

Check final proof 👉📄 Download article

If the paper PDF and the deposit XML files look good in openjournals/joss-papers#6131, then you can now move forward with accepting the submission by compiling again with the command @editorialbot accept

@editorialbot editorialbot added the recommend-accept Papers recommended for acceptance in JOSS. label Nov 13, 2024
@Kevin-Mattheus-Moerman
Copy link
Member

Kevin-Mattheus-Moerman commented Nov 15, 2024

@bobleesj as AEiC for JOSS I will now help to process this submission for acceptance in JOSS. Below are some final checks I'm processing, some of which may need your attention.

Checks on repository

  • Project has OSI approved license
  • Project features contributing guidelines

Checks on review issue

  • Review completed
  • Software version tag listed here matches a tagged release

Checks on archive

  • Archive listed title and authors matches paper
  • Archive listed license matches software license
  • Archive listed version tag matches tagged release (and includes a potential v).

Checks on paper

  • Checked paper formatting
  • Check affiliations to make sure country acronyms are not used
  • Checked reference rendering
  • Checked if pre-print citations can be updated by published versions
  • Checked for typos

Remaining points:

As you can see, most seems in order, however the below are some points that require your attention 👇 :

  • Please ensure the archive listed license tag matches your software release tag: 1.0.6. You do not need to create a new archive link, you should be able to manually edit the tag when you log into ZENODO.
  • Please spell out country names in your affiliations, so do not use the acronym USA.

@bobleesj
Copy link

@Kevin-Mattheus-Moerman Thank you!

Please ensure the archive listed license tag matches your software release tag: 1.0.6. You do not need to create a new archive link, you should be able to manually edit the tag when you log into ZENODO.

Now, it is updated:

Version 1.0.6
[10.5281/zenodo.14086943](https://doi.org/10.5281/zenodo.14086943)

Please spell out country names in your affiliations, so do not use the acronym USA.

Now paper.md writes:

affiliations:
  - name:
      Department of Applied Physics and Applied Mathematics, Columbia
      University, New York, NY 10027, United States
    index: 1
  - name:
      Department of Chemistry, Hunter College, City University of New York, New
      York, NY 10065, United States
    index: 2
  - name:
      Ph.D. Program in Chemistry, The Graduate Center of the City University of
      New York, New York, NY 10016, United States

@bobleesj
Copy link

@editorialbot generate pdf

@editorialbot
Copy link
Collaborator Author

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@bobleesj
Copy link

@Kevin-Mattheus-Moerman I also created a PR openjournals/joss#1394 here since one of the example paper.md contained USA under affiliation.

@Kevin-Mattheus-Moerman
Copy link
Member

@editorialbot accept

@editorialbot
Copy link
Collaborator Author

Doing it live! Attempting automated processing of paper acceptance...

@editorialbot
Copy link
Collaborator Author

Ensure proper citation by uploading a plain text CITATION.cff file to the default branch of your repository.

If using GitHub, a Cite this repository menu will appear in the About section, containing both APA and BibTeX formats. When exported to Zotero using a browser plugin, Zotero will automatically create an entry using the information contained in the .cff file.

You can copy the contents for your CITATION.cff file here:

CITATION.cff

cff-version: "1.2.0"
authors:
- family-names: Lee
  given-names: Sangjoon
  orcid: "https://orcid.org/0000-0002-2367-3932"
- family-names: Oliynyk
  given-names: Anton O.
  orcid: "https://orcid.org/0000-0003-0732-7340"
contact:
- family-names: Lee
  given-names: Sangjoon
  orcid: "https://orcid.org/0000-0002-2367-3932"
doi: 10.5281/zenodo.14086943
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Lee
    given-names: Sangjoon
    orcid: "https://orcid.org/0000-0002-2367-3932"
  - family-names: Oliynyk
    given-names: Anton O.
    orcid: "https://orcid.org/0000-0003-0732-7340"
  date-published: 2024-11-15
  doi: 10.21105/joss.07205
  issn: 2475-9066
  issue: 103
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 7205
  title: "cifkit: A Python package for coordination geometry and atomic
    site analysis"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.07205"
  volume: 9
title: "cifkit: A Python package for coordination geometry and atomic
  site analysis"

If the repository is not hosted on GitHub, a .cff file can still be uploaded to set your preferred citation. Users will be able to manually copy and paste the citation.

Find more information on .cff files here and here.

@editorialbot
Copy link
Collaborator Author

🐘🐘🐘 👉 Toot for this paper 👈 🐘🐘🐘

@editorialbot
Copy link
Collaborator Author

🦋🦋🦋 👉 Bluesky post for this paper 👈 🦋🦋🦋

@editorialbot
Copy link
Collaborator Author

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

  1. Check final PDF and Crossref metadata that was deposited 👉 Creating pull request for 10.21105.joss.07205 joss-papers#6147
  2. Wait five minutes, then verify that the paper DOI resolves https://doi.org/10.21105/joss.07205
  3. If everything looks good, then close this review issue.
  4. Party like you just published a paper! 🎉🌈🦄💃👻🤘

Any issues? Notify your editorial technical team...

@editorialbot editorialbot added accepted published Papers published in JOSS labels Nov 15, 2024
@bobleesj
Copy link

Thanks to everyone for your valuable time! @rkurchin @espottesmith @ml-evs @lancekavalsky @Kevin-Mattheus-Moerman

@Kevin-Mattheus-Moerman
Copy link
Member

@bobleesj congratulations on this JOSS publication!

@rkurchin thanks for editing!

And a special thanks to the reviewers: @espottesmith, @ml-evs, @lancekavalsky !!

@editorialbot
Copy link
Collaborator Author

🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉

If you would like to include a link to your paper from your README use the following

code snippets

Markdown:
[![DOI](https://joss.theoj.org/papers/10.21105/joss.07205/status.svg)](https://doi.org/10.21105/joss.07205)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.07205">
  <img src="https://joss.theoj.org/papers/10.21105/joss.07205/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: https://joss.theoj.org/papers/10.21105/joss.07205/status.svg
   :target: https://doi.org/10.21105/joss.07205

This is how it will look in your documentation:

DOI

We need your help!

The Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us please consider doing either one (or both) of the the following:

@bobleesj
Copy link

@Kevin-Mattheus-Moerman

Making a small donation to support our running costs here: https://numfocus.org/donate-to-joss

Just wanted to share the donation link doesn't seem to be appearing and it has been addressed here: openjournals/joss#1396

tagging @xuanxu as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted published Papers published in JOSS Python recommend-accept Papers recommended for acceptance in JOSS. review TeX Track: 2 (BCM) Biomedical Engineering, Biosciences, Chemistry, and Materials
Projects
None yet
Development

No branches or pull requests

7 participants