Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix batching in fairseq SentencepieceTokenizer #640

Merged
merged 2 commits into from
Jan 30, 2024

Conversation

Craigacp
Copy link
Contributor

@Craigacp Craigacp commented Jan 26, 2024

This fixes #639 by moving the fairseq id patching out of the loop so it's not applied multiple times to sequences earlier in the batch.

I've added the check to ensure it's not applied when add_rev is true mirroring its position in the for loop, but I'm unsure if that's required (presumably there are no HF tokenizers which emit a reversed string and have this fairseq hack?).

I tested it against the example in #639 and both batches and single sequences work correctly now.

@Craigacp Craigacp requested a review from a team as a code owner January 26, 2024 14:08
@wenbingl
Copy link
Member

/azp run onnxruntime-extensions.CI

Copy link

No pipelines are associated with this pull request.

@wenbingl
Copy link
Member

/azp run onnxruntime-extensions.CI

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@Craigacp
Copy link
Contributor Author

I ran the Python tests locally on macOS ARM64 and everything passed. The access violation in the Windows tests happens either in test_cliptok or test_cv2 both of which passed fine on my machine and don't use the code I changed, and they also passed fine in Linux and macOS in the CI. Is there anything more I can do to run the failure down?

@wenbingl
Copy link
Member

@sayanshaw24 , can you review this PR?

Copy link
Contributor

@sayanshaw24 sayanshaw24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! missed the !add_rev check in the if-statement - this looks good, thanks for this.

@sayanshaw24 sayanshaw24 merged commit d47a3dd into microsoft:main Jan 30, 2024
34 checks passed
@Craigacp Craigacp deleted the fairseq-batching-fix branch January 30, 2024 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fairseq support in SentencepieceTokenizer doesn't work with batches
3 participants