[Data Liberation] Expose experimental Markdown importer in the importWxr step #2080

Closed
wants to merge 13 commits into from

@adamziel adamziel commented Dec 13, 2024

🚧 Work in progress, don't merge 🚧

Enables importing markdown and epub files via the importWxr step (to be renamed) when the data-liberation importer is enabled.

CleanShot.2024-12-13.at.21.17.10.mp4

Here's the Blueprint you can use to import the "data basics" tutorial from the Gutenberg repo:

{
    "$schema": "https://playground.wordpress.net/blueprint-schema.json",
    "landingPage": "/adding-a-delete-button/",
    "features": {
        "networking": true
    },
    "steps": [
        {
            "step": "resetData"
        },
        {
            "step": "importWxr",
            "importer": "data-liberation",
            "phpImporterOptions": {
                "data_source": "markdown_directory",
                "source_site_url": "https://raw.githubusercontent.com/WordPress/gutenberg/HEAD/docs/how-to-guides/data-basics"
            },
            "importData": {
                "resource": "git:directory",
                "url": "https://github.com/WordPress/gutenberg.git",
                "ref": "HEAD",
                "path": "docs/how-to-guides/data-basics"
            }
        }
    ]
}
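
In the Blueprint above, `source_site_url` looks like the raw.githubusercontent.com counterpart of the `importData` git reference (same repository, ref, and path). Here is a minimal sketch of deriving one from the other, assuming that relationship holds. `sketch_raw_github_url` is a hypothetical helper for illustration, not part of Playground or the Data Liberation importer.

```php
<?php
/**
 * Hypothetical helper: builds a raw.githubusercontent.com URL from the
 * same pieces the Blueprint passes to the git:directory resource.
 */
function sketch_raw_github_url( string $git_url, string $ref, string $path ): string {
    // Accept https://github.com/<owner>/<repo> with an optional ".git" suffix.
    if ( ! preg_match( '~^https://github\.com/([^/]+)/([^/]+?)(?:\.git)?/?$~', $git_url, $m ) ) {
        throw new InvalidArgumentException( 'Not a GitHub HTTPS URL: ' . $git_url );
    }
    return sprintf(
        'https://raw.githubusercontent.com/%s/%s/%s/%s',
        $m[1],             // owner
        $m[2],             // repository name without ".git"
        $ref,              // branch, tag, or HEAD
        trim( $path, '/' ) // directory inside the repository
    );
}

$url = sketch_raw_github_url(
    'https://github.com/WordPress/gutenberg.git',
    'HEAD',
    'docs/how-to-guides/data-basics'
);
echo $url . "\n";
```

Deriving the value this way would keep the two options from drifting apart when the `ref` or `path` changes.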

Requires WordPress/blueprints-library#121

Other code examples

Combining the new importer APIs is getting ridiculous. Here are two entity readers:

  • The first one sources posts, meta, etc. from XHTML files stored inside a remote .epub file
  • The second one sources posts, meta, etc. from markdown files in a local .zip file

We can mix and match data sources (local filesystem, remote), formats (e.g. md, xhtml, wxr), and containers (plain, .zip, git in the future).

// Reader 1: XHTML files inside a remote .epub file.
// An .epub is a zip archive, hence the WP_Zip_Filesystem wrapper.
$reader = WP_Directory_Tree_Entity_Reader::create(
    new WP_Zip_Filesystem(
        WP_Remote_File_Ranged_Reader::create(
            'https://github.com/IDPF/epub3-samples/releases/download/20230704/childrens-literature.epub'
        )
    ),
    array(
        'root_dir' => '/EPUB',
        'first_post_id' => 1,
        'allowed_extensions' => array( 'html', 'xhtml' ),
        'index_file_patterns' => array( '#^index\.x?html$#' ),
        'markup_converter_factory' => function( $content ) {
            return new WP_HTML_To_Blocks( $content );
        },
    )
);

// Reader 2: Markdown files inside a local .zip file.
$reader = WP_Directory_Tree_Entity_Reader::create(
    new WP_Zip_Filesystem(
        WP_File_Reader::create( __DIR__ . '/../docs.zip' )
    ),
    array(
        'root_dir' => '/',
        'first_post_id' => 1,
        'allowed_extensions' => array( 'md' ),
        'index_file_patterns' => array( '#^index\.md$#' ),
        'markup_converter_factory' => function( $content ) {
            return new WP_Markdown_To_Blocks( $content );
        },
    )
);
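
The two readers above differ only in the filesystem they wrap and the options they pass. To make that composition concrete, here is a self-contained sketch of the pattern. It uses hypothetical stand-ins (`Sketch_Filesystem`, `Sketch_Array_Filesystem`, `sketch_read_entities`), not the Data Liberation classes. It also assumes that `index_file_patterns` is matched against the file's base name and that IDs count up from `first_post_id`. The real reader may behave differently.

```php
<?php
// Hypothetical stand-ins for illustration only; none of these names
// exist in the Data Liberation package.
interface Sketch_Filesystem {
    /** @return array Map of file path => file contents. */
    public function files(): array;
}

final class Sketch_Array_Filesystem implements Sketch_Filesystem {
    private $files;

    public function __construct( array $files ) {
        $this->files = $files;
    }

    public function files(): array {
        return $this->files;
    }
}

/**
 * Walks the filesystem, keeps files with an allowed extension, flags
 * index files, and passes each file's contents through the converter
 * factory.
 */
function sketch_read_entities( Sketch_Filesystem $fs, array $options ): array {
    $entities = array();
    $next_id  = $options['first_post_id'];
    foreach ( $fs->files() as $path => $content ) {
        $extension = pathinfo( $path, PATHINFO_EXTENSION );
        if ( ! in_array( $extension, $options['allowed_extensions'], true ) ) {
            continue; // e.g. images are not turned into posts in this sketch
        }
        $is_index = false;
        foreach ( $options['index_file_patterns'] as $pattern ) {
            if ( 1 === preg_match( $pattern, basename( $path ) ) ) {
                $is_index = true;
                break;
            }
        }
        $entities[] = array(
            'post_id'  => $next_id++,
            'path'     => $path,
            'is_index' => $is_index,
            'content'  => call_user_func( $options['markup_converter_factory'], $content ),
        );
    }
    return $entities;
}

$entities = sketch_read_entities(
    new Sketch_Array_Filesystem(
        array(
            'index.md'       => '# Home',
            'guide/intro.md' => '# Intro',
            'guide/logo.png' => 'binary',
        )
    ),
    array(
        'first_post_id'            => 1,
        'allowed_extensions'       => array( 'md' ),
        'index_file_patterns'      => array( '#^index\.md$#' ),
        'markup_converter_factory' => function ( $content ) {
            return strtoupper( $content ); // placeholder for a real converter
        },
    )
);
```

Because the reader only depends on the filesystem interface and a converter callback, swapping the container or the format means changing one constructor argument or one option.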

Remaining work

  • Confirm the WXR import still works both for the regular importer and the data liberation one
  • Add E2E coverage
  • Rewrite relative markdown URLs
  • Enable specifying additional URL mappings directly in the Blueprint
  • Review the code and make any architectural adjustments necessary
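
For the "Rewrite relative markdown URLs" item, one possible approach is to resolve each relative link against a base URL such as `source_site_url`. This is only a sketch of that idea, not the implementation planned in this PR. `sketch_resolve_relative_url` and `sketch_rewrite_markdown_links` are hypothetical names, `example.com` is a placeholder, and the code handles inline `[text](url)` links only, treats the base URL as a directory, and ignores ports.

```php
<?php
/** Hypothetical helper: resolves $relative against $base_url. */
function sketch_resolve_relative_url( string $base_url, string $relative ): string {
    // Leave empty values, absolute URLs, protocol-relative URLs, and
    // in-page anchors untouched.
    if ( '' === $relative || preg_match( '~^([a-z][a-z0-9+.-]*:|//|#)~i', $relative ) ) {
        return $relative;
    }
    $parts     = parse_url( $base_url );
    $origin    = $parts['scheme'] . '://' . $parts['host'];
    $base_path = isset( $parts['path'] ) ? $parts['path'] : '/';
    $path      = '/' === $relative[0]
        ? $relative
        : rtrim( $base_path, '/' ) . '/' . $relative;

    // Collapse "." and ".." segments.
    $segments = array();
    foreach ( explode( '/', $path ) as $segment ) {
        if ( '' === $segment || '.' === $segment ) {
            continue;
        }
        if ( '..' === $segment ) {
            array_pop( $segments );
            continue;
        }
        $segments[] = $segment;
    }
    return $origin . '/' . implode( '/', $segments );
}

/** Hypothetical helper: rewrites inline markdown link targets. */
function sketch_rewrite_markdown_links( string $markdown, string $base_url ): string {
    return preg_replace_callback(
        '~\]\(([^)\s]+)\)~',
        function ( $m ) use ( $base_url ) {
            return '](' . sketch_resolve_relative_url( $base_url, $m[1] ) . ')';
        },
        $markdown
    );
}

$rewritten = sketch_rewrite_markdown_links(
    'See [next](./2-building.md), [up](../README.md), and [ext](https://wordpress.org/).',
    'https://example.com/docs/data-basics'
);
echo $rewritten . "\n";
```

A regex pass like this would miss reference-style links and links inside code spans, so a real implementation would more likely work on the parsed document.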

@adamziel
Collaborator Author

This PR needs to be split into smaller parts before merging. The new vendor libraries will certainly become a separate PR, and the Epub and HTML importers probably will, too.

adamziel added a commit that referenced this pull request Dec 17, 2024
Adds a forked version of the markdown parsing libraries required by the
upcoming Markdown importer. We need our own fork for PHP 7.2
compatibility. The downgrade process was performed semi-automatically
via Rector.

This PR adds the following libraries:

* `league/commonmark`
* `webuni/front-matter`

There are no testing steps here. This PR only adds new code without
modifying the existing one.

A part of:

* #2080
* #1894
adamziel added a commit that referenced this pull request Dec 17, 2024
Moves the Markdown importer to a `data-liberation-markdown` package so
that it can be shipped as a separate `.phar` file and downloaded only
when needed.

## Testing instructions

This only moves code around. To test, confirm the CI PHP unit tests keep
working.

A part of:

* #2080
* #1894
adamziel added a commit that referenced this pull request Dec 17, 2024
Builds data-liberation-markdown.phar.gz (200KB) to enable downloading the
Markdown importer only when needed instead of on every page load.

A part of:

* #2080
* #1894

## Testing instructions

Run `nx build playground-data-liberation-markdown`, confirm it finished
without errors. A smoke test of the built phar file is included in the
build command.
@adamziel adamziel force-pushed the expose-markdown-importer branch from f522d40 to 4a31689 Compare December 17, 2024 13:35
@adamziel
Collaborator Author

I'm going to close this PR. I've reorganized it as a series of smaller ones that we can discuss granularly:

After all the API changes, I'm no longer sure setting up the importer in blueprint.json in the way proposed in this PR will stand the test of time. Let's land all the plumbing from the above PRs and then discuss the public API in a dedicated discussion.

@adamziel adamziel closed this Dec 17, 2024
zaerl pushed a commit that referenced this pull request Jan 8, 2025
Sets the stage for the EPub importer. A part of #2080.

Refactors and cleans up the Data Liberation package. This includes
renaming, reorganizing file paths, improving class structure, and
removing deprecated/unused code.

## Key Changes

**Refactor:**
- Renamed `WP_WXR_Reader` to `WP_WXR_Entity_Reader` for consistency and
clarity.
   - Adjusted references in related classes, tests, and imports.
- Moved `byte-readers` to the Blueprints library (see
WordPress/blueprints-library#121)

**Cleanup:**
- Deleted unused and redundant byte reader classes (`WP_Byte_Reader`,
`WP_File_Reader`, etc.).
   - Removed legacy files such as `WXR_Import_Info`.

**New Additions:**
- Added `WP_Directory_Tree_Entity_Reader` to improve handling of
directory tree imports.
- Introduced `WP_Import_HTML_Processor` for better HTML import
functionality.

## Testing instructions

Confirm the CI tests passed