
Query String Manipulation #30

Open
GitHub-Mike opened this issue Dec 9, 2024 · 13 comments

Comments

@GitHub-Mike

GitHub-Mike commented Dec 9, 2024

I want to create a static 1:1 copy of a Joomla website and manipulate the query strings so that I can redirect them correctly via the .htaccess.

example.com/?foo=bar -> example.com/foo_bar/

I know that this tool was primarily created for SEO purposes, but it would be nice if there was a solution for this.

Thanks again for making this tool available to the public, but I think it's a shame that there is no community support. Discord and Reddit are not usable and there have been no answers here for a long time.

@janreges
Owner

janreges commented Dec 9, 2024

Hi @GitHub-Mike,

The tool was not created solely for SEO; the aim is for it to be usable for crawling and analyses of all kinds, and at the same time to be able to export a website into an offline form.

Could you please be more specific about how it should work and what specifically the tool should do differently for your purpose (e.g. based on some new --flag)?

@GitHub-Mike
Author

Thanks for the quick reply.

I would like to relaunch a Joomla website with WordPress. Since a migration of the entire content is not possible without losses for various reasons, I would like to make a large part of the pages available as static pages. As these pages are in the Google index, I need to make them available again via a redirect (mod_rewrite). However, this only works if I have a pattern.

It concerns, for example, paginated pages such as:

  • /path/news
  • /path/news?start=1
  • /path/news?start=2
  • /path/news?start=3

Currently, these become:

  • news.php.html
  • news.00a4d8f4b7.php.html
  • news.00fcacec69.php.html
  • news.1af0fb8aef.php.html

This is of no use to me, as it gives me no pattern for rewriting the URL, and the .php.html ending is also unsightly. I would therefore like to have something like this:

  • news
  • news__start-1
  • news__start-2
  • news__start-3
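With a predictable naming scheme like this, the redirect becomes an ordinary mod_rewrite rule. A minimal .htaccess sketch, assuming the static copy sits in the same document root (paths, pattern, and flags are illustrative, not output of the crawler):

```apache
RewriteEngine On

# Old dynamic URL:  /path/news?start=3
# Static file name: /path/news__start-3
RewriteCond %{QUERY_STRING} ^start=([0-9]+)$
# The trailing '?' drops the original query string from the target URL.
RewriteRule ^path/news$ /path/news__start-%1? [R=301,L]
```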

There is already a little-documented --replace-content parameter. You could build on this.

My suggestion would be something that also works with multiple key/value pairs:

--rewrite-query-structure='/([^&]+)=([^&]*)(&|$)/' -> '$1-$2_'

The last underscore should be removed at the end.
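The proposed rule can be sketched in Python to check its behavior (the function name and the trailing-underscore handling are my assumptions, not part of the crawler):

```python
import re

def rewrite_query(query: str) -> str:
    """Apply the proposed rule: each key=value pair becomes key-value,
    pairs are joined with '_', and the trailing '_' is removed."""
    rewritten = re.sub(r'([^&]+)=([^&]*)(&|$)', r'\1-\2_', query)
    return rewritten.removesuffix('_')

print(rewrite_query('start=3'))          # start-3
print(rewrite_query('foo=bar&baz=qux'))  # foo-bar_baz-qux
```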

To start with, it would also be sufficient if the special characters were simply replaced statically:

? -> __
= -> -
& -> _
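The static mapping above can be sketched in a few lines (a minimal illustration, not the crawler's implementation):

```python
# Static one-character replacements proposed above.
QS_MAP = {'?': '__', '=': '-', '&': '_'}

def flatten_url(url: str) -> str:
    """Replace query-string delimiters with filename-safe sequences."""
    for old, new in QS_MAP.items():
        url = url.replace(old, new)
    return url

print(flatten_url('/path/news?start=3'))  # /path/news__start-3
```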

What do you think of this and is it feasible?

@janreges
Owner

janreges commented Dec 9, 2024

@GitHub-Mike, I understand now, thank you. The main reason why I decided to replace query parameters with a short but sufficiently unique hash is the limit on the length of a filename (or of the whole file path) on disk.

By default, the query string and the overall URL can be up to 2,000 characters, and in some browsers/technologies up to 32,767 characters/bytes. File names on Windows/Linux/macOS are limited to 255 characters/bytes, and a full path is limited to 260 characters on Windows and to between 1,024 and 4,096 characters/bytes on Linux or macOS.

Another reason is that the set of characters and escaping rules supported in a query string is very broad, while the set of characters that can be used across different operating systems and filesystems is very limited. Creating a truly reliable set of replacement rules would be very time-consuming.
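The hashing behavior described here can be approximated like this (the exact hash function and digest length the crawler uses are assumptions; the output format merely mimics the news.00a4d8f4b7.php.html examples above):

```python
import hashlib

def hashed_filename(path: str, query: str) -> str:
    """Collapse an arbitrarily long query string into a short,
    filesystem-safe hash, sidestepping the 255-byte filename limit."""
    base = path.rstrip('/').rsplit('/', 1)[-1] or 'index'
    if not query:
        return f'{base}.html'
    digest = hashlib.sha1(query.encode()).hexdigest()[:10]
    return f'{base}.{digest}.html'

print(hashed_filename('/path/news', 'start=2'))
```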

But to help you, I will introduce a new flag that disables this hashing and only performs basic replacement of ?, & and =. The replacements will be little-used characters that rarely appear in URLs but can be used in filenames on most platforms/filesystems. That will make it possible to implement the URL rewrite as you plan.

However, if you (or anyone else using it) have slashes or other special characters in the query string that are not supported in file names on the given operating system/filesystem, this will cause problems when saving pages or when viewing them offline. The same applies to URLs and query strings that are too long. I will also note in the documentation that this feature should be used with caution.

After some brief research, it seems the characters below would work; they are usable on all common platforms. The characters you suggested are all very commonly used in URLs or query strings.

contact?foo=123&bar=456 -> contact!foo(123)bar(456.html

? => !
= => (
& => )

@janreges
Owner

janreges commented Dec 9, 2024

I will also consider the option to let the user define the replacements himself, e.g. using:

--offline-url-replacement='?>! =>( &>)'. So the pattern is FROM1>TO1[space]FROM2>TO2.

For separating rules, a space is more convenient and less conflicting than, for example, a comma.

This could be the most universal option. The recommended example above will be mentioned in the documentation; the user can then substitute different characters for their own site if the recommended ones would cause problems when saving or browsing it.
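A parser for this space-separated FROM>TO syntax could look like the following (the option name and parsing details are assumptions about a feature still being designed):

```python
def parse_replacements(spec: str) -> list[tuple[str, str]]:
    """Parse e.g. '?>! =>( &>)' into [('?', '!'), ('=', '('), ('&', ')')].
    Each rule is FROM>TO; the first '>' separates the two sides."""
    rules = []
    for rule in spec.split():
        frm, _, to = rule.partition('>')
        rules.append((frm, to))
    return rules

def apply_replacements(text: str, spec: str) -> str:
    for frm, to in parse_replacements(spec):
        text = text.replace(frm, to)
    return text

print(apply_replacements('contact?foo=123&bar=456', '?>! =>( &>)'))
# contact!foo(123)bar(456
```

Note that this reproduces the contact!foo(123)bar(456 example from the earlier comment.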

What do you think?

@janreges
Owner

janreges commented Dec 9, 2024

I read your message again and my last suggestion is very similar to yours: --rewrite-query-structure='/([^&]+)=([^&]*)(&|$)/' -> '$1-$2_'.

Your regular-expression solution is even more versatile, though more complex for the casual user. The possibility will be there: the casual user will have to try a bit harder, while the advanced user can implement more complex scenarios with the help of regexps.

I therefore vote for the option you suggested, i.e. the --rewrite-query-structure switch, which can be specified repeatedly to define multiple replacement rules. They will be executed in the order in which they are listed on the CLI.

Do you agree?

@GitHub-Mike
Author

! ( ) are not a good idea, because these are "reserved characters" and need percent-encoding. See: https://datatracker.ietf.org/doc/html/rfc3986#section-2.2
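This is easy to verify: RFC 3986 lists ! ( ) among the sub-delims, and a conservative standard-library URL encoder percent-encodes them because they are not in its always-safe set. For example, in Python:

```python
from urllib.parse import quote

# '!', '(' and ')' are RFC 3986 reserved (sub-delims) characters,
# so quote() percent-encodes them by default.
print(quote('!'))  # %21
print(quote('('))  # %28
print(quote(')'))  # %29
```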

> I will also consider the option to let the user define the replacements himself, ...

This is a very useful decision, because you can never foresee all eventualities.

> For separating rules, a space is more convenient and less conflicting than, for example, a comma.

Yes, that would work.

@GitHub-Mike
Author

> Your regular-expression solution is even more versatile, though more complex for the casual user. The possibility will be there: the casual user will have to try a bit harder, while the advanced user can implement more complex scenarios with the help of regexps.

Yes, that's how I would assess it too.

> I therefore vote for the option you suggested, i.e. the --rewrite-query-structure switch, which can be specified repeatedly to define multiple replacement rules. They will be executed in the order in which they are listed on the CLI.

OK, this should also be included in the documentation with an example. Perhaps the --replace-content parameter could then also be explained.

janreges added a commit that referenced this issue Dec 10, 2024
… to replace the default behavior where the query string is replaced by a short hash constructed from the query string in filenames, see issue #30
@janreges
Owner

janreges commented Dec 10, 2024

@GitHub-Mike can you please try the current version from main branch with this commit?

The description of the parameter can be found only in README.md. It will be available on the website after the next release.

Usage examples (simple and regexp):

  • --replace-query-string='= -> (' --replace-query-string='& -> )'
  • --replace-query-string='/([^&]+)=([^&]*)(&|$)/ -> $1-$2_'
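Assuming the simple form is applied as a plain substring replacement and the /…/ form as a regex (my reading of the examples, not confirmed behavior), the two invocations above would transform a query string like this:

```python
import re

query = 'foo=123&bar=456'

# Simple form: --replace-query-string='= -> (' --replace-query-string='& -> )'
simple = query.replace('=', '(').replace('&', ')')
print(simple)  # foo(123)bar(456

# Regexp form: --replace-query-string='/([^&]+)=([^&]*)(&|$)/ -> $1-$2_'
regexp = re.sub(r'([^&]+)=([^&]*)(&|$)', r'\1-\2_', query)
print(regexp)  # foo-123_bar-456_
```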

@GitHub-Mike
Author

As already mentioned in janreges/siteone-crawler-gui#3, I made a pull request today and then ran a crawl with the new parameter:

--replace-query-string='= -> -' --replace-query-string='& -> _' --replace-query-string='? -> __'

Result: /path/news?start=3 --> /path/news.start-3.php.html

Even though ? was not replaced by __, a pattern can still be formed.

I will test the regex tomorrow.

However, there are a few other problems with incorrect paths that I will have to analyse tomorrow.

@janreges
Owner

During implementation I realized that ? is only the leading delimiter of the query string; the query string itself does not contain ? at all, only the key=value pairs concatenated with &. Therefore there is no need to replace ?. By the way, if someone uses the foo[]=bar array notation in the query string, they can also replace [ and ].

But I understand that if you want to define mod_rewrite rules, a replacement character for ? would be useful. If this is necessary, let me know: I'll make sure the ? is there (if the query string is not empty) so it can be replaced as you need. I don't think anyone is using this new feature yet, so we can still afford this BC break.
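The observation that the parsed query string never contains the leading ? can be checked with any standard URL parser, e.g.:

```python
from urllib.parse import urlsplit

parts = urlsplit('/path/news?start=3&foo[]=bar')
print(parts.path)   # /path/news
print(parts.query)  # start=3&foo[]=bar  (no leading '?')
```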

@GitHub-Mike
Author

GitHub-Mike commented Dec 12, 2024

I am of the opinion that you always have to find a compromise between flexibility and simplicity. My suggestion would therefore be to leave the functionality of the --replace-query-string parameter as it is and to address another point that I have already briefly mentioned above.

My wish would be an option for omitting or configuring the file extensions. So instead of /path/news.start-3.php.html I would like to have /path/news.start-3. And not only for URLs with query strings: the file extension should be omissible or configurable for all URLs in general.

I assume you split the URL anyway; you could then also reassemble it configurably with variables. However, adjusting the file extension would be enough for now. Of course, that could also solve the ? -> __ problem as well. Should I create a separate issue for this?

During yesterday's run I also noticed a few errors when rewriting path names, which then caused 404 errors. However, I don't know whether these were caused by changes made in the course of this issue or by your general changes to the code. This also raises the question of whether to create a separate issue; I thought the problem had only just appeared, but it already existed before, so I'll open a separate issue.

What do you think about my comments?

@janreges
Owner

On the website crawler.siteone.io you will find in the roadmap that my goal is to make it possible (e.g. through some --flag) to generate the exported website in such a form that simple rules in Nginx, or mod_rewrite in Apache, can keep such a website working, ideally on the original URLs, just as with the original site. Alternatively, the site could be run locally with a mini webserver such as binserve, spark, or tiny-http.

My goal is also to be able to use this tool in CI/CD pipelines to run automatically maintained static copies of a site, as a fallback in case the dynamically generated site fails.

Please open a separate issue on this topic. I believe that together we will be able to implement it in the next few days.

The way the export works now is really aimed at offline browsing from the local disk. There the files must have the *.html extension, otherwise it does not work. And if the site also uses URLs with query strings, that's another complication.

@GitHub-Mike
Author

> My goal is also to be able to use this tool in CI/CD pipelines to run automatically maintained static copies of a site, as a fallback in case the dynamically generated site fails.

OK, that would be a good use case. In combination with a health-check service, automatic failover could then also be implemented.

> Please open a separate issue on this topic. I believe that together we will be able to implement it in the next few days.

OK, I think if we keep at it now, we can take the project to another level and create a very useful tool for building static websites.

With the speed of the run yesterday, you could even use it for stress tests. :-)

> The way the export works now is really aimed at offline browsing from the local disk. There the files must have the *.html extension, otherwise it does not work.

OK, I hadn't realised until now that the focus was on "offline" and that the migration of dynamic websites to a static version was not the goal at all. But I could have guessed from the parameter names. :-)

> And if the site also uses URLs with query strings, that's another complication.

Yes, but by integrating the query string into the file name, things actually become simpler. In the end, the static version of the page is a page like any other, which just needs its own file name. Of course, care must be taken that the links are correct.
