Those funny characters in the node titles sure do get annoying - near impossible to search for (and actually impossible to type in some cases).
One possibility that needs no changes right now is to have two (or more) e2nodes for each of these. One of these nodes is designed to be searchable; the other carries the correct title. The searchable one gets a link to the proper title and is then locked.
The problem with this? It's a ton of work to find all of those nodes and make searchable nodeshells for them. Let's just say it would be difficult at best, and keeping proper titles paired with searchable ones would mean constant maintenance.
There is another possibility (other than giving everyone a way to easily search for nodes that one can't type).
Currently, the title of an E2 node is a unique identifier within the database. Searches are done against this title. What if this title could be both correct, and searchable at the same time? In effect two titles.
What, you say?! It's not as whack as it seems at first.
When an E2 node is created, or its title is updated, the update also creates a 7-bit ASCII version of the title that is hidden from everything but search. The impact on server load is close to nil. All searches are done against this hidden title. (Un)fortunately, it will not be a unique title - and in theory it should be an indexed column.
When doing a search, the same translation is applied to the search words. This is a slight hit. In theory, the search key will have a number of s/// substitutions performed upon it, so it would likely be best to have it studied first (on Perl before 5.16, that is; in later versions study is a no-op). Whatever the case, you get something like this:
sub deFunk {
    my ($words) = @_;
    # Fold accented e's, upper or lower case, to a plain lowercase 'e'.
    $words =~ s/è/e/g;
    $words =~ s/È/e/g;
    $words =~ s/é/e/g;
    $words =~ s/É/e/g;
    $words =~ s/ê/e/g;
    $words =~ s/Ê/e/g;
    return $words;
}
OK, you say - so we translated some high ASCII to 7-bit ASCII. Big deal - MySQL is supposed to be able to do that. Well, yes, it does for some of them. But it doesn't for all of them - most notably the ö, which gives it major headaches. But wait - there is more. Not only can we solve the high ASCII problem, we can solve the hyphen problem too. To the search engine, 'foo-bar' is not the same as 'foo bar'. Adding the line
$words =~ s/-/ /g;
will make it so that searching for 'foo' or 'bar' will find 'foo-bar'.
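For what it's worth, the whole chain of substitutions, hyphen included, can be collapsed into a single tr/// pass. This is only a sketch: the name search_key and the handful of characters covered are my own picks for illustration, not anything in the E2 code, and it assumes titles arrive as decoded character strings rather than raw bytes.

```perl
use strict;
use warnings;
use utf8;    # the source below contains literal accented characters

# search_key (hypothetical name): fold a title, or a query, into the
# 7-bit form that search would compare against. Only a few characters
# are covered here; a real table would be much longer.
sub search_key {
    my ($words) = @_;

    # Each character on the left becomes the character in the same
    # position on the right; the escaped hyphen becomes a space.
    $words =~ tr/èÈéÉêÊöÖ\-/eeeeeeoo /;

    return $words;
}

print search_key('Gödel-Escher'), "\n";    # prints "Godel Escher"
```

The same function is meant to run on both sides: once when a title is saved, and once on whatever the user types into the search box.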
Earlier I glossed over the unique thingy. When doing a search, if one and only one node comes back, then it is displayed. If the search finds two or more 'exact' matches, it should display the one whose real title matches the text entered. If none of them matches the text entered, display a findings page as normal.
In this case, the challenge is no longer maintaining two sets of nodes for every title - one proper and one not - but simply keeping the one proper title. It won't matter how people link to it, with the funny character or without. If there is one, and only one, title that properly matches - it will be found.
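The display rule can be sketched in a few lines. The name resolve_search and the calling convention (the typed text plus a list of real titles whose hidden search title matched) are assumptions made for the example; how E2 would actually hand these over is not something I am claiming here.

```perl
use strict;
use warnings;

# resolve_search (hypothetical name).
#   $entered    - the text exactly as the user typed it
#   $candidates - array ref of real titles whose search title matched
# Returns the title to display, or undef meaning "show the findings page".
sub resolve_search {
    my ($entered, $candidates) = @_;

    # One and only one node came back: display it.
    return $candidates->[0] if @$candidates == 1;

    # Several came back: prefer the one whose real title is what was typed.
    for my $title (@$candidates) {
        return $title if $title eq $entered;
    }

    # No candidates, or none matching the typed text: fall back to findings.
    return;
}
```

So a search for 'foo bar' with only 'foo-bar' in the database goes straight to 'foo-bar', while the same search with both 'foo-bar' and 'foo bar' present picks 'foo bar'.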
In theory, one could also deFunk() HTML entities. This would mean that there would be a few more matches to do (though you don't have to worry about case in this case). Remember, where study is in effect, it adds a bit of overhead at the start but drastically cuts down on the cost of matches done against the string later. The advantage here is that if a person (wrongly) put an HTML entity in a link title without a pipelink, it would still find the correct node. One word of warning: nodes should never be created with HTML entities in the title (at least in cases where the entity stands for an otherwise 'normal' ASCII character). Allowing such would only create headaches down the road, though this could be addressed by a deJazz() function that puts the node title into its proper form.
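A rough sketch of the entity side of this, under some loud assumptions: defunk_entities is a made-up name, the entity table is a tiny hand-picked sample (HTML defines far more), and numeric entities are only decoded when they stand for printable ASCII. Anything it does not recognise is left alone.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny table: entity name => 7-bit stand-in.
my %ENTITY = (
    amp    => '&',
    eacute => 'e',
    egrave => 'e',
    ecirc  => 'e',
    ouml   => 'o',
);

# defunk_entities (hypothetical name): fold HTML entities in a search
# string down to plain ASCII in a single pass.
sub defunk_entities {
    my ($words) = @_;

    $words =~ s{(&(?:#(\d+)|([A-Za-z]+));)}{
        my ($whole, $num, $name) = ($1, $2, $3);
        defined $num
            # Numeric entity: decode printable ASCII only, e.g. &#45;.
            ? ( $num >= 32 && $num <= 126 ? chr($num) : $whole )
            # Named entity: case is ignored, so &Eacute; folds like &eacute;.
            : ( exists $ENTITY{ lc $name } ? $ENTITY{ lc $name } : $whole );
    }ge;

    return $words;
}
```

Doing it in one pass means '&amp;#45;' comes out as '&#45;' rather than being decoded twice. The result would then go through the same hyphen and accent folding as any other search text.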