Stripping Diacritics in Qt

As someone dealing with languages other than English, I quite often need to deal with diacritics. You know, characters such as č, ć, đ, š, ž, and similar. Even in English texts I can sometimes see them, e.g. voilà. And yes, quite often you can omit them and still have understandable text. But often you simply cannot because it changes the meaning of the word.

One place where this often bites me is search. It's really practical for search to be accent-insensitive since that allows me to use an English keyboard even though I am searching for content in another language. Search that ignores diacritics would be awesome.

And over time I implemented something like that in all my apps. As I am building a new QText, the time came to implement it in C++ with a Qt flavor. And, unlike C#, C++ was not really built with a lot of internationalization in mind.

The solution comes from none other than the now late Michael S. Kaplan. While his blog was deleted by Microsoft (great loss!), there are archives of his work still around - courtesy of people who loved his work. His solution was in C# (that's how I actually remembered it - I already needed that once) and it was beautifully simple. Decompose the Unicode string, remove the non-spacing mark characters, and finally combine what's left back into a Unicode string.

In Qt's C++, that would be something like this:

Code
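#include <QString>

// Returns a copy of text with combining (non-spacing) marks removed.
QString stripDiacritics(const QString &text) {
    // Decompose, so that e.g. č becomes c followed by a combining caron.
    const QString formD = text.normalized(QString::NormalizationForm_D);

    QString filtered;
    filtered.reserve(formD.length());
    for (const QChar ch : formD) {
        // Drop the combining marks, keep everything else.
        if (ch.category() != QChar::Mark_NonSpacing) {
            filtered.append(ch);
        }
    }

    // Recompose whatever is left.
    return filtered.normalized(QString::NormalizationForm_C);
}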

2 thoughts to “Stripping Diacritics in Qt”

  1. Stumbled over this while hunting down a solution to my problem. Yours is a rather elegant solution, but what about characters like æ, ø and œ? Ideally I would love for those to be “decomposed” into a and o, respectively (not necessarily ae and oe). A test string like “Rød grød med fløde” is completely unaffected by the normalization. Of course I could hardcode it but I’d rather avoid going down this road.

    1. Unfortunately, Unicode rules have rather limited decomposition and neither æ nor œ is covered (the same goes for ø). Thus the “brute force” way would be the only way to go. :(
