C++ Unicode: Bytes, Code Points and Graphemes

Question:

So, I'm building a scripting language and one of my goals is convenient string operations. I tried some ideas in C++.

  • Strings as sequences of bytes, plus free functions that return vectors containing the code-point indices.
  • A wrapper class that combines a string and a vector containing the indices.
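For illustration, the first idea (a free function returning code-point indices) could look roughly like the sketch below. The name code_point_offsets is hypothetical, and the function assumes the input is already well-formed UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): returns the byte offset
// at which each code point starts. Assumes `s` holds well-formed UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx, so every byte that does
// NOT match that pattern begins a new code point.
std::vector<std::size_t> code_point_offsets(std::string const& s) {
    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

The resulting vector gives O(1) lookup of the n-th code point's starting byte, at the cost of one std::size_t per code point on top of the string itself.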

Both ideas had the same problem: what should I return? It couldn't be a char, and a string would waste space.

I ended up creating a wrapper class around a char array of exactly 4 bytes: a string that occupies exactly 4 bytes in memory, no more, no less.

After creating this class, I felt tempted to wrap a std::vector of them in another class and build from there, making a string type of code points. I don't know if this is a good approach: it would be much more convenient, but it would waste more space.

So, before posting some code, here's a more organized list of ideas.

  • My character type would be neither a byte nor a grapheme, but a code point. I named it rune, like the type in the Go language.
  • A string as a series of decomposed runes, thus making indexing and slicing O(1).
  • Because a rune is now a class and not a primitive, it could be expanded with methods for detecting Unicode whitespace: mystring[0].is_whitespace()
  • I still don't know how to handle graphemes.
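As a rough sketch of the whitespace idea, the check below works on a decoded code point value rather than on the 4-byte rune, so it assumes the rune has already been converted to a char32_t. The free function is_whitespace is hypothetical; the ranges are the code points that carry the Unicode White_Space property.

```cpp
// Hypothetical free function (not from the original post): true if `cp`
// has the Unicode White_Space property. Takes a decoded code point, not
// the raw UTF-8 bytes.
constexpr bool is_whitespace(char32_t cp) {
    return (cp >= 0x0009 && cp <= 0x000D) ||  // tab, LF, VT, FF, CR
           cp == 0x0020 ||                    // space
           cp == 0x0085 ||                    // next line
           cp == 0x00A0 ||                    // no-break space
           cp == 0x1680 ||                    // ogham space mark
           (cp >= 0x2000 && cp <= 0x200A) ||  // en quad .. hair space
           cp == 0x2028 || cp == 0x2029 ||    // line / paragraph separator
           cp == 0x202F || cp == 0x205F ||    // narrow NBSP, medium math space
           cp == 0x3000;                      // ideographic space
}
```

Note that U+200B (zero width space) is deliberately absent: despite its name, it does not have the White_Space property.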

Curious fact! An odd thing about the way I built the prototype of the rune class is that it always prints as UTF-8. Because my rune is not an int32 but a 4-byte string, it ends up having some interesting properties.

My code:

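#include <cstddef>
#include <ostream>
#include <string>

class rune {
    // Up to 4 UTF-8 bytes, zero-padded on the right.
    char data[4] {};

public:
    rune(char c) {
        data[0] = c;
    }

    // This constructor needs a string, a start position and a byte count.
    rune(std::string const & s, std::size_t p, std::size_t n) {
        // Copy at most 4 bytes and never read past the end of s.
        for (std::size_t i = 0; i < n && i < 4 && p + i < s.size(); ++i) {
            data[i] = s[p + i];
        }
    }

    void swap(rune & other) {
        rune t = *this;
        *this = other;
        other = t;
    }

    // Output as UTF-8: write bytes until the first '\0' padding byte.
    friend std::ostream & operator <<(std::ostream & output, rune input) {
        for (std::size_t i = 0; i < 4; ++i) {
            if (input.data[i] == '\0') {
                return output;
            }
            output << input.data[i];
        }
        return output;
    }
};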

Error handling ideas:

I don't like to use exceptions in C++. My idea is: if the constructor fails, initialize the rune as four '\0' bytes, then provide an explicit operator bool that returns false if the first byte of the rune happens to be '\0'. Simple and easy to use.
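A minimal sketch of that error-handling idea might look like this. The class name checked_rune and the specific failure conditions (zero bytes, more than 4 bytes, or a range that runs past the end of the string) are assumptions made for the example, not part of the original code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical variant of the rune class illustrating the no-exception idea.
class checked_rune {
    char data[4] {};

public:
    checked_rune(std::string const& s, std::size_t p, std::size_t n) {
        // Reject byte counts that do not fit in 4 bytes and ranges that
        // run past the end of the string.
        if (n == 0 || n > 4 || p > s.size() || n > s.size() - p) {
            return;  // data stays all '\0', which marks the failed state
        }
        for (std::size_t i = 0; i < n; ++i) {
            data[i] = s[p + i];
        }
    }

    // False when the first byte is '\0', i.e. construction failed.
    explicit operator bool() const { return data[0] != '\0'; }
};
```

One consequence of this design is that a rune holding U+0000 is indistinguishable from a failed one, since both start with a '\0' byte.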

So, thoughts? Opinions? Different approaches?

Even if my rune string is too much, at least I have a rune type. Small and fast to copy. :)

Answer:

It sounds like you're trying to reinvent the wheel.

There are, of course, two ways you need to think about text:

  • As an array of codepoints
  • As an encoded array of bytes.

In some codebases, those two representations are the same: text is stored as an array of char32_t or unsigned int, one element per codepoint (effectively UTF-32). In others (I'm inclined to say "most", but don't quote me on that), the encoded array of bytes uses UTF-8, where each codepoint is converted into a variable-length sequence of one to four bytes before being placed into the data structure.

And of course many codebases simply ignore Unicode entirely and store their data in ASCII. I don't recommend that.

For your purposes, while it does make sense to write a class to "wrap around" your data (though I wouldn't call it a rune, I'd probably just call it a codepoint), you'll want to think about your semantics.

  • You can (and probably should) treat every std::string as a UTF-8 encoded string, and prefer this as your default interface for dealing with text. It's safe for most external interfaces (the only time it will fail is when interfacing with UTF-16 input, and you can write corner cases for that), and it'll save you the most memory while still obeying common string conventions (it's lexicographically comparable, which is the big one).
  • If you need to work with your data in codepoint form, then you'll want to write a struct (or class) called codepoint, with useful functions and constructors along the lines of the code further down.
    • While I have had to write code that handles text in codepoint form (notably for a font renderer), this is probably not how you should store your text. Storing text as codepoints leads to problems later on when you're constantly comparing against UTF-8 or ASCII encoded strings.

code:

struct codepoint {
    char32_t val;
    codepoint(char32_t _val = 0) : val(_val) {}
    codepoint(std::string const& s);
    codepoint(std::string::const_iterator begin, std::string::const_iterator end);
    //I don't know the UTF-8→codepoint conversion off-hand. There are lots of places
    //online that show how to do this

    std::string to_utf8() const;
    //Again, look up an algorithm. They're not *too* complicated.
    void append_to_string_as_utf8(std::string & s) const;
    //This might be more performant if you're trying to reduce how many dynamic memory 
    //allocations you're making.

    //codepoint(std::wstring const& s);
    //std::wstring to_utf16() const;
    //void append_to_string_as_utf16(std::wstring & s) const;

    //Anything else you need, equality operator, comparison operator, etc.
};
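
The struct above leaves the conversions to be looked up. As a rough sketch of what they might look like, here are two free functions that follow the standard UTF-8 bit layout. The names append_utf8 and decode_utf8 and the bool-return error convention are assumptions made for this example, not part of the answer's interface.

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of `cp` to `s`. Returns false and leaves `s`
// untouched if `cp` is a surrogate or lies above U+10FFFF.
bool append_utf8(std::string& s, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false;
    if (cp < 0x80) {
        s += static_cast<char>(cp);
    } else if (cp < 0x800) {
        s += static_cast<char>(0xC0 | (cp >> 6));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        s += static_cast<char>(0xE0 | (cp >> 12));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        s += static_cast<char>(0xF0 | (cp >> 18));
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}

// Decodes the code point that starts at byte offset `pos`. On success,
// stores it in `out`, advances `pos` past it, and returns true. On failure,
// returns false and leaves `pos` and `out` unchanged.
bool decode_utf8(std::string const& s, std::size_t& pos, char32_t& out) {
    if (pos >= s.size()) return false;
    unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return false;                        // stray continuation or invalid lead byte
    if (len > s.size() - pos) return false;   // sequence is truncated
    for (std::size_t i = 1; i < len; ++i) {
        unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) return false; // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong encodings, surrogates, and values above U+10FFFF.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_for_len[len] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF)) {
        return false;
    }
    out = cp;
    pos += len;
    return true;
}
```

With helpers like these, codepoint::append_to_string_as_utf8 could simply forward to append_utf8(s, val), and the iterator-based constructor could be built on the same decoding logic.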