Unicode (UTF-8) reading and writing to files in Python

I'm having some brain failure in understanding how to read and write text in a file (Python 2.4).

  # The string, which has an a-acute in it.
  ss = u'Capit\xe1n'
  ss8 = ss.encode('utf8')
  repr(ss), repr(ss8)

  ("u'Capit\xe1n'", "'Capit\xc3\xa1n'")

  print ss, ss8
  print >> open('f1', 'w'), ss8

  >>> file('f1').read()
  'Capit\xc3\xa1n\n'

So I type Capit\xc3\xa1n into my favorite editor, in file f2.

Then:

  >>> open('f1').read()
  'Capit\xc3\xa1n\n'
  >>> open('f2').read()
  'Capit\\xc3\\xa1n\n'
  >>> open('f1').read().decode('utf8')
  u'Capit\xe1n\n'
  >>> open('f2').read().decode('utf8')
  u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

What I'm really failing to grasp is the point of the UTF-8 representation if you can't actually get Python to recognize it when it comes from outside. Maybe I should just dump the string to JSON and use that instead, since that has an ASCII-safe representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode when it comes in from a file? If so, how do I get it?

  >>> print simplejson.dumps(ss)
  "Capit\u00e1n"
  >>> print >> file('f3', 'w'), simplejson.dumps(ss)
  >>> simplejson.load(open('f3'))
  u'Capit\xe1n'

In the notation

  u'Capit\xe1n\n'

"\xe1" represents just one character: "\x" tells you that "e1" is a hexadecimal value. When you type

  Capit\xc3\xa1n

into your file, the file literally contains "\xc3". Those are 4 bytes, and your code reads them all. You can see this when you display them:

  >>> open('f2').read()
  'Capit\\xc3\\xa1n\n'

You can see that the backslash is itself escaped by a backslash. So you have four bytes in the string: "\", "x", "c" and "3".
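The difference between an escape sequence in source code and the same characters typed into a file can be checked directly. The following is a minimal sketch in Python 3 syntax (an assumption, since the session above is Python 2.4); the variable names are illustrative only.

```python
# An escape sequence inside a literal vs. the same characters typed as text.
one_byte = b'\xc3'       # the interpreter turns \xc3 into a single byte
typed_text = b'\\xc3'    # what an editor saves when you type \xc3

print(len(one_byte))     # 1
print(len(typed_text))   # 4

# The four bytes are an ordinary backslash, 'x', 'c' and '3'.
print([chr(b) for b in typed_text])
```

Only the interpreter, while parsing a string literal, gives "\x" its special meaning; bytes read from a file are never parsed that way.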

Edit:

As others have pointed out in their answers, you should simply type the characters themselves into the editor; the editor should then handle the conversion to UTF-8 when it saves the file.
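On the Python side, the matching step is to let the file object do the encoding and decoding. Here is a minimal sketch, assuming Python 2.6 or later (or Python 3), where io.open accepts an encoding argument; on Python 2.4, codecs.open(path, mode, 'utf-8') plays the same role. The scratch path is hypothetical and exists only for the illustration.

```python
import io
import os
import tempfile

text = u'Capit\xe1n'  # the string from the question, without the newline

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'f1.txt')

# io.open encodes on write and decodes on read, so the program itself
# only ever handles Unicode text.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with io.open(path, 'r', encoding='utf-8') as f:
    read_back = f.read()

# Binary mode shows the raw UTF-8 bytes that are actually on disk.
with io.open(path, 'rb') as f:
    raw = f.read()

print(read_back == text)  # True
print(repr(raw))
```

A file saved as UTF-8 by an editor can be read the same way, provided the editor really wrote the character á and not the typed-out escape sequence.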

If you really do have a string in this format, you can use the string_escape codec to decode it into a normal string:

  In [15]: print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
  Capitán

The result is a string encoded in UTF-8, in which the accented character is represented by the two bytes that were written as \\xc3\\xa1 in the original string. If you want a Unicode string, you have to decode it again with UTF-8.
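The string_escape codec exists only in Python 2. For readers on Python 3, the same two-step decode could be sketched roughly as follows; treating unicode_escape plus a latin-1 round trip as a stand-in for string_escape is an assumption of this sketch, not something from the original answer.

```python
# Bytes as read from a file in which the escapes were typed out literally.
escaped = b'Capit\\xc3\\xa1n\n'

# Step 1: interpret the backslash escapes. In Python 3 this yields a str
# whose characters U+00C3 and U+00A1 stand for the two raw byte values.
step1 = escaped.decode('unicode_escape')

# Step 2: map those characters back to bytes one-for-one (latin-1),
# then decode the resulting UTF-8 bytes into real text.
utf8_bytes = step1.encode('latin-1')
text = utf8_bytes.decode('utf-8')

print(repr(utf8_bytes))
print(ascii(text))
```

This only works when every escaped value is a byte in the range 0x00 to 0xFF, which is the case for hand-typed UTF-8 escapes like the ones above.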

To your edit: you do not have UTF-8 in your file. To see what it would actually look like:

  s = u'Capit\xe1n\n'
  sutf8 = s.encode('UTF-8')
  open('utf-8.out', 'w').write(sutf8)

Compare the contents of the file utf-8.out with the contents of the file you saved with your editor.
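That comparison can also be done from Python by reading both files in binary mode. The sketch below uses Python 3 syntax and a temporary directory, both of which are assumptions; the second file is a stand-in for the one saved from the editor.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()  # scratch directory, illustration only

# What Python writes when it encodes the Unicode string as UTF-8.
utf8_path = os.path.join(workdir, 'utf-8.out')
with open(utf8_path, 'wb') as f:
    f.write(u'Capit\xe1n\n'.encode('utf-8'))

# Stand-in for the editor file: the escape sequences typed as plain text.
typed_path = os.path.join(workdir, 'f2')
with open(typed_path, 'wb') as f:
    f.write(b'Capit\\xc3\\xa1n\n')

with open(utf8_path, 'rb') as f:
    utf8_bytes = f.read()
with open(typed_path, 'rb') as f:
    typed_bytes = f.read()

print(len(utf8_bytes), len(typed_bytes))  # 9 15
print(utf8_bytes == typed_bytes)          # False
```

The UTF-8 file stores á as two bytes, while the typed file spends eight bytes spelling out the two escape sequences.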
