Character Encoding Operations in Java

Tani ·

Updated: 2024-09-21


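In the pipeline Java source file --> javac compile --> class file --> java run --> getBytes() --> new String() --> display, every step involves an encoding conversion. The conversion always happens; sometimes it simply runs with default parameters.

When you write a Java source file you have to choose its encoding, that is, the encoding in which the source text is saved as a file in the operating system.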

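To compile a source file into a class file, javac first has to read the source, which means decoding the bytes it reads with some encoding. You can specify that encoding with javac -encoding; if you do not, javac uses the platform default encoding (UTF-8 by default since JDK 18). String constants are written into the class file in a modified UTF-8 form, and at run time the JVM holds every String as Unicode (UTF-16 code units).

For example, suppose Test.java defines String str="中文"; and the source file is saved as utf-8. The bytes of "中文" in Test.java are then in utf-8 form (-28 -72 -83 -26 -106 -121). Compiling with javac -encoding utf-8 Test.java reads Test.java as utf-8, and once the class is loaded, "中文" exists in memory in Unicode form (78 45 101 -121).

At run time, then, "中文" is held in Unicode form (78 45 101 -121), while input and output use the operating system's default encoding unless told otherwise. If you call str.getBytes() without an encoding, the bytes you get are converted from Unicode to the system default encoding; if you specify one, as in str.getBytes("utf-8"), they are converted from Unicode to utf-8.

When new String(bytes[, encode]) runs without an encoding, the bytes are interpreted using the operating system's default encoding; with an encoding, they are interpreted using the one you specify. Either way, the resulting String is still held as Unicode inside Java. A later String.getBytes([encode]) call performs the conversion Unicode characters --> characters in encode --> bytes.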
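The conversions described above can be sketched as follows. This is a minimal illustration rather than part of the original listing: the class name ExplicitCharsetDemo and the helper roundTrip are made up for this example, and it uses the StandardCharsets constants instead of charset name strings so that no checked exception has to be handled. The string is written with \u escapes so the result does not depend on how the source file itself is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ExplicitCharsetDemo {
    // Hypothetical helper: encode a string to bytes and decode it again with the same charset.
    public static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);   // Unicode -> bytes in the given charset
        return new String(encoded, charset);       // bytes -> Unicode again
    }

    public static void main(String[] args) {
        String str = "\u4E2D\u6587"; // "中文"

        // Unicode -> utf-8 bytes
        System.out.println(Arrays.toString(str.getBytes(StandardCharsets.UTF_8)));
        // prints [-28, -72, -83, -26, -106, -121]

        // Unicode -> UTF-16 big-endian bytes, without a byte order mark
        System.out.println(Arrays.toString(str.getBytes(StandardCharsets.UTF_16BE)));
        // prints [78, 45, 101, -121]

        // The charset that getBytes() and new String(bytes) fall back to
        System.out.println(Charset.defaultCharset());

        // Encoding and decoding with the same charset restores the string
        System.out.println(roundTrip(str, StandardCharsets.UTF_8).equals(str)); // prints true
    }
}
```

Passing the charset explicitly on both sides keeps the result independent of the machine's default encoding.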

The following code illustrates these concepts in detail:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

public class IOTest {
    private static String str = "中文";

    public static void main(String[] args) throws Exception {
        // Default encoding of the running JVM
        System.out.println(System.getProperty("file.encoding"));
        testChar();

        // Unicode -> bytes in various encodings
        printBytes(str.getBytes("utf-8"));
        printBytes(str.getBytes("unicode"));
        printBytes(str.getBytes("gb2312"));
        printBytes(str.getBytes("iso8859-1"));
        printBytes("ABC".getBytes("iso8859-1"));

        // bytes -> String, with and without the correct encoding
        byte[] bytes = {-28, -72, -83, -26, -106, -121};
        System.out.println(getStringFromBytes(bytes, "utf-8"));
        byte[] bytes1 = {-2, -1, 78, 45, 101, -121};
        System.out.println(getStringFromBytes(bytes1, "unicode"));
        System.out.println(new String(bytes1)); // decoded with the default encoding

        // Raw bytes of text files, then their decoded content
        readBytesFromFile("C:/D/charset/utf8.txt");
        readBytesFromFile("C:/D/charset/gb2312.txt");
        readStringFromFile("C:/D/charset/utf8.txt", "utf8");
        readStringFromFile("C:/D/charset/gb2312.txt", "gb2312");
    }

    public static void testChar() throws Exception {
        char c = '中';
        int i = c;
        System.out.println(i);
        System.out.println("\u4E2D");
        printBytes("中".getBytes("unicode"));
    }

    // Print each byte as a signed decimal value
    public static void printBytes(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            System.out.print(" " + bytes[i]);
        }
        System.out.println("");
    }

    public static String getStringFromBytes(byte[] bytes, String charset) throws Exception {
        return new String(bytes, charset);
    }

    // Unused; despite its name it only prints the bytes, like printBytes
    public static void writeTofile(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            System.out.print(" " + bytes[i]);
        }
        System.out.println("");
    }

    // Print the raw bytes of a file, read 10 bytes at a time
    public static void readBytesFromFile(String fileName) throws Exception {
        try (FileInputStream fin = new FileInputStream(fileName)) {
            byte[] readBytes = new byte[10];
            int n;
            while ((n = fin.read(readBytes)) != -1) {
                for (int i = 0; i < n; i++) {
                    System.out.print(readBytes[i] + " ");
                }
            }
        }
        System.out.println("");
    }

    // Print a text file line by line, decoding it with the given encoding
    public static void readStringFromFile(String fileName, String charset) throws Exception {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

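Java supports many encodings. Inside Java, characters and strings are stored as Unicode: each char is a two-byte UTF-16 code unit (a character outside the Basic Multilingual Plane takes two chars), as the following code shows: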

char c = '中';
int i = c;
System.out.println(i);                  // 20013
System.out.println("\u4E2D");           // 中
printBytes("中".getBytes("unicode"));   // -2 -1 78 45

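20013 in hexadecimal is 4E2D, and its two bytes, 4E and 2D, are 78 and 45 in decimal. The leading -2 -1 (FE FF) is the byte order mark that the "unicode" encoding writes in front of the data.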
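The arithmetic above can be checked with a few lines of code. This is only a sketch; the class name CodeUnitBytes and the helper toBigEndianBytes are hypothetical names introduced for the illustration.

```java
public class CodeUnitBytes {
    // Hypothetical helper: split one UTF-16 code unit into its high and low byte (big-endian order).
    public static byte[] toBigEndianBytes(char c) {
        return new byte[] { (byte) (c >> 8), (byte) (c & 0xFF) };
    }

    public static void main(String[] args) {
        char c = '\u4E2D'; // '中'
        System.out.println((int) c);                // prints 20013
        System.out.println(Integer.toHexString(c)); // prints 4e2d
        byte[] b = toBigEndianBytes(c);
        System.out.println(b[0] + " " + b[1]);      // prints 78 45
    }
}
```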

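To get the system default encoding:
System.out.println(System.getProperty("file.encoding"));

A str held as Unicode can be converted to any other encoding that is able to represent its characters: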

printBytes(str.getBytes("utf-8"));   // -28 -72 -83 -26 -106 -121
printBytes(str.getBytes("unicode")); // -2 -1 78 45 101 -121
printBytes(str.getBytes("gb2312"));  // -42 -48 -50 -60

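It cannot be converted to iso8859-1, because iso8859-1 cannot encode Chinese. Each character is replaced with 63, the code of '?':
printBytes(str.getBytes("iso8859-1")); // 63 63

Bytes can be restored to a string by specifying the correct encoding: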

byte[] bytes = {-28, -72, -83, -26, -106, -121};
System.out.println(getStringFromBytes(bytes, "utf-8"));
byte[] bytes1 = {-2, -1, 78, 45, 101, -121};
System.out.println(getStringFromBytes(bytes1, "unicode"));
System.out.println(new String(bytes1)); // decoded with the default encoding
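Both getBytes and new String replace what they cannot convert without reporting anything, which is why a wrong encoding shows up only as '?' or garbled text. One possible way to detect the problem instead is the java.nio.charset encoder/decoder API. The sketch below is an illustration under that assumption; the class StrictCodec and its two methods are hypothetical names, not part of the original listing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictCodec {
    // True if every character of text can be represented in the given charset.
    public static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Decode bytes, throwing instead of substituting replacement characters.
    public static String decodeStrict(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        String str = "\u4E2D\u6587"; // "中文"
        System.out.println(canEncode(str, StandardCharsets.ISO_8859_1)); // prints false
        System.out.println(canEncode(str, StandardCharsets.UTF_8));      // prints true

        // UTF-16 bytes with a byte order mark are not valid utf-8
        byte[] bytes1 = {-2, -1, 78, 45, 101, -121};
        try {
            System.out.println(decodeStrict(bytes1, StandardCharsets.UTF_8));
        } catch (CharacterCodingException e) {
            System.out.println("not valid utf-8: " + e);
        }
    }
}
```

Checking canEncode before calling getBytes, or decoding strictly when reading untrusted input, turns silent data loss into an explicit error.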

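bytes1 holds "中文" in Unicode. The system default encoding here is utf-8, so new String(bytes1) interprets the Unicode bytes as utf8 and the restored string is garbled.

Now let's look at the byte stream of a text file. We have a utf8-encoded file whose content is "中文". Viewed in a hex editor, its contents are EF BB BF E4 B8 AD E6 96 87.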

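readBytesFromFile("C:/D/charset/utf8.txt"); reads the bytes -17 -69 -65 -28 -72 -83 -26 -106 -121. Of these, -28 -72 -83 -26 -106 -121 is the utf-8 encoding of "中文", and -17 -69 -65 is the header (byte order mark) of the utf8-encoded file. E4 = 256 - 28 = 228: getBytes returns -28, which has the same binary representation as E4.

We also have a gb2312-encoded file whose content is "中文". Viewed in a hex editor, its contents are D6 D0 CE C4.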
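The relation between the signed values Java prints and the hex values an editor shows is just b & 0xFF, which maps -28 to 228 (E4). A small sketch of that conversion, together with a check for the utf8 file header, is given below; the class ByteHex and its method names are hypothetical and exist only for this illustration.

```java
public class ByteHex {
    // Render signed Java bytes the way a hex editor shows them.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // -28 & 0xFF == 228 == 0xE4
        }
        return sb.toString();
    }

    // True if the data starts with the utf-8 byte order mark EF BB BF.
    public static boolean startsWithUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        byte[] fileBytes = {-17, -69, -65, -28, -72, -83, -26, -106, -121};
        System.out.println(toHex(fileBytes));             // prints EF BB BF E4 B8 AD E6 96 87
        System.out.println(startsWithUtf8Bom(fileBytes)); // prints true
        System.out.println(toHex(new byte[] {-42, -48, -50, -60})); // prints D6 D0 CE C4
    }
}
```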

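readBytesFromFile("C:/D/charset/gb2312.txt"); reads the bytes -42 -48 -50 -60, which is the gb2312 encoding of "中文": D6 = 256 - 42 = 214, D0 = 256 - 48 = 208.

To read text content you must specify the correct encoding: read a file with the same encoding it was saved in, and pass that encoding when constructing the InputStreamReader.
readStringFromFile("C:/D/charset/utf8.txt", "utf8");
readStringFromFile("C:/D/charset/gb2312.txt", "gb2312");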
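One detail worth noting: Java's utf-8 decoder does not remove the file header, so the first line read from a file that starts with EF BB BF begins with the character U+FEFF. The following sketch shows one way to read a file with an explicit encoding and drop that character. It swaps in java.nio.file.Files.newBufferedReader for the InputStreamReader used above, and the class CharsetFileReader, the method readLines, and the command-line arguments are all assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class CharsetFileReader {
    // Read all lines of a text file, decoding with the given charset
    // and removing a leading byte order mark character if present.
    public static List<String> readLines(Path path, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Only the very first line can carry the decoded byte order mark
                if (lines.isEmpty() && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java CharsetFileReader <file> <charset-name>
        if (args.length != 2) {
            System.out.println("usage: java CharsetFileReader <file> <charset-name>");
            return;
        }
        for (String line : readLines(Paths.get(args[0]), Charset.forName(args[1]))) {
            System.out.println(line);
        }
    }
}
```

Unlike new String(bytes, charset), the reader returned by Files.newBufferedReader throws an exception on malformed input, so reading a gb2312 file as utf-8 fails loudly instead of producing garbled text.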
