字符集问题的初步探讨(五)

原文链接:

http://www.eygle.com/special/NLS_CHARACTER_SET_05.htm

** 原文发表于itpub技术丛书《Oracle数据库DBA专题技术精粹》,未经许可,严禁转载本文. **

我们知道在导出文件中，记录着导出使用的字符集id,通过查看导出文件头的第2、3个字节，我们可以找到16进制表示的字符集ID,在Windows上，
我们可以使用UltraEdit等工具打开dmp文件，查看其导出字符集::

在Unix上我们可以通过以下命令来查看:

> > cat expdat.dmp | od -x | head >

Oracle提供标准函数，对字符集名称及ID进行转换:

> > > SQL> select nls_charset_id('ZHS16GBK') from dual; >
> NLS_CHARSET_ID('ZHS16GBK') > -------------------------- > 852 >
> 1 row selected. >
> SQL> select nls_charset_name(852) from dual; >
> NLS_CHAR > -------- > ZHS16GBK >
> 1 row selected. >
> 十进制转换十六进制: >
> SQL> select to_char('852','xxxx') from dual; >
> TO_CH > ----- > 354 >
> 1 row selected. >

对应上面的图中第2、3字节，我们知道该导出文件字符集为ZHS16GBk.

查询数据库中有效的字符集可以使用以下脚本:

> > > col nls_charset_id for 9999 > col nls_charset_name for a30 > col hex_id for a20 > select > nls_charset_id(value) nls_charset_id,
> value nls_charset_name, > to_char(nls_charset_id(value),'xxxx') hex_id > from v$nls_valid_values > where parameter = 'CHARACTERSET' > order by nls_charset_id(value) > / >

输出样例如下:

> > > NLS_CHARSET_ID NLS_CHARSET_NAME HEX_ID
> > -------------- ------------------------------ -------------
> > 1 US7ASCII 1
> > 2 WE8DEC 2
> > 3 WE8HP 3
> > 4 US8PC437 4
> > 5 WE8EBCDIC37 5
> > 6 WE8EBCDIC500 6
> > 7 WE8EBCDIC1140 7
> > 8 WE8EBCDIC285 8
> > ...................
> > 850 ZHS16CGB231280 352
> > 851 ZHS16MACCGB231280 353
> > 852 ZHS16GBK 354
> > 853 ZHS16DBCS 355
> > 860 ZHT32EUC 35c
> > 861 ZHT32SOPS 35d
> > 862 ZHT16DBT 35e
> > 863 ZHT32TRIS 35f
> > 864 ZHT16DBCS 360
> > 865 ZHT16BIG5 361
> > 866 ZHT16CCDC 362
> > 867 ZHT16MSWIN950 363
> > 868 ZHT16HKSCS 364
> > 870 AL24UTFFSS 366
> > 871 UTF8 367
> > 872 UTFE 368 > > .................................. > >
>

在很多时候，当我们进行导入操作的时候，已经离开了源数据库，这时如果目标数据库的字符集和导出文件不一致，很多时候就需要进行特殊处理，
以下介绍几种方法，主要以US7ASCII和ZHS16GBK为例

1．源数据库字符集为US7ASCII，导出文件字符集为US7ASCII或ZHS16GBK，目标数据库字符集为ZHS16GBK
在Oracle92中，我们发现对于这种情况，不论怎样处理，这个导出文件都无法正确导入到Oracle9i数据库中，这可能是因为Oracle9i的编码方案发生了较大改变。

以下是我们所做的简单测试，其中导出文件命名规则为:

S-Server ，后跟Server字符集
C-client , 后跟导出操作时客户端字符集

导入时客户端字符集设置在命令行完成，限于篇幅，我们省略了部分测试过程。
对于Oracle9iR2,我们的测试结果是US7ASCII字符集，不管怎样转换，都无法正确导入ZHS16GBK字符集的数据库中。

在进行导入操作时，如果字符不能正常转换，Oracle数据库会自动用一个”?”代替，也就是编码63。

> > > E:\nls2>set NLS_LANG=AMERICAN_AMERICA.US7ASCII >
> E:\nls2>imp eygle/eygle file=Sus7ascii-Cus7ascii.dmp fromuser=eygle touser=eygle tables=test >
> Import: Release 9.2.0.4.0 - Production on Mon Nov 3 17:14:39 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> Export file created by EXPORT:V09.02.00 via conventional path > import done in US7ASCII character set and AL16UTF16 NCHAR character set > import server uses ZHS16GBK character set (possible charset conversion) > . . importing table "TEST" 2 rows imported > Import terminated successfully without warnings. >
> E:\nls2>sqlplus eygle/eygle >
> SQLPlus: Release 9.2.0.4.0 - Production on Mon Nov 3 17:14:50 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: > Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> SQL> select name,dump(name) from test; >
> NAME DUMP(NAME) > ----------------------------- > ???? Typ=1 Len=4: 63,63,63,63 > test Typ=1 Len=4: 116,101,115,116 >
> 2 rows selected. >
> SQL> exit > Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> E:\nls2> set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK >
> E:\nls2>imp eygle/eygle file=Sus7ascii-Cus7ascii.dmp fromuser=eygle touser=eygle tables=test ignore=y >
> Import: Release 9.2.0.4.0 - Production on Mon Nov 3 17:15:28 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> Export file created by EXPORT:V09.02.00 via conventional path > import done in ZHS16GBK character set and AL16UTF16 NCHAR character set > export client uses US7ASCII character set (possible charset conversion) > . . importing table "TEST" 2 rows imported > Import terminated successfully without warnings. >
> E:\nls2>sqlplus eygle/eygle >
> SQLPlus: Release 9.2.0.4.0 - Production on Mon Nov 3 17:15:34 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: > Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> SQL> select name,dump(name) from test; >
> NAME DUMP(NAME) > -------------------------------------------------------------------------------- > ???? Typ=1 Len=4: 63,63,63,63 > test Typ=1 Len=4: 116,101,115,116 > ???? Typ=1 Len=4: 63,63,63,63 > test Typ=1 Len=4: 116,101,115,116 >
>
> 4 rows selected. >
> SQL> drop table test; >
> Table dropped. >
> SQL> exit > Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> E:\nls2> set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK >
> E:\nls2>imp eygle/eygle file=Sus7ascii-Czhs16gbk.dmp fromuser=eygle touser=eygle tables=test ignore=y >
> Import: Release 9.2.0.4.0 - Production on Mon Nov 3 17:17:21 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> Export file created by EXPORT:V09.02.00 via conventional path > import done in ZHS16GBK character set and AL16UTF16 NCHAR character set > . . importing table "TEST" 2 rows imported > Import terminated successfully without warnings. >
> E:\nls2>sqlplus eygle/eygle >
> SQLPlus: Release 9.2.0.4.0 - Production on Mon Nov 3 17:17:30 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: > Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> SQL> select name,dump(name) from test; >
> NAME DUMP(NAME) > ---------------------------------------------- > ???? Typ=1 Len=4: 63,63,63,63 > test Typ=1 Len=4: 116,101,115,116 >
> 2 rows selected. >
> SQL> exit > Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> E:\nls2> set NLS_LANG=AMERICAN_AMERICA.US7ASCII >
> E:\nls2>imp eygle/eygle file=Sus7ascii-Czhs16gbk.dmp fromuser=eygle touser=eygle tables=test ignore=y >
> Import: Release 9.2.0.4.0 - Production on Mon Nov 3 17:18:00 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> Export file created by EXPORT:V09.02.00 via conventional path > import done in US7ASCII character set and AL16UTF16 NCHAR character set > import server uses ZHS16GBK character set (possible charset conversion) > export client uses ZHS16GBK character set (possible charset conversion) > . . importing table "TEST" 2 rows imported > Import terminated successfully without warnings. >
> E:\nls2>sqlplus eygle/eygle >
> SQLPlus: Release 9.2.0.4.0 - Production on Mon Nov 3 17:18:08 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: > Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> SQL> select name,dump(name) from test; >
> NAME DUMP(NAME) > ---------------------------------------- > ???? Typ=1 Len=4: 63,63,63,63 > test Typ=1 Len=4: 116,101,115,116 > ???? Typ=1 Len=4: 63,63,63,63 > test Typ=1 Len=4: 116,101,115,116 >
> 4 rows selected. >
> SQL> >
>

对于这种情况，我们可以通过使用Oracle8i的导出工具，设置导出字符集为US7ASCII，导出后修改第二、三字符，修改 0001 为
0354,这样就可以将US7ASCII字符集的数据正确导入到ZHS16GBK的数据库中。

修改导出文件:

导入修改后的导出文件:

> > > E:\nls2>set NLS_LANG=AMERICAN_AMERICA.ZHS16GBK >
> E:\nls2>imp eygle/eygle file=Sus7ascii-Cus7ascii-exp817.dmp fromuser=eygle touser=eygle tables=test >
> Import: Release 9.2.0.4.0 - Production on Mon Nov 3 17:37:17 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> Export file created by EXPORT:V08.01.07 via conventional path > import done in ZHS16GBK character set and AL16UTF16 NCHAR character set > export server uses UTF8 NCHAR character set (possible ncharset conversion) > . . importing table "TEST" 2 rows imported > Import terminated successfully without warnings. >
> E:\nls2>sqlplus eygle/eygle >
> SQL*Plus: Release 9.2.0.4.0 - Production on Mon Nov 3 17:37:23 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: > Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> SQL> select name,dump(name) from test; >
> NAME DUMP(NAME) > -------------------------------------------------------------------------------- > 测试 Typ=1 Len=4: 178,226,202,212 > Test Typ=1 Len=4: 116,101,115,116 >
> 2 rows selected. >
> SQL> >

2．使用create database的方法
如果导出文件使用的字符集是US7ASCII，目标数据库的字符集是ZHS16GBK,我们可以使用create database的方法来修改，具体如下:

> > > SQL> col parameter for a30 > SQL> col value for a30 > SQL> select * from v$nls_parameters; >
> PARAMETER VALUE > ------------------------------ ------------------------------ > NLS_LANGUAGE AMERICAN > NLS_TERRITORY AMERICA > NLS_CURRENCY $ > NLS_ISO_CURRENCY AMERICA > NLS_NUMERIC_CHARACTERS ., > NLS_CALENDAR GREGORIAN > NLS_DATE_FORMAT DD-MON-RR > NLS_DATE_LANGUAGE AMERICAN > NLS_CHARACTERSET ZHS16GBK > NLS_SORT BINARY > ………………. >
> 19 rows selected. >
> SQL> create database character set us7ascii; > create database character set us7ascii > * > ERROR at line 1: > ORA-01031: insufficient privileges >
>
> SQL> select * from v$nls_parameters; >
> PARAMETER VALUE > ------------------------------ ------------------------------ > NLS_LANGUAGE AMERICAN > NLS_TERRITORY AMERICA > NLS_CURRENCY $ > NLS_ISO_CURRENCY AMERICA > NLS_NUMERIC_CHARACTERS ., > NLS_CALENDAR GREGORIAN > NLS_DATE_FORMAT DD-MON-RR > NLS_DATE_LANGUAGE AMERICAN > NLS_CHARACTERSET US7ASCII > NLS_SORT BINARY > ………….. >
> 19 rows selected. >
> SQL> exit > Disconnected from Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> E:\nls2> set nls_lang=AMERICAN_AMERICA.US7ASCII >
> E:\nls2>imp eygle/eygle file=Sus7ascii-Cus7ascii.dmp fromuser=eygle touser=eygle >
> Import: Release 9.2.0.4.0 - Production on Sun Nov 2 14:53:26 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> Export file created by EXPORT:V09.02.00 via conventional path > import done in US7ASCII character set and AL16UTF16 NCHAR character set > import server uses ZHS16GBK character set (possible charset conversion) > . . importing table "TEST" 2 rows imported > Import terminated successfully without warnings. >
> E:\nls2>sqlplus eygle/eygle >
> SQL*Plus: Release 9.2.0.4.0 - Production on Sun Nov 2 14:53:35 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: > Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> SQL> select * from test; >
> NAME > ---------- > 测试 > test >
> 2 rows selected. >
>

我们看到，当发出 create database character set us7ascii ；命令时，数据库v$nls_parameters中的字符集设置随之更改，该参数影响导入进程，
更改后可以正确导入数据，重起数据库后，该设置恢复。

提示:v$nls_paraemters来源于x$nls_parameters,该动态性能视图影响导入操作；而nls_database_parameters来源于props$数据表，影响数据存储。

3． Oracle提供的字符扫描工具csscan

我们说以上的方法只是应该在不得已的情况下使用，其本质是欺骗数据库，强制导入数据，可能损失元数据。
如果要确保数据的完整性，应该使用csscan扫描数据库，找出所有不兼容的字符，然后通过编写相应的脚本及代码，在转换之后进行更新，确保数据的正确性。
我们简单看一下csscan的使用。

要使用csscan之前，需要以sys用户身份创建相应数据字典对象:

> > > E:\nls2>sqlplus "/ as sysdba" >
> SQL*Plus: Release 9.2.0.4.0 - Production on Sun Nov 2 19:42:07 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Connected to: > Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> SQL> select instance_name from v$intance; > select instance_name from v$intance > * > ERROR at line 1: > ORA-00942: table or view does not exist >
>
> SQL> select instance_name from v$instance; >
> INSTANCE_NAME > ---------------- > penny >
> 1 row selected. >
> SQL> @?/rdbms/admin/csminst.sql >
> User created. >
>
> Grant succeeded. >
> ……….. >
>

这个脚本创建相应用户(csmig)及数据字典对象，扫描信息会记录在相应的数据字典表里。

我们可以在命令行调用这个工具对数据库进行扫描:

> > > E:\nls2>csscan FULL=Y FROMCHAR=ZHS16GBK TOCHAR=US7ASCII LOG=US7check.log CAPTURE=Y ARRAY=1000000 PROCESS=2 >
>
> Character Set Scanner v1.1 : Release 9.2.0.1.0 - Production on Sun Nov 2 20:24:45 2003 >
> Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. >
>
> Username: eygle/eygle >
> Connected to: > Oracle9i Enterprise Edition Release 9.2.0.4.0 - Production > With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options > JServer Release 9.2.0.4.0 - Production >
> Enumerating tables to scan... >
> . process 1 scanning SYS.SOURCE$[AAAABHAABAAAAIRAAA] > . process 2 scanning SYS.ATTRIBUTE$[AAAAEoAABAAAAhZAAA] > . process 2 scanning SYS.PARAMETER$[AAAAEoAABAAAAhZAAA] > . process 2 scanning SYS.METHOD$[AAAAEoAABAAAAhZAAA] > …….. > . process 2 scanning SYSTEM.DEF$_AQERROR[AAAA8fAABAAACWJAAA] > . process 1 scanning WMSYS.WM$ENV_VARS[AAABeWAABAAAFMZAAA] > …………………. > . process 2 scanning SYS.UGROUP$[AAAAA5AABAAAAGpAAA] > . process 2 scanning SYS.CON$[AAAAAcAABAAAACpAAA] > . process 1 scanning SYS.FILE$[AAAAARAABAAAABxAAA] >
> Creating Database Scan Summary Report... >
> Creating Individual Exception Report... >
> Scanner terminated successfully. >

然后我们可以检查输出的日志来查看数据库扫描情况:

> > > Database Scan Individual Exception Report >
>
> [Database Scan Parameters] >
> Parameter Value
> ------------------------------ ------------------------------------------------ > Scan type Full database
> Scan CHAR data? YES
> Current database character set ZHS16GBK
> New database character set US7ASCII
> Scan NCHAR data? NO
> Array fetch buffer size 1000000
> Number of processes 2
> Capture convertible data? YES
> ------------------------------ ------------------------------------------------ >
> [Data Dictionary individual exceptions] >
>
> [Application data individual exceptions] >
> User : EYGLE > Table : TEST > Column: NAME > Type : VARCHAR2(10) > Number of Exceptions : 1
> Max Post Conversion Data Size: 4
>
> ROWID Exception Type Size Cell Data(first 30 bytes)
> ------------------ ------------------ ----- ------------------------------ > AAABpIAADAAAAAMAAA lossy conversion 测试
> ------------------ ------------------ ----- ------------------------------ >

不能转换的数据将会被记录下来，我们可以根据这些信息在转换之后，对数据进行相应的更新，确保转换无误。