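BIG5 to UTF-8 (Unicode) conversion tools for databases and files
Sharing a few tools I wrote to convert the Vovo2000.com database from BIG-5/LATIN1 to UTF-8.
They rely on the Multibyte String (mbstring) extension in PHP 5.3/5.4, mainly mb_convert_encoding.
Environment setup: https://vovo2000.com/f/viewtopic-355829.html
If you are on Windows,
you will probably need to make a few small changes to the script, since it calls /usr/bin/file and writes to /tmp/.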
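For Windows users without /usr/bin/file, one possible replacement for the detection step is to check the bytes directly in PHP. This is only a sketch: needs_conversion() is a hypothetical helper that is not part of the original script, and like any detection it is a heuristic, because a short BIG5 string can happen to be valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the original script.
// Returns true when $bytes is neither pure ASCII nor valid UTF-8,
// i.e. when the data is a candidate for BIG5 --> UTF-8 conversion.
function needs_conversion($bytes)
{
    // Pure ASCII: no byte has the high bit set, nothing to convert.
    if (!preg_match('/[\x80-\xFF]/', $bytes)) {
        return false;
    }
    // Already valid UTF-8: leave it alone.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return false;
    }
    return true;
}

var_dump(needs_conversion('plain ascii'));       // bool(false)
var_dump(needs_conversion("\xA5\x5C\xA4\xD2"));  // raw BIG5 bytes: bool(true)
```

You would call it on file_get_contents($filename) in place of the `file -b` check; the $force option remains useful for the cases where the heuristic guesses wrong.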
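Usage:
Code:
#
# Convert a single BIG5 file
#
$ /usr/bin/php big5_to_utf8_conv_xargs.php <ORIGINAL_BIG5_FILES>
#
# Convert many BIG5 files at once with find
#
$ find . -name "*.htm" | xargs /usr/bin/php big5_to_utf8_conv_xargs.php
After conversion, each output file gets a ".UTF-8" suffix, e.g. a.htm --> a.htm.UTF-8
You can then use the Perl-based rename tool to restore the original names (note that -f overwrites the original files)
Code: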
$ find . -name "*.htm.UTF-8" | /usr/bin/rename -f 's/\.UTF-8$//'
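big5_to_utf8_conv_xargs.php (also strips the stray backslashes after characters such as 淚許蓋功餐)
big5_to_utf8_conv_xargs.php has a few options you can change to suit your needs
Code:
$from_char = 'BIG-5'; // source encoding
$to_char = 'UTF-8'; // target encoding
$tmp_folder = '/tmp/'; // temporary directory
$encoding_detect_tool = '/usr/bin/file -b';
$big5_exception_all ==> You have almost certainly already worked around BIG5 characters such as 淚許蓋功餐, whose second BIG5 byte is 0x5C (the backslash), by escaping them. After conversion your data may therefore contain backslashes, as in 「功\夫,淚\眼」, and they have to be stripped again. Processing every affected character takes too long, so the script only handles a subset of common characters ($big5_exception); the full list is kept in the commented-out $big5_exception_all.
$force ==> Force the conversion without checking the detected encoding (the detection itself can be wrong)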
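To see where those backslashes come from, the following sketch reproduces the problem on a two-character string. It assumes the source file itself is saved as UTF-8 and that the old data was escaped with addslashes(); your database may have been escaped by a different mechanism, so treat this as an illustration only.

```php
<?php
// 功 is A5 5C in BIG5: its second byte is the ASCII backslash.
$big5 = mb_convert_encoding('功夫', 'BIG-5', 'UTF-8');
echo strtoupper(bin2hex(substr($big5, 0, 2))), "\n";   // A55C

// A byte-oriented escape sees the 0x5C and doubles it.
$escaped = addslashes($big5);

// Converting the escaped bytes leaves a visible backslash after 功.
$utf8 = mb_convert_encoding($escaped, 'UTF-8', 'BIG-5');
echo $utf8, "\n";                                      // 功\夫

// The cleanup is simply "character + backslash" --> "character".
$fixed = str_replace('功\\', '功', $utf8);
echo $fixed, "\n";                                     // 功夫
```

The script does the same cleanup for every character listed in $big5_exception.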
Code:
<?php
/*
 * php_big5_to_utf8_conv.php
 *
 * ------
 * Usage:
 * ------
 * # /usr/bin/php php_big5_to_utf8_conv.php <FILENAME/SQL_DUMP>
 *
 * The output file name is the input name with "." and $to_char appended,
 * e.g.
 * mysqldb.sql --> mysqldb.sql.UTF-8
 *
 * ------
 * Notes:
 * ------
 * => Mainly converts BIG5 databases or text/htm/php files to UTF-8;
 *    in theory it can convert other encodings if you change $from_char or $to_char.
 *
 * Rev: 0.1 ... service@vovo2000.com
 */
$from_char = 'BIG-5';
$to_char = 'UTF-8';
$tmp_folder = '/tmp/';
$encoding_detect_tool = '/usr/bin/file -b';
// Force conversion: set to '--force' to skip the encoding check
// $force = '--force';
$force = '';

$usage = 'Usage:
# -------------------------------------------------------------------
# Recursively convert all *.htm files under this folder
# -------------------------------------------------------------------
#
# find . -name "*.htm" | xargs /usr/bin/php /tmp/big5_to_utf8_conv_xargs_.php
#
#
# -------------------------------------------------------------------
# Recursively rename the converted UTF-8 files to overwrite the original htm
# -------------------------------------------------------------------
#
# find . -name "*.htm.UTF-8" | rename -f "s/\.UTF-8$//"
';

// Without arguments the loop below never runs, so print the usage here.
if (count($argv) < 2)
{
    echo $usage;
    exit(1);
}

for ($j = 1; $j < count($argv); $j++)
{
    $filename = $argv[$j];
    if (!file_exists($filename))
    {
        // Skip missing files instead of aborting the whole batch.
        echo "File not found, skipped: $filename\n";
        continue;
    }

    // Ask the detection tool for the encoding; the arguments are escaped so
    // that file names containing spaces or quotes do not break the command.
    $tmpfile = tempnam($tmp_folder, 'enc');
    $cmd = $encoding_detect_tool.' '.escapeshellarg($filename).' > '.escapeshellarg($tmpfile);
    // echo "Execute $cmd\n";
    exec($cmd);
    $orig_encoding = trim(file_get_contents($tmpfile));
    unlink($tmpfile);
    echo "Detect encoding: $filename => ".$orig_encoding."\n";

    // strpos() can return 0, so always compare against FALSE explicitly.
    $is_target = strpos($orig_encoding, $to_char) !== FALSE;
    $is_ascii = strpos($orig_encoding, 'ASCII') !== FALSE;
    $is_ext_ascii = strpos($orig_encoding, 'extended-ASCII') !== FALSE;

    // Set $force = '--force' above if you want to convert regardless of detection.
    if ((!$is_target && !$is_ascii) || $is_ext_ascii || $force == '--force')
    {
        $new_filename = $filename.'.'.$to_char;
        echo "Converting $filename --> $new_filename\n";
        $str = file_get_contents($filename);
        $new = mb_convert_encoding($str, $to_char, $from_char);
        mb_internal_encoding($to_char);
        if ($from_char == 'BIG-5')
        {
            //
            // Handle 功\ 許\ 淚\ :: i.e. you already worked around them with addslashes in your database!
            //
            /*
            $big5_exception_all = '么功吒吭沔坼歿俞枯苒娉珮豹崤淚許廄琵跚愧稞鈾暝蓋墦穀閱璞餐縷擺黠孀踊髏躡';
            $big5_exception_all .= '尐佢汻岤垥柦胐娖涂罡偅惝牾莍傜揊焮茻鄃幋滜綅赨塿縷槙擺箤踊嫹髏潿蔌醆嬞獦';
            $big5_exception_all .= '佢螏餤燡螰駹礒鎪瀙酀瀵騱酅贕鱋鱭';
            */
            //
            // Got a lot of data? Handle commonly seen characters only.
            //
            $big5_exception = '么功吒吭歿俞枯苒娉珮豹淚許廄琵跚愧鈾蓋穀閱璞餐縷擺黠孀踊髏涂罡傜縷槙擺踊髏礒';
            $big5_array_orig = array();
            $big5_array_new = array();
            $count = mb_strlen($big5_exception, 'UTF-8');
            for ($i = 0; $i < $count; $i++)
            {
                $this_big5_char = mb_substr($big5_exception, $i, 1, 'UTF-8');
                $big5_array_new[$i] = $this_big5_char;
                /* The regex needs an escaped backslash: '\\\\' is the two characters \\ */
                $big5_array_orig[$i] = $this_big5_char.'\\\\';
            }
            //
            // Turn every '功\' back into '功' to undo the BIG5 workaround
            //
            for ($i = 0; $i < $count; $i++)
            {
                //
                // Use 'u' pattern modifier: http://php.net/manual/en/reference.pcre.pattern.modifiers.php
                //
                $new = preg_replace('/'.$big5_array_orig[$i].'/u',
                                    $big5_array_new[$i],
                                    $new);
                echo $big5_array_new[$i].' ';
            }
        }
        file_put_contents($new_filename, $new);
        echo "\nDone: File Length (bytes) ".strlen($str).' --> '.strlen($new)."\n";
    }
    else
    {
        echo "Already an $to_char or pure ASCII file, skip convert\n";
    }
}
?>
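If the per-character preg_replace loop is too slow for the full $big5_exception_all list, a single-pass alternative is to build a replacement map and hand it to strtr(). This is a different technique from the one the script uses, shown here as a sketch only; strip_big5_backslashes() is a hypothetical name and the function expects text that has already been converted to UTF-8.

```php
<?php
// Hypothetical helper, not part of the original script.
// Removes the backslash that follows any character listed in $exceptionChars.
function strip_big5_backslashes($utf8Text, $exceptionChars)
{
    $map = array();
    $count = mb_strlen($exceptionChars, 'UTF-8');
    for ($i = 0; $i < $count; $i++) {
        $ch = mb_substr($exceptionChars, $i, 1, 'UTF-8');
        // Map "character + backslash" to the bare character.
        $map[$ch.'\\'] = $ch;
    }
    // strtr() with an array scans the text once, whatever the map size.
    return strtr($utf8Text, $map);
}

echo strip_big5_backslashes('功\\夫,淚\\眼', '功淚許蓋餐'), "\n";   // 功夫,淚眼
```

Because the text is scanned once regardless of how many characters are in the list, the cost no longer grows with one full pass per exception character.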




