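Data mining is a continually evolving field. According to the CRISP-DM model and other data mining process models, data must be collected before knowledge can be mined and predictive analysis performed. Data collection can include data scraping, which covers web scraping (HTML to text), image-to-text conversion, and video-to-text conversion. Once the data is in text form, text mining techniques are typically used to extract knowledge from it.
This article introduces optical character recognition (OCR), the technology used to convert images to text. Just Another Tesseract Interface (JATI) was developed to convert images into text files and combine them into a body of text data for text mining and natural language processing.
JATI interfaces with the Tesseract OCR engine to convert images to text. The source code is included. This article explains how to interface with the popular open source Tesseract OCR engine from C#.
Running OCR on a whole image is easy, but sometimes only part of the image should be recognized. Restricting OCR to the relevant region can also improve the accuracy of the result. In JATI, the user therefore clicks on the picture and drags to draw a rectangle around the part of interest, and the selected region is then cropped. The steps below show how this is done.
In C#, the System.Drawing library is used to work with images. The following code handles the MouseDown, MouseMove, and MouseUp events of the PictureBox control.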
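using System;
using System.Diagnostics;
using System.Drawing;
using System.Windows.Forms;

// Fields of the form: drag start, current mouse position, and the pen for the selection outline.
int startX, startY, curX, curY;
Pen selPen;

// Remember where the drag started.
void PictureBox1MouseDown(object sender, MouseEventArgs e) {
    try {
        if (e.Button == MouseButtons.Left) {
            Cursor = Cursors.Cross;
            startX = curX = e.X; // reset curX/curY so a plain click leaves no stale selection
            startY = curY = e.Y;
            selPen = new Pen(Color.Red, 1);
        }
        pictureBox1.Refresh();
    } catch (Exception ex) {
        Debug.WriteLine(ex); // log the error instead of silently swallowing it
    }
}

// Redraw the selection outline while the mouse is dragged.
void PictureBox1MouseMove(object sender, MouseEventArgs e) {
    try {
        if (e.Button == MouseButtons.Left) {
            pictureBox1.Refresh();
            Cursor = Cursors.Cross;
            curX = e.X;
            curY = e.Y;
            Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);
            using (Graphics g = pictureBox1.CreateGraphics()) { // dispose the temporary Graphics
                g.DrawRectangle(selPen, rect);
            }
        }
    } catch (Exception ex) {
        Debug.WriteLine(ex);
    }
}

// Crop the selected region and show it in pictureBox2.
void PictureBox1MouseUp(object sender, MouseEventArgs e) {
    try {
        Cursor = Cursors.Arrow;
        // A plain click, or a drag up or to the left, yields no valid crop size.
        if (curX - startX <= 0 || curY - startY <= 0) return;
        Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);
        Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);
        Bitmap _img = new Bitmap(curX - startX, curY - startY);
        using (Graphics g = Graphics.FromImage(_img)) {
            g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
            g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
            g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;
            g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
        }
        OriginalImage.Dispose(); // the scaled copy is no longer needed after the crop
        pictureBox2.Image = _img;
        pictureBox2.SizeMode = PictureBoxSizeMode.Zoom;
        pictureBox2.Width = _img.Width;
        pictureBox2.Height = _img.Height;
    } catch (Exception ex) {
        Debug.WriteLine(ex);
    }
}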
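The code above crops the selected part of the image and places it in pictureBox2. A step-by-step explanation follows.
Create a new Rectangle object for the selection: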
Rectangle rect = new Rectangle(startX, startY, curX - startX, curY - startY);
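The rectangle above has a positive width and height only when the user drags down and to the right. A minimal sketch of one way to accept a drag in any direction is shown below; `SelectionHelper` and `FromCorners` are hypothetical names introduced for illustration and are not part of JATI.

```csharp
using System;
using System.Drawing;

static class SelectionHelper
{
    // Hypothetical helper: builds a rectangle with non-negative width and
    // height from two arbitrary corner points of a mouse drag.
    public static Rectangle FromCorners(int x1, int y1, int x2, int y2)
    {
        int left = Math.Min(x1, x2);
        int top = Math.Min(y1, y2);
        return new Rectangle(left, top, Math.Abs(x2 - x1), Math.Abs(y2 - y1));
    }
}
```

With such a helper, the handlers could call `SelectionHelper.FromCorners(startX, startY, curX, curY)` wherever they currently construct the rectangle directly.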
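Copy the original image into a Bitmap object scaled to the size of pictureBox1, so that the mouse coordinates line up with pixels of the copy (this assumes the image is displayed stretched to fill the control):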
Bitmap OriginalImage = new Bitmap(pictureBox1.Image, pictureBox1.Width, pictureBox1.Height);
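Scaling the image to the control size keeps the arithmetic simple, but it can discard resolution that OCR would benefit from. As a hedged alternative, the selection could be mapped back onto the full-size image instead. The sketch below assumes the image is stretched to fill the control (PictureBoxSizeMode.StretchImage); `CoordinateMapper` and `ToImageCoordinates` are hypothetical names, not part of JATI.

```csharp
using System;
using System.Drawing;

static class CoordinateMapper
{
    // Hypothetical helper: converts a selection given in control coordinates
    // into pixel coordinates of the full-size image. Assumes the image is
    // stretched to fill the whole control.
    public static Rectangle ToImageCoordinates(Rectangle selection, Size controlSize, Size imageSize)
    {
        double scaleX = (double)imageSize.Width / controlSize.Width;
        double scaleY = (double)imageSize.Height / controlSize.Height;
        return new Rectangle(
            (int)Math.Round(selection.X * scaleX),
            (int)Math.Round(selection.Y * scaleY),
            (int)Math.Round(selection.Width * scaleX),
            (int)Math.Round(selection.Height * scaleY));
    }
}
```

For example, a 100 x 50 selection at (10, 20) in a 400 x 300 control showing an 800 x 600 image maps to a 200 x 100 rectangle at (20, 40).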
Create a new Bitmap object with the size of the selection:
Bitmap _img = new Bitmap(curX - startX, curY - startY);
Create a Graphics object that draws onto the new Bitmap:
Graphics g = Graphics.FromImage(_img);
Set the quality properties of the Graphics object:
g.InterpolationMode = System.Drawing.Drawing2D.InterpolationMode.HighQualityBicubic;
g.PixelOffsetMode = System.Drawing.Drawing2D.PixelOffsetMode.HighQuality;
g.CompositingQuality = System.Drawing.Drawing2D.CompositingQuality.HighQuality;
Draw the selected region of the image onto the new Bitmap, which crops it, and show the result in pictureBox2:
g.DrawImage(OriginalImage, 0, 0, rect, GraphicsUnit.Pixel);
pictureBox2.Image = _img;
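When no resampling is needed, the same crop can also be sketched with Bitmap.Clone, which copies a rectangular region directly. The version below first clips the selection to the image bounds, because Clone fails for rectangles that extend outside the bitmap. `CropHelper` and `Crop` are hypothetical names used only for this illustration.

```csharp
using System.Drawing;

static class CropHelper
{
    // Hypothetical helper: returns a copy of the selected region, or null
    // when the selection does not overlap the image at all.
    public static Bitmap Crop(Bitmap source, Rectangle selection)
    {
        Rectangle bounds = new Rectangle(0, 0, source.Width, source.Height);
        Rectangle clipped = Rectangle.Intersect(bounds, selection);
        if (clipped.Width <= 0 || clipped.Height <= 0)
            return null;
        // Copy only the clipped region, keeping the source pixel format.
        return source.Clone(clipped, source.PixelFormat);
    }
}
```

The caller owns the returned Bitmap and should dispose it when it is no longer displayed.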
To get the coordinates of the selection as text, use:
string selCoordinates = "(" + startX.ToString() + "," + startY.ToString() + "," + curX.ToString() + "," + curY.ToString() + ")";
The Tesseract OCR engine converts the image to text. To interface with the engine, include the System.Diagnostics library, which provides the Process class:
using System.Diagnostics;
Save the cropped selection shown in pictureBox2 to the temporary directory (the Directory class requires System.IO):
pictureBox2.Image.Save(Directory.GetCurrentDirectory() + "/JATI/temp/temp.png");
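The Save call above fails if the JATI/temp folder does not exist yet. A minimal sketch of building the path and creating the folder on demand follows; `TempPaths` and `InTempFolder` are hypothetical names, and the JATI/temp layout is taken from the code above.

```csharp
using System.IO;

static class TempPaths
{
    // Hypothetical helper: returns <baseDir>/JATI/temp/<fileName>,
    // creating the folder first if it is missing.
    public static string InTempFolder(string baseDir, string fileName)
    {
        string tempDir = Path.Combine(baseDir, "JATI", "temp");
        Directory.CreateDirectory(tempDir); // does nothing if the folder already exists
        return Path.Combine(tempDir, fileName);
    }
}
```

A call such as `pictureBox2.Image.Save(TempPaths.InTempFolder(Directory.GetCurrentDirectory(), "temp.png"))` would then work on a fresh installation as well.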
Set the input and output files for the Tesseract OCR engine:
string input = Directory.GetCurrentDirectory() + "/JATI/temp/temp.png";
string output = Directory.GetCurrentDirectory() + "/JATI/temp/temp.txt";
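Start the Tesseract process with its arguments and wait for it to finish. The paths are quoted so that directories containing spaces survive, and the output name is passed without its extension because Tesseract appends .txt itself. The -psm flag is the spelling used by Tesseract 3.x; Tesseract 4.0 and later expect --psm:
Process myProcess = Process.Start(Directory.GetCurrentDirectory() + "/JATI/tesseract.exe",
    "--tessdata-dir ./JATI/ \"" + input + "\" \"" + output.Replace(".txt", "") + "\""
    + " -l " + languageTextBox.Text + " -psm " + psmTextBox.Text);
myProcess.WaitForExit();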
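The call above waits for Tesseract to exit but neither checks whether it succeeded nor reads the recognized text. The following is a minimal sketch of both steps, assuming the same command-line layout as above; `TesseractRunner`, `BuildArguments`, and `Run` are hypothetical names, not part of JATI or Tesseract.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class TesseractRunner
{
    // Hypothetical helper: builds the argument string used above, quoting
    // every path so that directories containing spaces are passed intact.
    public static string BuildArguments(string tessdataDir, string inputImage,
                                        string outputBase, string language, string psm)
    {
        return "--tessdata-dir \"" + tessdataDir + "\" \"" + inputImage + "\" \""
             + outputBase + "\" -l " + language + " -psm " + psm;
    }

    // Hypothetical helper: runs the executable without a console window and
    // returns the contents of the text file that Tesseract writes.
    public static string Run(string exePath, string arguments, string outputBase)
    {
        var info = new ProcessStartInfo(exePath, arguments)
        {
            UseShellExecute = false,      // required for redirecting streams
            CreateNoWindow = true,
            RedirectStandardError = true
        };
        using (Process process = Process.Start(info))
        {
            string errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException("tesseract failed: " + errors);
        }
        // Tesseract appends ".txt" to the output base name.
        return File.ReadAllText(outputBase + ".txt");
    }
}
```

The returned string could then be shown in a text box or appended to the collected text data. With Tesseract 4.0 or later, the `-psm` part of the argument string would need to become `--psm`.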