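Florence-2 is an advanced multimodal model architecture that can generate rich descriptions of images. Descriptions like these can feed AI agents that navigate web pages, or help search applications index website content more effectively. This guide walks through how to use Florence-2 to generate text descriptions of website screenshots, and shows how to run the model on your own hardware with HuggingFace Transformers.
This guide uses HuggingFace Transformers and the timm image package to load Florence-2. First, install the required dependencies: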
pip install transformers timm flash_attn einops
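Before loading the model, it can help to confirm that the packages actually installed. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of any library.

```python
import importlib.util

# Import names of the packages installed with pip above.
REQUIRED = ["transformers", "timm", "flash_attn", "einops"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

This only checks that each package can be located. It does not import the packages or verify that a GPU is available.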
With the dependencies installed, you can start generating descriptions of website screenshots using Florence-2's image captioning capability. Create a new Python file and add the following code:
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

model_id = 'microsoft/Florence-2-large'

# Load the model in evaluation mode on the GPU, along with its processor.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval().cuda()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)


def run_example(task_prompt, text_input=None):
    # Note: this function reads the module-level `image`, which must be
    # defined before the function is called.
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input

    # Convert the prompt and image into model inputs.
    inputs = processor(text=prompt, images=image, return_tensors="pt")

    # Generate output tokens with beam search (no sampling).
    generated_ids = model.generate(
        input_ids=inputs["input_ids"].cuda(),
        pixel_values=inputs["pixel_values"].cuda(),
        max_new_tokens=1024,
        early_stopping=False,
        do_sample=False,
        num_beams=3,
    )

    # Keep special tokens so the post-processor can parse the output.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
    return parsed_answer
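This code imports Transformers and loads the large Florence-2 model. It then defines a function that runs Florence-2 on an image and returns the post-processed answer. To run the model on an image, use the following code: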
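image = Image.open("image.jpeg").convert("RGB")

# Assumption: the task token appears to have been lost from this line.
# "<MORE_DETAILED_CAPTION>" is Florence-2's most verbose captioning task;
# "<CAPTION>" and "<DETAILED_CAPTION>" give shorter descriptions.
task_prompt = "<MORE_DETAILED_CAPTION>"

answer = run_example(task_prompt=task_prompt)
print(answer)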
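For captioning tasks, the post-processed answer is typically a dictionary keyed by the task prompt rather than a bare string. The sketch below shows one way to pull out the caption text, assuming that dictionary shape. The helper `extract_caption` is hypothetical, and the sample answer is a hand-written stand-in, not real model output.

```python
def extract_caption(parsed_answer, task_prompt):
    """Return the caption string stored under `task_prompt`.

    Assumes `parsed_answer` is a dict keyed by the task prompt, which is
    the shape expected here for caption-style tasks.
    """
    if not isinstance(parsed_answer, dict) or task_prompt not in parsed_answer:
        raise KeyError(f"No entry for task {task_prompt!r} in parsed answer")
    caption = parsed_answer[task_prompt]
    if not isinstance(caption, str):
        raise TypeError("Expected the caption to be a string")
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(caption.split())


if __name__ == "__main__":
    # Hand-written stand-in for a parsed answer (not real model output).
    sample = {"<MORE_DETAILED_CAPTION>": "The image is a\nscreenshot of a website. "}
    print(extract_caption(sample, "<MORE_DETAILED_CAPTION>"))
```

Normalizing whitespace makes the caption easier to store or index as a single line of text.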
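The first time you run the script, the Florence-2 weights are downloaded to your system. The download may take several minutes depending on your internet connection speed. The weights are then cached on your device, so later runs load them from disk instead of downloading them again.
Given the example screenshot, Florence-2 returned the following description: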
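To see where the cached weights are likely to live, the sketch below resolves the Hugging Face Hub cache directory from the `HF_HUB_CACHE` and `HF_HOME` environment variables, falling back to `~/.cache/huggingface/hub`. This is a simplified approximation of the library's behavior (it ignores `XDG_CACHE_HOME`, for example), and the helper name `hf_hub_cache_dir` is illustrative.

```python
import os
from pathlib import Path


def hf_hub_cache_dir(env=None):
    """Approximate the directory where Hub downloads are cached.

    Simplified: checks HF_HUB_CACHE, then HF_HOME, then the usual default.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


if __name__ == "__main__":
    cache_dir = hf_hub_cache_dir()
    print("Expected cache directory:", cache_dir)
    print("Exists:", cache_dir.exists())
```

If the directory exists after the first run, the model files should appear in a subfolder named after the model repository.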
The image is a screenshot of the homepage of a website called Roboflow. The website has a purple and white color scheme with a navigation bar at the top. Below the navigation bar, there is a navigation menu with options such as Product, Solutions, Resources, Pricing, Docs, and Sign Up.
On the left side of the page, there are two tabs - "Fine-tune Florence-2 for Object Detection with Custom Data" and "How to Train YOLOV10 Model on a Custom Dataset". On the right side, the page has a title that reads "Roboflow" and a brief description of the website's features.
The main content of the webpage is divided into two sections. The first section is titled "Fine Tune Florence 2" and has an image of a computer screen with a graph and a line graph on it. The second section has a description of how to train YOLovi10 model on a custom dataset. The text below the title explains that the website offers a tutorial on how to improve the performance of an object detection system with custom data.
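The description above captures the overall content of the image. That said, the model makes mistakes when localizing elements. It says there are two tabs on the left side of the page, when in fact the two tabs span the full width of the top of the viewport. It also says the page has a description of the website's features, which is inaccurate.
With Florence-2, you can generate descriptions of website screenshots. These descriptions can be useful for building information retrieval applications. For example, you could build a system that lets you search the screenshots on your desktop using the generated captions. This guide covered how to use Florence-2 to generate descriptions of website screenshots: installing HuggingFace Transformers, downloading and initializing the Florence-2 model, and running the model on an example image.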
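The screenshot search idea mentioned above can be sketched with plain keyword matching. This is a toy illustration, not a production retrieval system: the function names, file names, and captions are all hypothetical, and a real application would likely use embeddings or a proper search index.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_captions(captions, query, top_k=5):
    """Rank screenshot paths by how many query words appear in their caption."""
    query_terms = tokenize(query)
    scored = []
    for path, caption in captions.items():
        overlap = len(query_terms & tokenize(caption))
        if overlap:
            scored.append((overlap, path))
    # Highest overlap first; ties are broken alphabetically by path.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:top_k]]


if __name__ == "__main__":
    # Hypothetical captions, as if generated for three saved screenshots.
    captions = {
        "a.png": "A screenshot of the Roboflow homepage with a purple navigation bar",
        "b.png": "A terminal window showing pip install output",
        "c.png": "Roboflow docs page",
    }
    print(search_captions(captions, "roboflow homepage"))
```

In practice, the `captions` dictionary would be filled by running the captioning function over each screenshot and storing the resulting text alongside the file path.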