Running Phi-3-Vision Inference Locally

Phi-3-vision-128k-instruct gives Phi-3 the ability to understand images as well as language. With it, we can solve a range of vision problems, such as OCR, table analysis, object recognition, and image description. Tasks that previously required extensive training on data can now be completed easily. The following are techniques and application scenarios for Phi-3-vision-128k-instruct.

0. Preparation

Make sure the following Python libraries are installed before you begin (Python 3.10+ is recommended):

pip install transformers -U
pip install datasets -U
pip install torch -U

CUDA 11.6+ is recommended, along with flash-attn:

pip install flash-attn --no-build-isolation
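Before loading the model, it can help to confirm that the packages installed above are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages and the two lists are illustrative and not part of the original guide. Note that flash-attn is imported under the name flash_attn.

```python
import importlib.util

# Packages installed in this section. "flash_attn" is the import name of flash-attn.
REQUIRED = ["transformers", "datasets", "torch"]
OPTIONAL = ["flash_attn"]


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report what is missing without importing the (large) packages themselves.
print("Missing required packages:", missing_packages(REQUIRED))
print("Missing optional packages:", missing_packages(OPTIONAL))
```

An empty list means every package in that group was found.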

Create a new Notebook. To run the examples, it is recommended that you first create the following content.

from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

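# Load the weights in bfloat16 to reduce GPU memory use
kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16

# trust_remote_code=True is required because the model ships its own processing and modeling code
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, **kwargs).cuda()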

user_prompt = '<|user|>\n'
assistant_prompt = '<|assistant|>\n'
prompt_suffix = "<|end|>\n"

1. Image analysis with Phi-3-Vision

We want the AI to analyze the content of our images and provide relevant descriptions.

prompt = f"{user_prompt}<|image_1|>\nCould you please introduce this stock to me?{prompt_suffix}{assistant_prompt}"


url = "https://g.foolcdn.com/editorial/images/767633/nvidiadatacenterrevenuefy2017tofy2024.png"

image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, 
                                  skip_special_tokens=True, 
                                  clean_up_tokenization_spaces=False)[0]

Running the script above in the Notebook leaves the model's answer in the response variable. For this chart, the answer was:

Certainly! Nvidia Corporation is a global leader in advanced computing and artificial intelligence (AI). The company designs and develops graphics processing units (GPUs), which are specialized hardware accelerators used to process and render images and video. Nvidia's GPUs are widely used in professional visualization, data centers, and gaming. The company also provides software and services to enhance the capabilities of its GPUs. Nvidia's innovative technologies have applications in various industries, including automotive, healthcare, and entertainment. The company's stock is publicly traded and can be found on major stock exchanges.
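The script above slices generate_ids with inputs['input_ids'].shape[1] before decoding. The reason is that, for a causal language model, generate returns the prompt tokens followed by the newly generated tokens, so the prompt has to be cut off to keep only the answer. The sketch below illustrates the same slice on plain Python lists; the token ids are made up and the helper name strip_prompt is hypothetical.

```python
def strip_prompt(generated, prompt_len):
    """Drop the first `prompt_len` ids from every sequence in the batch."""
    return [sequence[prompt_len:] for sequence in generated]


# Made-up ids: one prompt of four tokens, then three newly generated tokens.
prompt_ids = [[101, 7, 8, 9]]             # stands in for inputs['input_ids']
generated = [[101, 7, 8, 9, 42, 43, 2]]   # stands in for the output of model.generate

new_ids = strip_prompt(generated, len(prompt_ids[0]))
print(new_ids)  # [[42, 43, 2]]
```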

2. OCR with Phi-3-Vision

Besides analyzing an image, we can also extract information from it. This is OCR, which previously required writing complex code.

prompt = f"{user_prompt}<|image_1|>\nHelp me get the title and author information of this book?{prompt_suffix}{assistant_prompt}"

url = "https://marketplace.canva.com/EAFPHUaBrFc/1/0/1003w/canva-black-and-white-modern-alone-story-book-cover-QHBKwQnsgzs.jpg"

image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, 
                                  skip_special_tokens=True, 
                                  clean_up_tokenization_spaces=False)[0]

The result is

The title of the book is "ALONE" and the author is Morgan Maxwell.
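If the extracted fields are needed as structured data, the free-text answer can be post-processed. The following is a rough sketch that assumes the model phrases its answer exactly like the sample above; the model's wording is not guaranteed, so the helper (hypothetically named parse_book_info) returns None when the pattern does not match.

```python
import re

# Assumed phrasing, copied from the sample response; other responses may differ.
PATTERN = re.compile(
    r'title of the book is "(?P<title>[^"]+)" and the author is (?P<author>[^.]+)\.'
)


def parse_book_info(response):
    """Return {'title': ..., 'author': ...}, or None if the phrasing differs."""
    match = PATTERN.search(response)
    if match is None:
        return None
    return {"title": match.group("title"), "author": match.group("author").strip()}


sample = 'The title of the book is "ALONE" and the author is Morgan Maxwell.'
print(parse_book_info(sample))  # {'title': 'ALONE', 'author': 'Morgan Maxwell'}
```

For more dependable results, one option is to ask the model in the prompt to answer in a fixed format and parse that instead.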

3. Comparing multiple images

Phi-3 Vision supports comparing multiple images. We can use the model to find the differences between images.

prompt = f"{user_prompt}<|image_1|>\n<|image_2|>\n What is difference in this two images?{prompt_suffix}{assistant_prompt}"

print(f">>> Prompt\n{prompt}")

url = "https://hinhnen.ibongda.net/upload/wallpaper/doi-bong/2012/11/22/arsenal-wallpaper-free.jpg"

image_1 = Image.open(requests.get(url, stream=True).raw)

url = "https://assets-webp.khelnow.com/d7293de2fa93b29528da214253f1d8d0/news/uploads/2021/07/Arsenal-1024x576.jpg.webp"

image_2 = Image.open(requests.get(url, stream=True).raw)

images = [image_1, image_2]

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs, 
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

The result is

The first image shows a group of soccer players from the Arsenal Football Club posing for a team photo with their trophies, while the second image shows a group of soccer players from the Arsenal Football Club celebrating a victory with a large crowd of fans in the background. The difference between the two images is the context in which the photos were taken, with the first image focusing on the team and their trophies, and the second image capturing a moment of celebration and victory.
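The prompts in the three examples share one layout: the user marker, one numbered placeholder per image, the question, the end marker, and the assistant marker. A small helper can assemble that layout for any number of images. This is a sketch based only on the prompt strings shown in this guide; the function name build_prompt is hypothetical.

```python
USER = "<|user|>\n"
ASSISTANT = "<|assistant|>\n"
END = "<|end|>\n"


def build_prompt(question, num_images=1):
    """Build a single-turn prompt with placeholders <|image_1|> ... <|image_N|>."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    # Placeholders are numbered from 1, matching the order of the images list.
    placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))
    return f"{USER}{placeholders}{question}{END}{ASSISTANT}"


print(build_prompt("What is the difference between these two images?", num_images=2))
```

The images passed to the processor should be in the same order as the numbered placeholders.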

Disclaimer:
This document was translated using AI-based machine translation services. Although we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
