Python: SyntaxError: invalid character '·' (U+00B7) on OCR pytesseract

4 replies to this topic

#1
me_suzy

    Member

  • Group: Members
  • Posts: 736
  • Joined: 29.04.2007
The error is the following:

Traceback (most recent call last):
  File "D:\OCR.py", line 3, in <module>
    import pytesseract
  File "D:\pytesseract.py", line 70
    <title>pytesseract/pytesseract.py at master · madmaze/pytesseract</title>
                                                ^
SyntaxError: invalid character '·' (U+00B7)



This is the code. It uses OCR to convert PDF files to txt with Python and the pytesseract library.

import os
import PyPDF2
import pytesseract
from PIL import Image
# Path to the folder containing PDF files
input_folder = "d:/doc/doc"
# Path to the folder where text files will be saved
output_folder = "d:/doc/doc"
# Path to the Tesseract OCR executable (change if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Get a list of all PDF files in the input folder
files = [f for f in os.listdir(input_folder) if f.endswith(".pdf")]
# Loop through each PDF file and convert it to text using OCR
for file in files:
    pdf_path = os.path.join(input_folder, file)
    txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")
    # Extract images from PDF and perform OCR on each image
    images = []
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        for page_num in range(pdf_reader.numPages):
            page = pdf_reader.getPage(page_num)
            image = page.extract_images()[0]["obj"]
            images.append(Image.frombytes("RGB", image.size, image.data))
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)
    # Save the extracted text to a text file
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)
print("Conversion complete!")


Edited by me_suzy, 18 June 2023 - 11:41.


#2
victor29cr

    Senior Member

  • Group: Senior Members
  • Posts: 2,438
  • Joined: 04.06.2016
You copy-pasted the code from a website and it got saved as HTML;
<title> has no business being in your program.

Go and get the file again, and look for a raw view of the code or a copy-to-clipboard button.

L.E. (later edit): I see it is on PyPI; better to get it with pip install pytesseract

Edited by MarianG, 18 June 2023 - 13:31.
no need to quote the whole post
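
A small check along the lines of the reply above. The traceback shows Python importing D:\pytesseract.py, a file sitting next to D:\OCR.py, so that saved web page shadows any copy installed with pip until it is deleted or renamed. The sketch below reports which file a module name resolves to and whether that file looks like HTML. The helper names `module_origin` and `looks_like_html` are made up for this illustration, and the HTML test is only a heuristic on the first bytes of the file.

```python
import importlib.util
from pathlib import Path


def module_origin(name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


def looks_like_html(path, probe_bytes=2048):
    """Heuristic: True if the file starts with an HTML doctype or <html> tag."""
    head = Path(path).read_bytes()[:probe_bytes].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))


if __name__ == "__main__":
    origin = module_origin("pytesseract")
    print("pytesseract resolves to:", origin)
    if origin and origin.endswith(".py") and looks_like_html(origin):
        print("That file is a saved web page, not Python source: delete or rename it.")
```

If the reported path is the stray file next to the script rather than a site-packages folder, removing that file and running pip install pytesseract should let the import pick up the real package.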


#3
scub

    Senior Member

  • Group: Senior Members
  • Posts: 3,318
  • Joined: 30.11.2005
import os
import PyPDF2
import pytesseract
from PIL import Image

# Path to the folder containing PDF files
input_folder = "d:/doc/doc"

# Path to the folder where text files will be saved
output_folder = "d:/doc/doc"

# Path to the Tesseract OCR executable (change if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Get a list of all PDF files in the input folder
files = [f for f in os.listdir(input_folder) if f.endswith(".pdf")]

# Loop through each PDF file and convert it to text using OCR
for file in files:
    pdf_path = os.path.join(input_folder, file)
    txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")

    # Extract images from PDF and perform OCR on each image
    images = []
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        for page_num in range(pdf_reader.numPages):
            page = pdf_reader.getPage(page_num)
            image = page.extract_images()[0]["obj"]
            images.append(Image.frombytes("RGB", image.size, image.data))

    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)

    # Save the extracted text to a text file
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)

print("Conversion complete!")



#4
scub

    Senior Member

  • Group: Senior Members
  • Posts: 3,318
  • Joined: 30.11.2005
Let's wait for someone who actually practices programming to come along; unfortunately I don't have solid knowledge ... I just let myself get "carried away" by this AI trend.
But you could have it regenerate several variants: tell it that something works badly and it will try to correct it. Then you test.

Edited by scub, 18 June 2023 - 16:08.


#5
me_suzy

    Member

  • Group: Members
  • Posts: 736
  • Joined: 29.04.2007
I fixed the code; it works, I tested it just now!

import os
import pytesseract
from pdf2image import convert_from_path
# Path to the folder containing PDF files
input_folder = "d:/doc/doc"
# Path to the folder where text files will be saved
output_folder = "d:/doc/doc"
# Path to the Tesseract OCR executable (change if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Get a list of all PDF files in the input folder
files = [f for f in os.listdir(input_folder) if f.endswith(".pdf")]
# Loop through each PDF file and convert it to text using OCR
for file in files:
	pdf_path = os.path.join(input_folder, file)
	txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")
	# Convert PDF pages to images
	images = convert_from_path(pdf_path)
	# Perform OCR on images and extract text
	text = ""
	for image in images:
		# text += pytesseract.image_to_string(image)
		text += pytesseract.image_to_string(image, lang='ron')
	# Save the extracted text to a text file
	with open(txt_path, "w", encoding="utf-8") as txt_file:
		txt_file.write(text)
print("Conversion complete!")
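
The working version above rasterizes each PDF page with pdf2image instead of pulling embedded images out through PyPDF2. A minimal sketch of the same loop as a reusable function follows, assuming nothing beyond what the post uses: the function name `ocr_folder` and its two callable parameters are hypothetical, introduced so the loop can be tried without Tesseract installed. Unlike the post, it also matches the .pdf extension case-insensitively. Note that pdf2image needs Poppler available on the system, and lang="ron" needs the Romanian language data installed for Tesseract.

```python
from pathlib import Path


def ocr_folder(input_dir, output_dir, pdf_to_images, image_to_text):
    """OCR every PDF in input_dir into a same-named .txt file in output_dir.

    pdf_to_images(path) must return an iterable of page images;
    image_to_text(image) must return the recognized text as a str.
    Returns the list of text files written.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(input_dir).iterdir()):
        # Skip folders and anything that is not a PDF (extension check ignores case).
        if not pdf.is_file() or pdf.suffix.lower() != ".pdf":
            continue
        pages = [image_to_text(img) for img in pdf_to_images(str(pdf))]
        txt = out / (pdf.stem + ".txt")
        txt.write_text("".join(pages), encoding="utf-8")
        written.append(txt)
    return written


def main():
    # Real backends, imported here so the function above stays usable without them.
    import pytesseract
    from pdf2image import convert_from_path

    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    ocr_folder(
        "d:/doc/doc",
        "d:/doc/doc",
        pdf_to_images=convert_from_path,
        image_to_text=lambda img: pytesseract.image_to_string(img, lang="ron"),
    )
    print("Conversion complete!")
```

Calling main() runs the conversion with the same folders and Tesseract path as in the post; passing other callables to ocr_folder makes it easy to swap the OCR language or the page renderer.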


