One of my development teams was looking for a PDF parsing library. They essentially wanted to search PDF files and extract data from them. At first, I thought that OCR was the only way to achieve this, but for PDFs that contain real text (as opposed to scanned page images) there are libraries available to help us :)
PDFBox: This seems to be the most popular library for extracting text from PDF files. It is a Java library, but it can also be used from .NET through a build produced with IKVM.NET.
Simple examples using this library can be found here and here.
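To give a rough idea of what the "search and extract" use case might look like with PDFBox, here is a minimal sketch. It assumes PDFBox 2.x is on the classpath (package names and the loading call differ in the 1.x and 3.x releases). The class name PdfBoxSearch and the helper findMatches are made up for illustration and are not part of PDFBox.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical example class; assumes PDFBox 2.x.
public class PdfBoxSearch {

    // Loads the PDF and returns all of its text as one string.
    static String extractText(File pdf) throws IOException {
        try (PDDocument document = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(document);
        }
    }

    // Illustrative helper: returns every substring of text that matches regex.
    static List<String> findMatches(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: PdfBoxSearch <file.pdf> <regex>");
            return;
        }
        String text = extractText(new File(args[0]));
        for (String hit : findMatches(text, args[1])) {
            System.out.println(hit);
        }
    }
}
```

Running it with a file name and a pattern such as \d{5} would print every five-digit number found in the document's text. Scanned PDFs that contain only page images will yield little or no text this way.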
iText & iTextSharp: iText is a Java library and iTextSharp is its .NET port. Both are very popular for PDF generation and can also be used for extracting text from PDF files. A sample can be found here.
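For comparison, a similar sketch with iText might look like the following. It assumes iText 5.x (the com.itextpdf packages) on the classpath; other iText versions use different packages and classes. The class name ITextExtract is made up for illustration.

```java
import java.io.IOException;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

// Hypothetical example class; assumes iText 5.x.
public class ITextExtract {

    // Reads the PDF page by page and concatenates the extracted text.
    static String extractText(String path) throws IOException {
        PdfReader reader = new PdfReader(path);
        try {
            StringBuilder text = new StringBuilder();
            // iText page numbers start at 1.
            for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                text.append(PdfTextExtractor.getTextFromPage(reader, page));
                text.append('\n');
            }
            return text.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("Usage: ITextExtract <file.pdf>");
            return;
        }
        System.out.print(extractText(args[0]));
    }
}
```

Extracting page by page like this makes it easy to record which page a piece of data came from, which can be handy when searching large documents.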
I have heard that OpenOffice.org also provides a Java API that can be used to create and manipulate PDF files, but I have not tried it yet.
Monday, May 10, 2010