Smart extract from PDF to Excel

Posted on 6/19/17 at 2:55 pm
Posted by castorinho
13623 posts
Member since Nov 2010
82033 posts
Ok so I have multiple PDF files (scans) that someone enters manually into an Excel spreadsheet, and it's time consuming.

Is there a way to extract the data into Excel (after OCR, of course)? What kind of scripts would I need to run?

The data in the original file is "disorganized" and not in a table format, and I would need to only extract certain data.

For example, let's say the original file looks like this, and the Excel file looks like the one below; that would be the end product.

shite sounds impossible (or at the very least like a lot of work) to me, but I figured I'd ask.


Posted by LSUtigerME
Walker, LA
Member since Oct 2012
3797 posts
Posted on 6/19/17 at 3:58 pm to
How clean is the OCR extraction? If it's okay and consistent, you can search through the file and categorize the data by tags or triggers.

For example, if the OCR always has the text "Date" followed by an accurate date value, you can scan the file for "Date" and then write the following data.
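
A rough sketch of that trigger idea in Python, assuming the OCR text is already saved as plain text and that the value sits on the same line as its label. The function name `value_after` and the sample text are made up for illustration, not taken from the actual scans.

```python
def value_after(text, label):
    """Return whatever follows the first occurrence of `label` on its line, or None."""
    for line in text.splitlines():
        idx = line.find(label)
        if idx != -1:
            # Keep everything after the label, trimming separators such as ":".
            return line[idx + len(label):].strip(" :\t")
    return None


# Hypothetical OCR output, only to show the call.
sample = "Invoice 12\nDate: 6/19/17\nTotal 40.00"
print(value_after(sample, "Date"))
```

If the value tends to land on the next line instead, the same loop could look at the following line; that depends on how the OCR lays out each scan.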
Posted by castorinho
13623 posts
Member since Nov 2010
82033 posts
Posted on 6/19/17 at 4:19 pm to
I just played with one earlier and the OCR is pretty clean.
quote:

you can scan the file for "Date" and then write the following data.


How do I link it to the excel file?
This post was edited on 6/19/17 at 4:23 pm
Posted by castorinho
13623 posts
Member since Nov 2010
82033 posts
Posted on 6/20/17 at 9:28 am to
Bump
Posted by Scream4LSU
Member since Sep 2007
989 posts
Posted on 6/22/17 at 3:48 pm to
The OCR result is simply text, so the next step would be to save that and parse the "Factor" values by locating them with something like a regex and then grabbing the values to the right. This could then be written to a .csv or .xls file. There is no utility you can buy to do exactly what you are looking for; it would need to be custom coded.
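
Something along these lines, as a minimal Python sketch using only the standard `re` and `csv` modules. It assumes each "Factor" label is followed by a plain number, optionally after ":" or "="; the file names and sample text are placeholders, and the real pattern would need tuning against the actual OCR output.

```python
import csv
import io
import re

# Assumed layout: the word "Factor", an optional ":" or "=", then a number.
FACTOR_RE = re.compile(r"Factor\s*[:=]?\s*(-?\d+(?:\.\d+)?)")


def extract_factors(ocr_text):
    """Return every number that appears right after a "Factor" label."""
    return FACTOR_RE.findall(ocr_text)


def write_rows(rows, out):
    """Write (source, factor) pairs as CSV to an open text file or buffer."""
    writer = csv.writer(out)
    writer.writerow(["source", "factor"])
    writer.writerows(rows)


# Placeholder OCR text keyed by a made-up file name.
sample_pages = {
    "scan1.txt": "Name X Factor: 1.25 misc",
    "scan2.txt": "Factor 3 other text",
}
rows = [(name, value)
        for name, text in sample_pages.items()
        for value in extract_factors(text)]

buffer = io.StringIO()  # swap for open("out.csv", "w", newline="") to get a real file
write_rows(rows, buffer)
print(buffer.getvalue())
```

Excel opens the resulting .csv directly, which is the simplest way to "link" the extracted values to a spreadsheet.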
Posted by Big Data
Scotch Fan
Member since Nov 2007
2553 posts
Posted on 6/22/17 at 3:57 pm to
Posted by Scream4LSU
Member since Sep 2007
989 posts
Posted on 6/22/17 at 4:04 pm to
quote:

A-PDF Data Extractor: LINK

Not gonna work. Everything would have to be exactly in the same spot on every instance.
This post was edited on 6/22/17 at 4:05 pm
Posted by LSUtigerME
Walker, LA
Member since Oct 2012
3797 posts
Posted on 6/22/17 at 9:21 pm to
If you save the OCR as a Word document or .txt, you should be able to use Excel VBA and write a macro to search the text.
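
The same search could also be done outside Excel. Below is a Python sketch (not VBA) that assumes every scan's OCR output is saved as a .txt file in one folder and that each field is a label followed by its value. The labels in `FIELDS` and the function names are hypothetical; swap in whatever labels the scans actually use.

```python
import csv
import re
from pathlib import Path

# Hypothetical field labels; replace with the labels that appear in the scans.
FIELDS = ["Date", "Factor"]


def row_from_text(text):
    """Build one row: for each label, grab the first token that follows it."""
    row = {}
    for field in FIELDS:
        match = re.search(re.escape(field) + r"\s*[:=]?\s*(\S+)", text)
        row[field] = match.group(1) if match else ""  # blank when the label is missing
    return row


def folder_to_csv(folder, out_path):
    """Write one CSV row per .txt file found in `folder`."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file"] + FIELDS)
        writer.writeheader()
        for path in sorted(Path(folder).glob("*.txt")):
            row = row_from_text(path.read_text(encoding="utf-8", errors="replace"))
            row["file"] = path.name
            writer.writerow(row)
```

Calling `folder_to_csv("ocr_output", "extracted.csv")` (both names are placeholders) would give one spreadsheet row per scan, with blanks wherever a label was not found so those files can be checked by hand.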
Posted by castorinho
13623 posts
Member since Nov 2010
82033 posts
Posted on 6/23/17 at 8:36 am to
quote:

If you save the OCR as a Word document or .txt, you should be able to use Excel VBA and write a macro to search the text.

Looks like this is the best option, but in the end it might be just as time consuming as entering it manually.

Thanks for all the replies