Converting Scanned .pdf Documents to Excel

You have probably had a sheet of paper with lots of numbers that you needed in a spreadsheet.

The long way- hand-type the numbers into a new spreadsheet. Yuck!

The better way- scan the document (.pdf) and use OCR (optical character recognition) to pull out the numbers.

In my example, I have a Statement of Revenues, Expenses and Changes in Net Position from an audit. But I only have a paper copy.

Step 1- Scan the document. I use a copier with scanning capabilities.

Ideally, you want a copy that’s clean and straight. This will be important for the next step.

Step 2- Use an OCR conversion website. I prefer (Free Online OCR – convert scanned PDF and images to Word, JPEG to Word).

Step 3- Download the results in Excel:

Almost perfect! There is a little clean-up at this point, but it’s in pretty good shape.  Look at row 19. There are two lines in one cell. That’s easy to fix and beats hand typing.

Note that I took a .pdf and converted it to Excel, but there are many other combinations. Your original could be a .jpeg, for example, and your output could be Word, Excel, or even .txt.


Converting text to numbers (easily) in bulk

On many occasions, I export reports to Excel. Maybe you do too. Depending on the platform, I get “numbers” that are actually text. You can’t total them or really do any work with them in this text form.

Here is a small piece of a much larger document:

The “$” is actually part of the text. We need to remove it. If you only have a few numbers, the tedious way is to go into each cell and edit it. Excel will then recognize the values as numbers and format them that way.

But what if you have hundreds? There are two ways that work in this instance. One, press Ctrl+H and replace the “$” with nothing (leave the “Replace with” box empty).

Excel will then convert the fields to numbers.
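
If you ever clean an export outside of Excel, the same search-and-replace idea can be sketched in Python. This is only an illustration under assumptions: the function name text_to_number and the sample values are made up for the example, and it also strips thousands separators, which Excel handles on its own.

```python
from decimal import Decimal


def text_to_number(cell: str) -> Decimal:
    """Strip the "$" and thousands separators, then convert the text to a number."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned)


# Hypothetical sample of exported "numbers" that are really text.
exported = ["$1,250.00", "$75.10", "$300"]
amounts = [text_to_number(cell) for cell in exported]
total = sum(amounts)  # totals work once the values are real numbers
print(total)
```

Decimal is used here instead of float so that currency amounts are not subject to binary rounding.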


Two, use Paste Special. Off to the side, type the number 1. Then format that number however you like. I like currency. Now copy the cell with the number 1 in it.

Here is where the magic happens.

You want to:

  1. Select the area with the text numbers.
  2. Open Paste Special.
  3. Under Operation, choose Multiply and click OK.

Now you have numbers instead of text.

Breaking apart account strings – Formula addition

Using the same example from Text to Columns, we will instead use formulas to break out the different segments of the account string.

Assume our string is in cell A2 and the amount is in B2. We will put our first formula in C2.

The formula we will use is “=MID(A2,1,3)”. The function takes three arguments. First is the text (A2). Next is the starting position, which is 1. Last is the number of characters. We want three characters because that’s how long our first segment is.

The result is “100”, which is our first segment. We could do this again to get the next segment. In cell D2, enter “=MID(A2,5,6)”. The difference is that we are starting at the fifth position and selecting six characters.

The result is “401000”- our second segment. We could do this for the remaining three parts of the account string. So, why would we choose to do this instead of Text to Columns? Well, if your data is static, Text to Columns is probably the best choice. But you may be working with a query from a database that can be refreshed. In that case, formulas are the better route because they recalculate when the data changes.
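
To make the position arithmetic concrete, here is a small Python sketch of what MID does. It is an illustration only: the helper name mid is hypothetical, and everything in the sample account string after the first two segments is invented.

```python
def mid(text: str, start: int, length: int) -> str:
    """Mimic Excel's MID: start is 1-based, length is a character count."""
    return text[start - 1:start - 1 + length]


# Hypothetical account string; only the first two segments come from the example above.
account = "100-401000-000-00-0000"

first_segment = mid(account, 1, 3)   # same idea as =MID(A2,1,3)
second_segment = mid(account, 5, 6)  # same idea as =MID(A2,5,6)
print(first_segment, second_segment)
```

The second call starts at position 5 because positions 1 through 3 hold the first segment and position 4 holds the “-”.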


Converting formulas to values

In many cases when working with your general ledger, it becomes necessary to move around data that has formulas attached. Sometimes, those formulas get broken in the process. To “lock in” the formulas’ values, they must be hard-coded. An easy way to do this is the copy/paste special feature in Excel.

To do this, select the cells with the formula

  • Right click
  • Copy
  • Paste special > Values

Just that easy!

And an even faster way of doing this is to create a macro in Visual Basic.

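Sub Copy_PasteSpecial()
 ' Copy the selected cells, then paste them back over themselves as values only.
 Selection.Copy
 Selection.PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks _
 :=False, Transpose:=False
 ' Clear the clipboard so the copy marquee disappears.
 Application.CutCopyMode = False
End Sub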

Deleting rows with no values or unwanted data

When exporting out a transaction detail or summary from the general ledger, no matter the reporting system, there are usually lots and lots of extra lines that need to be removed.


The first step is to find a field in your data that you can sort on. In my example, I use the JENumber (journal entry #). I know that sorting on this field will put the total lines at the top and at the bottom. The raw transaction detail I need is in the middle.

I delete the yellow lines and I am left with only the transaction details. At this point, I could do a pivot table or subtotals.
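
The sort-then-delete idea can be sketched in Python as well. Treat this as a rough illustration: the rows are invented, and it assumes that real transactions always carry a purely numeric JENumber while total and blank lines do not.

```python
# Hypothetical export: total and blank lines have no usable JENumber.
rows = [
    {"JENumber": "", "Description": "Report Total", "Amount": "350.00"},
    {"JENumber": "1002", "Description": "Office supplies", "Amount": "100.00"},
    {"JENumber": "", "Description": "", "Amount": ""},
    {"JENumber": "1001", "Description": "Postage", "Amount": "250.00"},
]

# Sorting on JENumber groups the non-transaction lines together...
rows_sorted = sorted(rows, key=lambda row: row["JENumber"])

# ...and keeping only rows with a numeric JENumber drops them.
detail = [row for row in rows_sorted if row["JENumber"].isdigit()]

for row in detail:
    print(row["JENumber"], row["Description"], row["Amount"])
```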

Pivot table – Data in monthly columns

Pivot tables are an amazing tool for understanding large amounts of data…like transaction or summary data from your general ledger.

In this example, we use general ledger data (account number, description, amount, and month). Here is a sample of the data:

Create our pivot table

  • Insert tab> Pivot table
  • Select columns A:D. Your data may instead be a specific range of rows and columns, like A1:D100.
  • Right click inside the blank pivot table > Pivot table options
  • Display tab > Classic PivotTable layout (my preference, because it gives you drag and drop).
  • Now you are able to drag and drop the fields into the blank table.
  • Drag “Major” to the far left.
  • Drag “Description” to the right of Major.

  • This adds a total line. We want to remove this. Double click on “Major”.
  • Subtotals & Filters > Check “None”.
  • Now drag “Amount” into the columns field.
  • If it defaults to “Count”, then double-click “Count of Amount”.
  • Change to Sum. From here you can fix the formatting to currency.
  • To get the months, drag the “Month” field to the top of “Total”. Highlighted here in yellow.

After you drop the “Month”, it should look like this:

The finished product. In doing this, you might get rows and columns listed as “Blank”. To remove these, just use the drop down and un-select “Blank”.
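
For anyone curious what the pivot table is doing behind the scenes, here is a minimal Python sketch of the same grouping: one row per Major and Description, one column per month, with amounts summed. The ledger lines are invented for the example, and this is not how Excel itself implements pivot tables.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical general ledger lines: (major, description, amount, month).
ledger = [
    ("401000", "Tuition", "500.00", 1),
    ("401000", "Tuition", "250.00", 2),
    ("501000", "Salaries", "-300.00", 1),
    ("401000", "Tuition", "100.00", 1),
]

# pivot[(major, description)][month] holds the summed amount.
pivot = defaultdict(lambda: defaultdict(Decimal))
for major, description, amount, month in ledger:
    pivot[(major, description)][month] += Decimal(amount)

# One output line per row label, one value per month column.
months = sorted({month for _, _, _, month in ledger})
for (major, description), by_month in sorted(pivot.items()):
    cells = [str(by_month.get(month, Decimal("0"))) for month in months]
    print(major, description, *cells)
```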



Convert transaction posting date to month

When your transaction data comes out of the general ledger it may be in a pretty raw format. You might get something called a posting date (in a format like 06/30/2016). Let’s say you want to work with the data in a pivot table where you have a column for month.

The month can be easily pulled out with this formula.

Say 06/30/2016 is in cell A1. In B1, enter “=MONTH(A1)”. The result is 6. The DAY and YEAR functions work the same way.

Date        Formula       Result
6/30/2016   =MONTH(A1)    6
7/1/2016    =DAY(A2)      1
7/2/2016    =YEAR(A3)     2016
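
The same three lookups can be sketched in Python with the standard datetime module. This assumes the posting dates arrive as text in MM/DD/YYYY form; the variable names are made up for the example.

```python
from datetime import datetime

# Posting dates as text in the MM/DD/YYYY format used in the example above.
posting_dates = ["06/30/2016", "07/01/2016", "07/02/2016"]

parsed = [datetime.strptime(text, "%m/%d/%Y") for text in posting_dates]

month = parsed[0].month  # like =MONTH(A1)
day = parsed[1].day      # like =DAY(A2)
year = parsed[2].year    # like =YEAR(A3)
print(month, day, year)
```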


Converting GL transaction data – populating lines with account numbers

Let’s say you export out transaction detail from the general ledger. You may want to work with it in a pivot table. But the individual lines are missing the account numbers. They might be above or below the data. Here is a trick to populate the lines.

Here is a sample of our data. Highlighted in yellow is where we want our account numbers.

In cell A3 type “=A2”. Now we want to copy this formula, but only into the yellow cells, not on top of the remaining header account numbers.

  • Copy cell A3.
  • Select the range where you want to paste.
  • Press F5 to launch the GoTo box.
  • Click “Special…”
  • Click “Blanks”
  • Now you have selected just the blank cells.
  • Ctrl V (or paste from the menu).

I’ve changed the area from yellow to blue to show how it should look with account numbers populated.
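
The fill-down trick amounts to “every blank takes the value from the row above.” Here is a minimal Python sketch of that logic, using an invented column of account numbers:

```python
# Hypothetical column A: account numbers appear only on header lines,
# and the detail lines beneath them are blank.
column_a = ["100-401000", "", "", "100-501000", "", ""]

filled = []
last_seen = ""
for cell in column_a:
    if cell:  # a header line: remember its account number
        last_seen = cell
    filled.append(last_seen)  # blanks take the value from the row above

print(filled)
```

Because each blank looks only at the row above it, the header account numbers themselves are never overwritten, which is the same reason the Blanks selection matters in Excel.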

Text to Columns

In our sample general ledger detail, we get the full account number string. But we want to work with the individual components of the string (fund, major, etc.). A useful tool for this is Text to Columns.

Our example shows the full string account number and the amount.

In Excel, highlight just the three cells containing the string. Then go to the “Data” tab and click the Text to Columns button. Choose the “Delimited” radio button because we want to use the “-” character to break apart our string.

Check “Other” and put a “-” in the box next to it.

Then Next>

At this point, you can choose which new columns to keep and where to put them. I entered $D$3 as the Destination. That way the new data lands beside my original list.

I have added the labels above our new columns.

Note- I could have chosen “Text”, instead of “General”. This would have kept the formatting of “001” vs 1.
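
The difference between “Text” and “General” is easy to see in a small Python sketch. The account string here is invented, and converting with int is only a rough stand-in for what the General format does to a segment like “001”.

```python
# Hypothetical account string; the segment values are made up.
account = "001-401000-000-00-0000"

# Splitting on "-" mirrors the Delimited option with "-" in the Other box.
as_text = account.split("-")

# Converting segments to numbers is roughly what "General" does: leading zeros are lost.
as_general = [int(segment) for segment in as_text]

print(as_text[0])     # the text keeps its leading zeros
print(as_general[0])  # the number does not
```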