Excel / LibreOffice Calc – text and data processing

Excel / LibreOffice Calc – text and data processing

I’ll show how to parse text, extract numbers and calculate averages on an example of performance data. I have performed a dozen of performance scenarios, each repeated four times to get an average. Every sample contained three values: time in seconds of execution of the first step, time in seconds of execution of the second step, and memory consumption in MB. For instance, 17.6 | 7.4 | 687 meant:

  • step 1 took 17.6 seconds
  • step 2 took 7.4 seconds
  • the application used 687 MB of memory

I pasted the results of four runs into four consecutive cells for every scenario:

How to extract the numbers from those cells to calculate averages?

Parse the title

Let’s first change the title – it is an output from a git log --oneline command, but all that’s important in this summary is the title alone without the “pick” and hash texts. For example, instead of “pick 4e28481 Performance: Added performance timers” we want just “Performance: Added performance timers”.

Apparently, this is a simple task, because the text to be removed has always the same length: 4 characters of “pick”, 1 for space, 7 characters of the hash and again 1 for space; 13 characters in total. We have to use a function that will take a fragment of a string: MID. It accepts 3 parameters:

  • the source string, it will be A2
  • from which character the fragment should start: it will be 14 (ignore the first 13 characters and start from the 14th)
  • the length of the fragment: unfortunately we can’t just ask for all text until the end without providing the length. It will be equal to the length of the original text minus the 13 characters we want to be removed:
    • LEN(A2) - 13
    • the general formula is: LEN(A2) - starting position + 1. In our case, it is LEN(A2) - 14 + 1
    • sometimes it’s possible to simply provide any larger number, e.g. LEN(A2) or 100, but it’s more confusing

I put that formula into the first available cell in the second row: F2. Then I used the fill feature to copy the formula onto other rows:

The last step is overwriting the column A with values from column F. Note that it’s not enough to copy the column F onto A, because there are formulas in the former column – moreover, they are dependent on the latter column. Doing so will result in an #REF! error.

The correct way is: select the F column by clicking the column header, right-click the A column and select Paste Special, then Values or Text:

Finally, I dropped the F column by right-clicking it and selecting Delete.

Extract from a single cell

Let’s first parse one cell: B2. It contains the following string: 17.6 | 7.4 | 687. How to extract the three values from it? One of the simplest methods is treating the | characters as delimiters. First, we need to find positions of the two | characters, and then copy the text between them using simple functions:

Information

Excel uses , instead of ; for separating functions’ parameters for the US locale.

  • I store the position of the first | character in the cell F2: =FIND("|"; B2)
  • I store the position of the second | character in the cell G2. To find a second or next the same character or substring, just start looking after the position of the previous occurrence (the last parameter): =FIND("|"; B2; F2+1)
  • Let’s do some math now:
  • I store the first value (17.6) in the cell H2. Because it starts from the beginning of the string, it’s best to use the LEFT function: =LEFT(B2; F2-2). Note that the result is a string, not a number, so it cannot be used i.a. to calculate the average. To do so, we have to convert string to a number, e.g. by using the NUMBERVALUE function. Its first parameter is the string to be converted to a number, and the second parameter is the decimal point character. Finally, the cell H2 should contain this function: =NUMBERVALUE(LEFT(B2; F2-2); ".")
  • I store the second value (7.4) in the cell I2. Because it starts in the middle of the string, it’s best to use the MID function: =MID(B2; F2+2; G2-F2-3). Similarly, it should be converted to a number, so the final formula is: =NUMBERVALUE(MID(B2; F2+2; G2-F2-3); ".")
  • I store the last value (687) in the cell J2. Because it ends with the string, I can use the RIGHT function: =RIGHT(B2; LEN(B2)-G2-1). Again, the formula after conversion is: =NUMBERVALUE(RIGHT(B2; LEN(B2)-G2-1); ".")

Now we have all values extracted from a single cell:

Extract from other cells in a row

What should be done in order to extract the second cell, C2? We could copy the same formulas as in the cells F2..J2, but we would have to edit every one to change the reference from B2 to C2. It’s not extremely tedious, but there are two more cells to be parsed. It’s worth learning an other way, which will allow you to copy cells faster.

The INDIRECT function

The key is to store a part of the reference to the cell that is being processed. In my case, that cell was B2 and it occurred in every function. Copying the functions would be faster if I extracted the reference B2 or its part to a separate cell.

Let’s add a column before the F column and put the value B into the cell F2.

The function in G2 (formerly F2) is =FIND("|"; B2), and it should be =FIND("|"; F2 & "2"). Observe that the previous reference, B2, is just a concatenation of B and 2. Since B is stored in the cell F2, the address is not F2 & "2". However, F2 & "2" is just a string, not an address. To make an address, we can use the INDIRECT function: INDIRECT(F2 & "2"). So we just need to replace B2 address with that function. The new formula in the cell G2 will be: =FIND("|"; INDIRECT(F2 & "2")).

The other formulas will be:

  • H2: =FIND("|"; INDIRECT(F2 & "2"); G2+1)
  • I2: =NUMBERVALUE(LEFT(INDIRECT(F2 & "2"); G2-2); ".")
  • J2: =NUMBERVALUE(MID(INDIRECT(F2 & "2"); G2+2; H2-G2-3); ".")
  • K2: =NUMBERVALUE(RIGHT(INDIRECT(F2 & "2"); LEN(INDIRECT(F2 & "2"))-H2-1); ".")

I know that the formulas got longer now, but they can be easily reused.

Just select the F..K columns, copy to clipboard, select the L column and paste. Select the R column and paste. And select the X column and paste.

Now change the value in L2 to C, in R2 to D, and in X2 to E. All the other formulas were automatically recalculated for the other data.

Extract from other rows

We have the values extracted from a single row. Select the cells F2..AC2 and copy them to the other rows, e.g. by using filling the same as in the above example with titles.

You may notice that every row contains the same values. Where is the problem? That the reference in INDIRECT(F2 & "2") in the second row is changed to INDIRECT(F3 & "2") in the other row. The "2" value, which states for the address’ row number, is not changed. There is a quick way to solve it – use the ROW function which returns the number of current row.

So, I replaced INDIRECT(F2 & "2") with INDIRECT(F2 & ROW()) in every formula in the cells FG..K2. I repeated copying those cells to the next columns (see the last two steps in the previous chapter): to L, R, X. Then I copied the second row (F2..AC2) again over the other rows, and now all values are correctly extracted.

There is too much data

No wonder, all data is visible only after zooming out:

What’s important, not all of this data is relevant. The columns B..E contain raw data. The columns F,G, H, and repeated four times for every parsed cell, contain auxiliary functions. It’s a good idea to hide them all. Select them either as the range B..H or one by one while holding the CTRL key, then right-click one of the selected headers and choose Hide.

The data is more concise, but not very clear. Some headings would be helpful.

Averages

Let’s calculate the averages now.

Put the following formula to the cell AD2: =AVERAGE(I2; O2; U2; AA2) to get the first timer’s average. Copy it to two next cells to calculate the second timer’s average and memory consumption’s average. Copy to the other rows for full data:

Let’s put the data onto a graph.

Chart

I’ll show creating the chart in LibreOffice Calc, but it’s nearly the same in Excel.

  1. Select the column A, hold the CTRL key and select also AD, AE, AF.
  2. Select the menu Insert > Chart.
  3. Choose Column and Line (1 line) > Stacked.
  4. Click the Finish button.
  5. The graph is visible, but because time and memory have a totally different scale, not much is visible. Let’s use a different scale for timers and memory.
  6. Select the line on the graph.
  7. Select Format > Format Selection (it’s Properties in Excel).
  8. Choose the Secondary Y axis option and click the OK button.
  9. Move the chart by its border under the data.
  10. Change the labels to be better visible (they should be radically trimmed in length by the way).

Check the following video guide:

Summary of functions

  • LEFT(text; length) – take the leftmost part of a text – the first length characters.
  • MID(text; from; length) – take the middle of a text – length characters from the from position.
  • RIGHT(text; length) – take the rightmost part of a text – the last length characters.
  • LEN(text) – get the length of text.
  • FIND(search_in; search_for [; from]) – get the position of the first occurrence of search_for within search_in. Optionally start searching from the from position.
  • NUMBERVALUE(text; decimal_separator) – convert a text into a number.
  • INDIRECT("A2") – convert a string into an address reference.
  • AVERAGE(range) – get the average value from numbers.

Leave a Reply

avatar
  Subscribe  
Notify of