I’ll show how to parse text, extract numbers and calculate averages on an example of performance data. I have performed a dozen of performance scenarios, each repeated four times to get an average. Every sample contained three values: time in seconds of execution of the first step, time in seconds of execution of the second step, and memory consumption in MB. For instance,
17.6 | 7.4 | 687 meant:
- step 1 took 17.6 seconds
- step 2 took 7.4 seconds
- the application used 687 MB of memory
I pasted the results of four runs into four consecutive cells for every scenario:
How to extract the numbers from those cells to calculate averages?
Table of contents
Parse the title
Let’s first change the title – it is an output from a
git log --oneline command, but all that’s important in this summary is the title alone without the “pick” and hash texts. For example, instead of “pick 4e28481 Performance: Added performance timers” we want just “Performance: Added performance timers”.
Apparently, this is a simple task, because the text to be removed has always the same length: 4 characters of “pick”, 1 for space, 7 characters of the hash and again 1 for space; 13 characters in total. We have to use a function that will take a fragment of a string:
MID. It accepts 3 parameters:
- the source string, it will be
- from which character the fragment should start: it will be
14(ignore the first 13 characters and start from the 14th)
- the length of the fragment: unfortunately we can’t just ask for all text until the end without providing the length. It will be equal to the length of the original text minus the 13 characters we want to be removed:
LEN(A2) - 13
- the general formula is:
LEN(A2) - starting position + 1. In our case, it is
LEN(A2) - 14 + 1
- sometimes it’s possible to simply provide any larger number, e.g.
100, but it’s more confusing
I put that formula into the first available cell in the second row:
F2. Then I used the fill feature to copy the formula onto other rows:
The last step is overwriting the column
A with values from column
F. Note that it’s not enough to copy the column
A, because there are formulas in the former column – moreover, they are dependent on the latter column. Doing so will result in an
The correct way is: select the
F column by clicking the column header, right-click the
A column and select Paste Special, then Values or Text:
Finally, I dropped the
F column by right-clicking it and selecting Delete.
Extract from a single cell
Let’s first parse one cell:
B2. It contains the following string:
17.6 | 7.4 | 687. How to extract the three values from it? One of the simplest methods is treating the
| characters as delimiters. First, we need to find positions of the two
| characters, and then copy the text between them using simple functions:
, instead of
; for separating functions’ parameters for the US locale.
- I store the position of the first
|character in the cell
- I store the position of the second
|character in the cell
G2. To find a second or next the same character or substring, just start looking after the position of the previous occurrence (the last parameter):
=FIND("|"; B2; F2+1)
- Let’s do some math now:
- I store the first value (
17.6) in the cell
H2. Because it starts from the beginning of the string, it’s best to use the
=LEFT(B2; F2-2). Note that the result is a string, not a number, so it cannot be used i.a. to calculate the average. To do so, we have to convert string to a number, e.g. by using the
NUMBERVALUEfunction. Its first parameter is the string to be converted to a number, and the second parameter is the decimal point character. Finally, the cell
H2should contain this function:
=NUMBERVALUE(LEFT(B2; F2-2); ".")
- I store the second value (
7.4) in the cell
I2. Because it starts in the middle of the string, it’s best to use the
=MID(B2; F2+2; G2-F2-3). Similarly, it should be converted to a number, so the final formula is:
=NUMBERVALUE(MID(B2; F2+2; G2-F2-3); ".")
- I store the last value (
687) in the cell
J2. Because it ends with the string, I can use the
=RIGHT(B2; LEN(B2)-G2-1). Again, the formula after conversion is:
=NUMBERVALUE(RIGHT(B2; LEN(B2)-G2-1); ".")
Now we have all values extracted from a single cell:
Extract from other cells in a row
What should be done in order to extract the second cell,
C2? We could copy the same formulas as in the cells
J2, but we would have to edit every one to change the reference from
C2. It’s not extremely tedious, but there are two more cells to be parsed. It’s worth learning an other way, which will allow you to copy cells faster.
The INDIRECT function
The key is to store a part of the reference to the cell that is being processed. In my case, that cell was
B2 and it occurred in every function. Copying the functions would be faster if I extracted the reference
B2 or its part to a separate cell.
Let’s add a column before the
F column and put the value
B into the cell
The function in
=FIND("|"; B2), and it should be
=FIND("|"; F2 & "2"). Observe that the previous reference,
B2, is just a concatenation of
B is stored in the cell
F2, the address is not
F2 & "2". However,
F2 & "2" is just a string, not an address. To make an address, we can use the
INDIRECT(F2 & "2"). So we just need to replace
B2 address with that function. The new formula in the cell
G2 will be:
=FIND("|"; INDIRECT(F2 & "2")).
The other formulas will be:
=FIND("|"; INDIRECT(F2 & "2"); G2+1)
=NUMBERVALUE(LEFT(INDIRECT(F2 & "2"); G2-2); ".")
=NUMBERVALUE(MID(INDIRECT(F2 & "2"); G2+2; H2-G2-3); ".")
=NUMBERVALUE(RIGHT(INDIRECT(F2 & "2"); LEN(INDIRECT(F2 & "2"))-H2-1); ".")
I know that the formulas got longer now, but they can be easily reused.
Just select the
K columns, copy to clipboard, select the
L column and paste. Select the
R column and paste. And select the
X column and paste.
Now change the value in
D, and in
E. All the other formulas were automatically recalculated for the other data.
Extract from other rows
We have the values extracted from a single row. Select the cells
AC2 and copy them to the other rows, e.g. by using filling the same as in the above example with titles.
You may notice that every row contains the same values. Where is the problem? That the reference in
INDIRECT(F2 & "2") in the second row is changed to
INDIRECT(F3 & "2") in the other row. The
"2" value, which states for the address’ row number, is not changed. There is a quick way to solve it – use the
ROW function which returns the number of current row.
So, I replaced
INDIRECT(F2 & "2") with
INDIRECT(F2 & ROW()) in every formula in the cells
K2. I repeated copying those cells to the next columns (see the last two steps in the previous chapter): to
X. Then I copied the second row (
AC2) again over the other rows, and now all values are correctly extracted.
There is too much data
No wonder, all data is visible only after zooming out:
What’s important, not all of this data is relevant. The columns
E contain raw data. The columns
H, and repeated four times for every parsed cell, contain auxiliary functions. It’s a good idea to hide them all. Select them either as the range
H or one by one while holding the
CTRL key, then right-click one of the selected headers and choose Hide.
The data is more concise, but not very clear. Some headings would be helpful.
Let’s calculate the averages now.
Put the following formula to the cell
=AVERAGE(I2; O2; U2; AA2) to get the first timer’s average. Copy it to two next cells to calculate the second timer’s average and memory consumption’s average. Copy to the other rows for full data:
Let’s put the data onto a graph.
I’ll show creating the chart in LibreOffice Calc, but it’s nearly the same in Excel.
- Select the column
A, hold the
CTRLkey and select also
- Select the menu Insert > Chart.
- Choose Column and Line (1 line) > Stacked.
- Click the Finish button.
- The graph is visible, but because time and memory have a totally different scale, not much is visible. Let’s use a different scale for timers and memory.
- Select the line on the graph.
- Select Format > Format Selection (it’s Properties in Excel).
- Choose the Secondary Y axis option and click the OK button.
- Move the chart by its border under the data.
- Change the labels to be better visible (they should be radically trimmed in length by the way).
Check the following video guide:
Summary of functions
LEFT(text; length)– take the leftmost part of a text – the first
MID(text; from; length)– take the middle of a text –
lengthcharacters from the
RIGHT(text; length)– take the rightmost part of a text – the last
LEN(text)– get the length of text.
FIND(search_in; search_for [; from])– get the position of the first occurrence of
search_in. Optionally start searching from the
NUMBERVALUE(text; decimal_separator)– convert a text into a number.
INDIRECT("A2")– convert a string into an address reference.
AVERAGE(range)– get the average value from numbers.