Understanding expressions.
OpenRefine supports "expressions" mostly to transform existing data or to create new data based on existing data, much like how spreadsheet software supports "formulas". There are, however, significant differences between [DocumentationForUsers#Expressions OpenRefine's expressions] and typical spreadsheet formulas.
Consider this sample data set
friend | age | |
---|---|---|
1. | John Smith | 28 |
2. | Jane Doe | 33 |
When you invoke the Transform command on, say, column "friend" and enter an expression, OpenRefine will go through each row in the data (matching facets and filters, if any), and evaluate that expression for that row in order to obtain a result for that row. Whereas in a spreadsheet, you would need to store a different formula in each cell of that column, in OpenRefine, you only need one single expression. And that's made possible through the use of Variables, as explained next.
When evaluated on a row in the example above, the expression can access that row and the cell in the column "friend" of that row through Variables. Think of a variable as a placeholder for something else. For example, there is a variable named "value" that is the placeholder for the cell's content. When the expression is evaluated on row 1, the variable "value" will stand for "John Smith"; when evaluated on row 2, it will stand for "Jane Doe". So, if the expression is
value.split(" ")[1]
then for row 1, it will yield "Smith" and for row 2, "Doe". That expression splits against the space char found and takes the 2nd part.
IMPORTANT TIP: [1] is equivalent to saying "The 2nd part of an array or list" in GREL since indexing of arrays or lists in Refine actually begins with [0] or "The 1st part of an array or list".
Using Variables, a single expression yields different results for different rows.
Note that an expression is typically based on one particular column in the data--the column whose drop-down menu is invoked. A lot of Variables are created to stand for things about the cell in that "base column" of the current row on which the expression is evaluated. But there are still Variables about the whole row, and through them, you can access cells in other columns.
Whereas each spreadsheet software has its own formula language, and only one language, OpenRefine is capable of supporting several languages for writing expressions. OpenRefine has its own native language called OpenRefine Expression Language (GREL), but you could also use Jython, or other languages if you install OpenRefine extensions that support them.
Where in GREL you write
value.split(" ")[1]
in Jython you would write
return value.split(" ")[1]
For that example the two languages are similar enough, but they don't have to be. On this documentation wiki, we will focus mostly on GREL. If you use another language, like Jython, please refer to its own documentation.
GREL is designed to resemble Javascript. So you can expect these basic things to work, and know how they would work:
example | description |
---|---|
value + " (approved)" | concatenate two strings; whatever is in value gets converted to a string first |
value + 2.239 | add two numbers; if value actually holds something other than a number, this becomes a string concatenation |
value.trim().length() | trimming leading and trailing whitespace of value and than take the length of the result |
value.substring(7, 10) | take the substring of value from character index 7 up to and excluding character index 10 |
value.substring(13) | take the substring of value from character index 13 until the end of the string |
If you're used to Excel, note that the operator for string concatenation is not & but +.
In OpenRefine expression language function can be invoked using either of these 2 forms:
- functionName(arg0, arg1, ...)
- arg0.functionName(arg1, ...)
The second form above is a shortcut to make expressions easier to read.
It's only syntactic sugar or shorthand, such as:
dot shorthand notation | full notation |
---|---|
value.trim().length() | length(trim(value)) |
value.substring(7, 10) | substring(value, 7, 10) |
value.substring(13) | substring(value, 13) |
The first argument to a function can be swapped out in front of the function to formulate the dot notation. That is easier to read as the functions occur from left to right in the order of calling, rather than in the reverse order.
The dot notation can also be used to access member fields:
example | description |
---|---|
cell.value | same as just value because cell stands for the current cell |
row.index | index of the current row |
For member fields whose names are not words (e.g., they contain spaces or other characters), use the bracket notation:
| cells["First Name"] | access the cell in the column called "First Name" of the current row |
Brackets can also be used to get substrings and sub-arrays, and their syntax resembles Python a bit more than Javascript:
example | description |
---|---|
value[1,3] | access the substring of value starting from character index 1 up to but excluding character index 3 |
"internationalization"[1,3] | return nt |
"internationalization"[1,-2] | return nternationalizati (negative indexes are counted from the end) |
GREL supports branching and looping (e.g., "if" and "for") slightly differently than Javascript. To do branching and looping, you use GREL "controls", or sometimes also called "constructs", and their syntax is more like Excel's IF:
example | description |
---|---|
if(value.length() > 10, "big string", "small string") | if the length of what value stands for is great than 10 characters, then return "big string", otherwise, return "small string" |
if(mod(row.index, 2) == 0, "even", "odd") | if the 2 modulus of the row index is zero, then output "even", otherwise, output "odd" |
Thus, the if control has this syntax
if ( test_condition , true_result , false_result )
The test_condition sub-expression is evaluated. If it yields true, then the true_result sub-expression is evaluated; otherwise, the false_result sub-expression is evaluated.
GREL doesn't support looping in the conventional meaning, as supported by for in Javascript. However, you can still process arrays of things using the various for- controls, e.g.,
| forEach("Once upon a time in Mexico".split(" "), v, v.length()) | return array of lengths of words, [4, 4, 1, 4, 2, 6] |
The forEach control has this syntax
for ( array_subexpr , element_var_name , element_subexpr )
The array_subexpr sub-expression is evaluated to an array. For each element in that array, element_subexpr sub-expression will be evaluated with the variable named by element_var_name standing for that element. In the example above, when v.length() is evaluated for the first element--the string "Once", the variable called v stands for that string "Once", and the sub-expression v.length() evaluates to 4.
Another useful control is with that can be used to define a new variable. For instance, if we want to compute the average word length for the string "Once upon a time in Mexico", then we might start with
forEach("Once upon a time in Mexico".split(" "), v, v.length()).sum() / "Once upon a time in Mexico".split(" ").length()
We have "Once upon a time in Mexico".split(" ") repeated twice. To shorten that expression, we can define a variable to stand in that sub-expression's place:
with("Once upon a time in Mexico".split(" "), a, forEach(a, v, v.length()).sum() / a.length())
Thus, the with control has this syntax
with ( subexpr1 , var_name , subexpr2 )
First, the sub-expression subexpr1 is evaluated, and then a new variable called var_name is defined to stand in its place while the sub-expression subexpr2 is evaluated.