Sorting data efficiently has always been a challenging task in computer science but there is also another challenge you will encounter after you tackle the sorting. Keeping your data sorted while updating it.
Imagine having a CONTENT table that stores thousands of contents and you allow content management users to manipulate the order of the contents. For example, a cms user might move content from the 5002nd row to the 13th row. To provide this feature, there are a couple of options you can choose from.
Basic Ranking
The first and most obvious option, is to create a column that stores the order indexes. In that case, when a user moves a record from the 20th row to the 15th row you have to update all the rows in between.
This might be okay for small tables but it can cause huge performance drawbacks for large datasets.
GreenHopper Ranking
The second option is to create indexes with a predefined gap between them, AKA GreenHopper rank. By doing that, we will be able to place content in between two other contents by just calculating the middle value of two indexes. Check out this example. We want to move Content Z to a higher rank.
As you can see from the image, because we left some empty indexes in content ranks, it is easy to move them around without updating any other row. In this case, we only have to update one row and the job is done! 👏
The issue with this system is you might need to re-index your data when your empty ranks are exhausted. A simple solution to this which was also used by JIRA in their early days is locking the system, making it unavailable until the re-indexing is completed. Well, we don’t want that. To achieve this without compromising from uptime is another story. You can have a copy table to move your current data while re-indexing it and after the process is complete, switch to the new table reading data from freshly re-indexed table. Yet again, this is not even close to an ideal solution.
GreenHopper ranking solves our update problem but the re-indexing process is tedious work that has to be handled very carefully and it contains a lot of parts that can cause trouble in the future.
Lexorank
So what is a better approach? While searching for a solution we came across Lexorank. It is the ranking system used by Atlassian on their project management tool called JIRA. Everyone who used JIRA knows that you move things around a lot. 📈
Lexorank has a similar concept to GreenHopper but with a couple of tweaks to make it easier to manage. Firstly, it is using base 36 strings instead of integers (e.g. “as5vdf”) which increases the maximum difference you can get with one character. By changing the last digit in an integer you can only have 10 different values to select from but if you use base 36 it is possible to choose from 36 different values. So less character and larger gaps! Great! But that’s not the real selling point of Lexorank.
As you can see from this example when we try to move Content Z to a higher rank we don’t need to update any other line. Calculation of the new rank is also quite simple.
/**
* Calculates the middle rank between two lexorank values
* if available.
* @param first low rank
* @param second high rank
* @return Optional String if the difference is > 1, Optional.empty() otherwise.
*/
private Optional<String> calculateMiddleRank(String first, String second) {
long firstRank = BaseConversionUtil.convertBase36to10(first);
long secondRank = BaseConversionUtil.convertBase36to10(second);
long difference = secondRank - firstRank;
if (difference < 2) {
//There is no empty rank in between these values.
return Optional.empty();
} else {
long middleRank = firstRank + difference / 2;
return Optional.of(BaseConversionUtil.convertBase10to36(middleRank));
}
}
With Lexorank, similarly to GreenHopper ranking, it is possible to run out of space. This means we also need to implement a re-indexing algorithm. Before explaining the re-indexing algorithm, we should understand the buckets.
Managing buckets
Lexorank has a mechanism called buckets. A bucket is represented by a digit separated by a pipe from the alphanumeric rank. This is used while re-indexing and allows us to serve data from the table while we’re in the middle of a re-indexing task. You can have as many buckets as you like but three is the standard number.
Buckets are used in a rotating order so they will not go above 2 or below 0 (0=>1=>2=>0…). Because these bucket indexes are the preceding part of our Lexorank values, it will help preserve the order while the re-indexing is in process.
For example, when the current bucket number is 2 we start from the top of the list ordered by current ranks and start updating the rank column from “0000” with increasing values. We start from the bottom of our ordered list while the bucket number is increasing but from the top when it is resetting from 2 to 0. This will allow us to protect the current order while updating is in process.
Calculating the default gap
Default gap means how much space we will have between two neighboring rows when the re-indexing is completed.
The gap size between two ranks can be selected beforehand and hard-coded but it doesn’t have to be. It is possible to calculate a maximum difference you can use by dividing the total possible ranks by the current size of the table and using the result as gap size. Another option is to calculate a static List<String> of default rank values once and use them whenever re-indexing is required but this means you have to set a maximum size for your table.
private static void calculateDefaultRanks() {
//MAX_VALUE is ZZZZZ converted to base 10 which is 60466175
// RANKING_SIZE is the maximum size of the table can grow
int rankingStep = MAX_VALUE / RANKING_SIZE;
DEFAULT_RANKS.add(FIRST_VALUE); //FIRST_VALUE = "00000"
long curr = 0;
for (int i = 0; i < RANKING_SIZE-1; i++) {
curr = curr + rankingStep;
String strValue = BaseConversionUtil.convertBase10to36(curr);
DEFAULT_RANKS.add(strValue);
}
}
Considering we’re using 5 digit ranks, our last element will be “zzzzz”. This value is equal to 60.466.175 in base 10. If we’ve 10.000 rows of data, by dividing 60.466.175 to 10.000 it is possible to calculate a perfect value to use as an increment. Doing this makes our indexing system more flexible and allows it to have the maximum possible difference between neighboring rows at each time. Of course, because we roll up or down while this division operation, we might have some space left empty at the top or the bottom of our ranks. For example, the last element might not be “zzzzz” if we’ve started from the top. Similarly, the first element might not be “00000” if we’ve started from the bottom but this is just esthetics. It shouldn’t be a problem.
Re-Indexing with Lexorank
Re-Indexing in Lexorank means starting from the top or bottom of the list updating the ranks by increasing/decreasing the calculated gap size.
For example, if the current bucket number is 2 which means we will be resetting our bucket number to 0. So we need to start updating values from the top of our current table. This allows us to make sure every updated rank value will be ranked higher on the list compared to non updated ones because in string sort 0 comes before 2. Starting the re-indexing process from the top of the list guarantees that the indexes will be preserved even in the middle of the re-indexing.
If we receive a query for the table in this state, even though we didn’t complete re-indexing, we can still serve the correctly ordered table to the client. This actually is the real selling point of Lexorank.
Conclusion
Lexorank appears to be superior compared to alternatives but on the other hand, it is a bit harder to implement. As always, there is no silver bullet, you should study your needs with your team and choose an algorithm accordingly.
We came across this beautiful ranking system while searching for a better solution for our content ranking system at TV+ which contains hundreds of unique content that needs to be sorted and updated from time to time. Lexorank helped us with that and I hope it will help you in the future.
Thanks for reading. I hope you enjoy this blog post, stay tuned for more content!