I was recently asked my thoughts on how I would solve a slow reporting and analytic environment within Mendix. The developer described the following scenario:
I have been working on a project that creates a lot of data (millions of records per year) with no issues but I am not so happy about the analysis and reporting side where dealing with the large number of records means I need to summarise/abstract many views in order to get a reasonable response. What I have works but I wonder whether there is a better solution that will provide even more benefits for the client. With this in mind I was looking to get in contact with the person who blogged with a view to discussing/sharing approaches and how to optimise such large data set BI type applications using Mendix.
Granted, in order to truly design a practical solution there would need to be a much further dive into the application, goals, and benefits analysis. But there was a lot of information provided.
The application creates millions of records per year.
- '...not so happy' is most likely due to poor performance. It's possible that there are some metric issues here as well that might require development of other business rules but that isn't explicitly outlined.
- Currently developing aggregate denormalized tables to meet demand
- Focus on client satisfaction (most important in BI development)
- Optimize large data sets
- Based on this information, my answer is as follows:
While I've built out some basic reporting (operational level) within Mendix, I have as of yet to develop a robust BI application with the tools. However, I can share with you my ideas on how I will given the right opportunity based on my experience in the BI realm.
Firstly I would ensure I move the analysis entities into their own module. On-Line Transactional Processors (OLTP's) that underlay great application designs are fantastic for data input, but lousy for data extraction necessary for reporting and analysis. I would build a new star-schema (Operational Data Store, or ODS, based in On-Line Analytical Processing, or OLAP) domain design without the burden of key design since the associations build that essentially for you. In each dimension, I would Index the key values that the Analyst will use for slicing and dicing the fact dimension with key figures. And I would populate this domain model with scheduled events that run asynchronously every 30 minutes or so, or whatever makes sense for your application. Generally, too often will lead to poor performance because of all the keys being reindexed upon every commit, but not often enough will leave you with stale data.
That's for the basic root-cause, trending, and what-if scenario support.
For real-time, you really want to limit the reports and alerts on top of the ODS's domain module to recent events, such as commits (change date) within the past month. Those sorts of querys or data retrieves are intended to help operational managers make intra-day and intra-week decisions, whereas the analysis component is the bigger picture that you do want abstracted.
Your approach of aggregate denormalized entities is not uncommon or even inappropriate. But given the volumes and velocity you described, I would consider the ODS approach.
This is all theory-crafting but rooted in deep experience. I'd love to hear other ideas in the comments!