I have not written anything for almost a year. One of the major changes during that time is that I am now working with a distributed BI architecture.
A distributed BI architecture does not add much extra work for a solution architect. The design of schemas and dashboards is the same, and the metadata is the same. The challenge is all on the system side, and supporting it, including debugging issues, is new. Understanding how it works is interesting in itself.
High Availability – The BI system needs to be always available. If a machine goes down, can the site stay up? For the front-end UI, we need a load balancer, which is the address that users actually access. Multiple nodes can serve user requests behind it; which node handles a given request may be random or may be configured by rules. HA means that the nodes involved provide the same services, so they can replace each other. Even if one node fails, the other nodes can still take care of new requests.
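To make the failover idea concrete, here is a minimal sketch of a load balancer that rotates across interchangeable nodes and skips any node marked unhealthy. The `Node` and `RoundRobinBalancer` classes and the node names are hypothetical, made up for illustration; a real deployment would rely on a dedicated load balancer with its own health checks rather than code like this.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in for one BI front-end node
    name: str
    healthy: bool = True


class RoundRobinBalancer:
    """Hands each new request to the next healthy node in rotation."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._next = 0  # index where the next search starts

    def pick(self):
        # Try every node at most once, starting from the rotation point
        for offset in range(len(self.nodes)):
            idx = (self._next + offset) % len(self.nodes)
            node = self.nodes[idx]
            if node.healthy:
                self._next = (idx + 1) % len(self.nodes)
                return node
        raise RuntimeError("no healthy node available")


nodes = [Node("bi-node-1"), Node("bi-node-2")]
lb = RoundRobinBalancer(nodes)

first = lb.pick().name    # bi-node-1
second = lb.pick().name   # bi-node-2

# Simulate a machine going down: new requests all land on the survivor
nodes[0].healthy = False
after_failure = [lb.pick().name for _ in range(3)]
```

Because both nodes provide the same service, the balancer does not need to know anything about them beyond whether they are up, which is what lets one node stand in for the other.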
Disaster Recovery – A stable BI system needs to support recovery. DR is about resuming operation after a disaster. It differs from HA in that the nodes in an HA setup are all actively serving requests, whereas a DR site is a backup and may be passive rather than active. A DR site is typically located some distance from the main site so that it won’t experience the same disaster.
Horizontal scaling – Horizontal scaling may also involve multiple distributed nodes, but it is a bit different. It is about reacting to the growth of your organization and thus the growth of the data. Adding more nodes to serve new requests is horizontal scaling, but it is more desirable when all nodes provide the same services, with no distinction between existing and new nodes.
Separation of Data Preparation/Integration and Data Request – Many conventional BI and ETL systems are separate by nature. But breaking up a system that was originally designed to run as one JVM and one process into separate but integrated systems is just as challenging. Data consistency requires that updated data not become available until the other integrated parts are ready as well. Latency is OK, but inconsistency is not. While the data is being updated, the users who rely on it to make decisions should not be affected until the moment the new, consistent data becomes available.
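One common way to get "latency is OK, inconsistency is not" is to prepare the new data off to the side and publish it in a single swap. The sketch below assumes an in-memory store; the `DataStore` class and the `revenue`/`orders` keys are hypothetical and only illustrate the pattern, not how any particular BI product implements it.

```python
import threading


class DataStore:
    """Serves the last fully published snapshot while a new one is staged."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._published = dict(initial)

    def snapshot(self):
        # Readers take one reference and use it for the whole request,
        # so every value they see comes from the same published version.
        # The returned dict must be treated as read-only.
        return self._published

    def publish(self, staged):
        # Replace the whole snapshot in one step; readers never observe
        # a half-updated mix of old and new values.
        with self._lock:
            self._published = dict(staged)


store = DataStore({"revenue": 100, "orders": 10})

# The data preparation side works on its own staging copy
staging = dict(store.snapshot())
staging["revenue"] = 250

# Midway through the load, users still see the old, consistent data
during = store.snapshot()
seen_during_load = (during["revenue"], during["orders"])

staging["orders"] = 25
store.publish(staging)

after = store.snapshot()
seen_after_publish = (after["revenue"], after["orders"])
```

In a real distributed setup the "swap" would more likely be a version pointer in the shared metadata database or a directory switch on the shared file system, but the idea is the same: readers only ever see a complete version.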
Technologies
ZooKeeper – I don’t want to use the term database to describe it, but it is indeed a database that stores a specific type of data: the messages between nodes.
Apache Helix – A cluster management framework. There is a good SlideShare presentation on it.
Shared Metadata Database
Shared File System
An enterprise-level BI architecture is not just about providing instant queries with minimal latency. It is about being highly available, supporting disaster recovery, and being scalable over time. The types of challenges encountered indicate how much the deploying company relies on the system and reflect the stage of the product and the company in their lifecycles.