The Lens platform is built on the principles of distributed architecture. Some of the highlights of the platform are:
Before you start analyzing data, you need to set up trackers on your website or mobile app so that all user events are captured accurately.
Scribe is a Collector API that collects data from the trackers and writes it to the Kinesis stream. The Collector API, developed in Go, acts as a listener and continuously checks for data from the trackers. The Collector API also performs minimal processing for the following:
For the above cases, Facebook/Google allow the trackers to measure only those key metrics that do not affect page-load performance. The Collector API then converts the available key metrics into a standardized format accepted by the Lens platform. The Scribe module runs as containers on Amazon Elastic Container Service (ECS), and an Application Load Balancer (ALB) balances the load across multiple EC2 instances hosted in different availability zones.
Once the data passes through Scribe, it is written to Kinesis, which acts as a data storage queue. Kinesis can store up to 7 days of data. One of the most significant advantages of Kinesis is that data can be retrieved from any point in the queue within that retention window, so you can get timely insights and react quickly to new information.
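The "retrieve from any point in the queue" property can be illustrated with a small in-memory stand-in for the stream. This is only a sketch under stated assumptions: `ReplayableStream`, `put`, `expire`, and `read_from` are hypothetical names and not the Kinesis API, and timestamps are passed in explicitly as seconds to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class ReplayableStream:
    """In-memory stand-in for a retained, replayable stream (not the Kinesis API)."""

    retention_seconds: float
    _records: list = field(default_factory=list)  # (sequence, timestamp, payload)
    _next_sequence: int = 0

    def put(self, payload, now):
        """Append a record and return its sequence number."""
        sequence = self._next_sequence
        self._records.append((sequence, now, payload))
        self._next_sequence += 1
        return sequence

    def expire(self, now):
        """Drop records that are older than the retention window."""
        cutoff = now - self.retention_seconds
        self._records = [r for r in self._records if r[1] >= cutoff]

    def read_from(self, sequence):
        """Return payloads at or after the given sequence number, oldest first."""
        return [payload for seq, _, payload in self._records if seq >= sequence]


SEVEN_DAYS = 7 * 24 * 60 * 60  # retention window in seconds
stream = ReplayableStream(retention_seconds=SEVEN_DAYS)
stream.put({"event": "page_view"}, now=0)
checkpoint = stream.put({"event": "add_to_cart"}, now=60)
stream.put({"event": "purchase"}, now=120)

# A consumer can start at any retained position, not only at the newest record.
replayed = stream.read_from(checkpoint)
```

Because records stay in the stream after they are read, a consumer that fails or is redeployed can resume from its last checkpoint instead of losing events.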
Accumulo reads the data from Kinesis and converts it into Avro format files. This module, developed in Go and Java, is a containerized module designed to solve the small file problem that exists in the big data ecosystem. The module waits until the buffered data reaches the desired file size, converts the data to an Avro format file, and then writes this file to the data lake on S3.
Additionally, Accumulo performs schema validation, schema evolution and schema compatibility checks to ensure the raw data conforms to the standards defined by the platform.
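The size-based buffering described above can be sketched as follows. This is a minimal illustration and not the actual Accumulo code: `SizeBasedBatcher` and the `write_file` callback are hypothetical names, JSON encoding stands in for Avro, and an in-memory list stands in for S3.

```python
import json


class SizeBasedBatcher:
    """Buffers records and emits one batch once a target byte size is reached."""

    def __init__(self, target_bytes, write_file):
        self.target_bytes = target_bytes
        self.write_file = write_file  # callback that persists one batch
        self._buffer = []
        self._buffered_bytes = 0

    def add(self, record):
        """Encode a record, buffer it, and flush if the target size is reached."""
        encoded = json.dumps(record, sort_keys=True).encode("utf-8")
        self._buffer.append(encoded)
        self._buffered_bytes += len(encoded)
        if self._buffered_bytes >= self.target_bytes:
            self.flush()

    def flush(self):
        """Write the buffered records as one batch and reset the buffer."""
        if not self._buffer:
            return
        self.write_file(self._buffer)
        self._buffer = []
        self._buffered_bytes = 0


written_files = []  # stand-in for files written to the data lake
batcher = SizeBasedBatcher(target_bytes=64, write_file=written_files.append)
for i in range(10):
    batcher.add({"event": "page_view", "id": i})
batcher.flush()  # write whatever is left, for example on shutdown
```

Writing one larger file per batch instead of one file per event is what avoids the small file problem. A production version would typically also flush on a time limit so that quiet periods do not delay data indefinitely.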
Prism is a unified data processing engine that cleanses, reformats and enriches the data. The data (in the desired format) is then written to different data stores, which in turn power the Lens Dashboard.
Prism supports both Lambda and Kappa architectures, backed by the transactional support of Databricks Delta Lake.
Prism supports the following workflows:
This workflow supports a constant flow of data from Kinesis, which updates with high frequency. Real-time streaming analytics is particularly useful for analyzing real-time data, such as “How is the key metric performing at this point in time?”, and for realigning the business strategy, for example, “How can we improve the key metric performance?”.
For real-time streaming analytics, we use a Spark Structured Streaming job, which runs on either Amazon EMR or Databricks in the AWS environment.
After processing, the data is copied to Elasticsearch Service, which powers the real-time streaming Lens Dashboards.
This workflow supports the processing of a large volume of data collected over a period of time. Batch streaming is particularly useful for analyzing historical data, such as “Compare the key metric performance for the current week with last week”.
This workflow generates recommendations and personalization. The recommendation and personalization engine uses Amazon SageMaker or Databricks MLflow for ML model lifecycle management on top of the raw and semi-processed data in the data lake.
The data cleansing process ensures data quality and utility by catching and correcting errors before further processing. The following data cleansing processes are applied:
Data reconciliation ensures that the business-critical conversion event data matches the recorded OLTP transactions. It also recalculates the aggregated results when late events arrive, which improves overall data accuracy.
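The late-event handling can be sketched as a recalculation of only the time buckets that late events fall into. This is an illustrative sketch, not the platform's implementation: the hourly bucket size, the field names `event_time` (seconds) and `value`, and the function names are all assumptions.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def aggregate_by_hour(events):
    """Sum event values per event-time hour bucket."""
    totals = defaultdict(float)
    for event in events:
        bucket = event["event_time"] // SECONDS_PER_HOUR
        totals[bucket] += event["value"]
    return dict(totals)


def reconcile(totals, late_events):
    """Recalculate only the buckets touched by late-arriving events."""
    updated = dict(totals)
    for event in late_events:
        bucket = event["event_time"] // SECONDS_PER_HOUR
        updated[bucket] = updated.get(bucket, 0.0) + event["value"]
    return updated


on_time = [
    {"event_time": 100, "value": 20.0},
    {"event_time": 3700, "value": 5.0},
]
# This event happened in hour 0 but arrived after the first aggregation ran.
late = [{"event_time": 200, "value": 10.0}]

first_pass = aggregate_by_hour(on_time)
corrected = reconcile(first_pass, late)
```

Keying the buckets on event time rather than arrival time is what lets the corrected result match a full recomputation over all events.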
Enriched data enables you to gain valuable insights about your audience segments and alter your marketing and business strategies to suit the audience.
We employ the following methodologies to enrich your data:
Based on the business requirements, the application loads the cleansed and enriched data to different data stores. The enriched data is written to the primary data store, Amazon Redshift, and to the data lake as Parquet files. Aggregation is performed to minimize the processing time required for data comparison and to improve the user experience. Cleansed data is aggregated over a given time period to provide statistics such as the key metric average for a quarter. You can analyze the aggregated data to gain insights about specific key metrics. The aggregated data is stored in Amazon Aurora.
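As an illustration of the kind of aggregation described above, the following sketch computes a per-quarter average of a key metric. It is a minimal example under assumptions: the `(date, value)` sample layout and the function names are hypothetical, and the real aggregation runs in the platform's processing engine rather than in plain Python.

```python
from collections import defaultdict
from datetime import date


def quarter_of(day):
    """Return (year, quarter) for a date, with quarters numbered 1 to 4."""
    return day.year, (day.month - 1) // 3 + 1


def quarterly_average(samples):
    """Average a metric per calendar quarter from (date, value) samples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, value in samples:
        key = quarter_of(day)
        sums[key] += value
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


samples = [
    (date(2023, 1, 15), 10.0),
    (date(2023, 3, 31), 20.0),
    (date(2023, 4, 1), 40.0),
]
averages = quarterly_average(samples)
```

Storing the precomputed averages means a dashboard query reads one row per quarter instead of scanning every cleansed event.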
The data lake is a storage repository that holds a vast amount of raw and enriched data until it is needed. The raw, minimally processed data from Accumulo is copied to S3, which is the data lake of choice for the Lens platform. Amazon SageMaker then uses this data for ML workflows.
Based on the sharpness and the level of cleansing, data is classified into the following categories in the data lake using Databricks Delta Lake:
Note: Databricks Delta Lake also supports ACID transactions.
The Lens platform uses the following data sources:
Lens Dashboard is an in-house, powerful data visualization UI that can be customized based on your business needs.
You can use the Lens Dashboard to:
The Lens Dashboard can be powered by any of the data sources, depending on the business question. The user’s cached page is copied to the nearest edge location and served using Amazon CloudFront. The Lens Dashboard is also designed on distributed architecture concepts to ensure scalability and improve fault tolerance.
Some of the custom reports available are: