Platform for turning data into action

LENS

Lens is a unified data platform that enables you to leverage the power of big data to drive business decisions.

Lens offers a robust distributed architecture built on the Amazon Web Services (AWS) ecosystem with a wide range of analytical capabilities that can scale to satisfy your business requirements.

Lens Dashboard, a highly visual, information-rich analytics dashboard built on the Lens platform, enables you to compare key metrics, create custom reports and interactive dashboards, set business goals, and deliver user recommendations.

Architecture

The Lens platform is built on the principles of distributed architecture. Some of the highlights of the platform are:

  • Scalability without sacrificing performance
  • Data governance
  • Fault tolerance
  • Technology agnosticism
  • Cost-effectiveness

Features of Lens Platform

  • Well-governed data lake
  • Big data stores for Online Analytical Processing (OLAP)
  • Smart dashboards
  • Unified data pipeline
  • Data processing framework
  • Artificial Intelligence (AI)/Machine Learning (ML) recommendations engine

Services and Technology Used Within Lens Platform

Cloud Services Used

  • Amazon Elasticsearch Service
  • Amazon Redshift
  • Amazon Kinesis
  • Amazon EMR
  • Amazon Elastic Container Service (ECS)
  • Amazon CloudFront
  • Databricks

Technology Used

  • Golang
  • Scala
  • Spark
  • Python
  • React/NodeJS
  • Java

Data Streams/Trackers

Before you start analyzing data, you need to set up trackers on your website or mobile app to ensure all your user events are captured accurately.

Tracking Ecosystem
  • Mobile App Tracking (Android SDK/iOS SDK): The Lens platform employs Software Development Kits (SDKs) to track user events within a mobile app, and provides SDKs for both iOS and Android. For example, all in-app user events such as the conversion journey, app engagement, wishlists, browsing history and so on are captured via the SDKs.
  • Server-side Tracking (PHP SDK): To ensure all your business-critical events are captured accurately, the Lens platform employs a PHP SDK and tracks these events from the server side; for example, conversion events that carry revenue (an illustrative sketch follows this list).
  • Website Tracking (JS SDK): User interactions on your webpage give valuable insights into user engagement. The Lens platform employs the JS SDK to track webpage user events using pixels.
    • Google Tags are also used in conjunction with the JS SDK to ensure event instrumentation is accurate. The platform benchmarks data accuracy against analytics tools like Google Analytics.
    • The static SDK content is served from the user's nearest edge location using Amazon CloudFront.
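
A minimal illustration of what server-side tracking does, written in Python for consistency with the other sketches in this document (the production server-side SDK is PHP); the collector URL and field names are hypothetical.

```python
# Hypothetical server-side conversion event sent to the collector endpoint.
import time

import requests

event = {
    "event_name": "purchase",   # conversion event bound to revenue
    "user_id": "u_12345",
    "revenue": 49.99,
    "currency": "USD",
    "ts": int(time.time()),
}

# Business-critical, revenue-bearing events are reported from the backend so
# they are captured accurately rather than relying on the browser or app.
requests.post("https://collector.example.com/collect", json=event, timeout=2)
```
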
Scribe

Scribe is a Collector API that collects data from the trackers and writes it to the Kinesis stream. The Collector API, developed in Golang, acts as a listener and continuously receives data from the trackers. The Collector API also performs minimal processing for the following:

  • Facebook Instant Articles
  • Google Accelerated Mobile Pages (AMP)

For the above cases, Facebook and Google allow the trackers to measure only those key metrics that do not impact page loading performance. The Collector API then processes the available key metrics into a standardized format accepted by the Lens platform. The Scribe module is containerized using Amazon Elastic Container Service (ECS), and an Application Load Balancer (ALB) balances the load across multiple EC2 instances hosted in different Availability Zones.
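
A minimal sketch of the collector idea, assuming a hypothetical Flask endpoint and Kinesis stream name (the production Scribe service is written in Golang and runs on ECS behind an ALB):

```python
# Sketch only: illustrates "listen for tracker events, lightly standardize, write to Kinesis".
import json

import boto3
from flask import Flask, request

app = Flask(__name__)
kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM_NAME = "lens-events"  # hypothetical stream name


@app.route("/collect", methods=["POST"])
def collect():
    event = request.get_json(force=True)
    # Minimal processing: wrap the payload in a standardized envelope, e.g. for
    # AMP/Instant Articles payloads that carry only a reduced set of metrics.
    record = {
        "event_name": event.get("event_name", "unknown"),
        "user_id": event.get("user_id"),
        "ts": event.get("ts"),
        "payload": event,
    }
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record["user_id"] or "anonymous"),
    )
    return {"status": "accepted"}, 202
```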

Kinesis

Once the data passes through Scribe, it is written to Kinesis, which acts as a data storage queue. Kinesis can store up to 7 days of data. One of Kinesis's most significant advantages is that data can be retrieved from any point in the queue, so you can get timely insights and react quickly to new information.
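
A small sketch of replaying the stream from an arbitrary point in time with boto3; the stream name and look-back window are assumptions.

```python
# Re-read records from a chosen point in the Kinesis queue (retention allows up to 7 days).
from datetime import datetime, timedelta

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
STREAM_NAME = "lens-events"  # hypothetical stream name

start = datetime.utcnow() - timedelta(hours=6)  # replay the last 6 hours

for shard in kinesis.list_shards(StreamName=STREAM_NAME)["Shards"]:
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM_NAME,
        ShardId=shard["ShardId"],
        ShardIteratorType="AT_TIMESTAMP",
        Timestamp=start,
    )["ShardIterator"]
    while iterator:
        batch = kinesis.get_records(ShardIterator=iterator, Limit=1000)
        for record in batch["Records"]:
            print(record["ApproximateArrivalTimestamp"], record["Data"][:80])
        if not batch["Records"]:
            break  # caught up on an open shard
        iterator = batch.get("NextShardIterator")
```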

Accumulo

Accumulo reads the data from Kinesis and converts it into Avro format files. This containerized module, developed in Golang and Java, is designed to solve the small-file problem common in the big data ecosystem. The module buffers data until it reaches the desired file size, converts it into an Avro file, and writes the file to the data lake on S3.

Additionally, Accumulo performs schema validation, schema evolution and schema compatibility checks to ensure the raw data conforms to the standards defined by the platform.
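
A minimal sketch in Python of the buffering and validation idea behind Accumulo (the actual module is written in Golang and Java); the schema, bucket name and size threshold are assumptions.

```python
# Buffer records until a size threshold, validate against the schema, flush one Avro file to S3.
import io
import uuid

import boto3
from fastavro import parse_schema, writer
from fastavro.validation import validate

SCHEMA = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_name", "type": "string"},
        {"name": "user_id", "type": ["null", "string"], "default": None},
        {"name": "ts", "type": "long"},
    ],
})

BUCKET = "lens-data-lake"            # hypothetical bucket
FLUSH_THRESHOLD = 128 * 1024 * 1024  # target large files to avoid the small-file problem

s3 = boto3.client("s3")
buffer, approx_bytes = [], 0


def append(record: dict) -> None:
    """Validate one record against the schema and buffer it."""
    global approx_bytes
    validate(record, SCHEMA)  # schema check; raises on non-conforming records
    buffer.append(record)
    approx_bytes += len(str(record))
    if approx_bytes >= FLUSH_THRESHOLD:
        flush()


def flush() -> None:
    """Write all buffered records to the data lake as a single Avro file."""
    global buffer, approx_bytes
    out = io.BytesIO()
    writer(out, SCHEMA, buffer)
    s3.put_object(Bucket=BUCKET, Key=f"raw/events/{uuid.uuid4()}.avro", Body=out.getvalue())
    buffer, approx_bytes = [], 0
```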

Prism (Processing Layer)

Prism is a unified data processing engine that cleanses, reformats and enriches the data. The data (in the desired format) is then written to different data stores, which in turn power the Lens Dashboard. Prism supports both Lambda and Kappa architectures, with transactional support from Databricks Delta Lake.

The following workflows are supported by Prism:

Real-time Streaming Analytics

This workflow supports a constant flow of data from Kinesis, which updates with high frequency. Real-time streaming analytics is particularly useful for analyzing real-time data, such as “How is this key metric performing right now?”, and for realigning business strategy, for example, “How can we improve the key metric’s performance?”.

For real-time streaming analytics, we use a Spark Structured Streaming job, which runs on either Amazon EMR or Databricks in the AWS environment.

After processing, the data is copied to Amazon Elasticsearch Service, which powers the real-time streaming Lens Dashboards.
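
A hedged PySpark sketch of such a job, assuming the Databricks Kinesis source and the elasticsearch-hadoop ("es") sink are available on the cluster; the stream, index, endpoint and checkpoint names are hypothetical.

```python
# Read tracker events from Kinesis, parse them, and stream them into Elasticsearch.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("prism-realtime").getOrCreate()

event_schema = StructType([
    StructField("event_name", StringType()),
    StructField("user_id", StringType()),
    StructField("ts", LongType()),
])

events = (
    spark.readStream.format("kinesis")            # Databricks Kinesis source (assumption)
    .option("streamName", "lens-events")
    .option("region", "us-east-1")
    .option("initialPosition", "latest")
    .load()
    .select(from_json(col("data").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

(
    events.writeStream.format("es")               # elasticsearch-hadoop sink (assumption)
    .option("checkpointLocation", "s3://lens-checkpoints/realtime/")
    .option("es.nodes", "search-lens.example.com")
    .start("realtime-events")                     # target index
)
```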

Batch Streaming

This workflow supports the processing of a large volume of data collected over a period of time. Batch streaming is particularly useful for analyzing historical data, such as “Compare the key metric performance for the current week with last week”.
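
A hedged PySpark sketch of a week-over-week comparison on data from the lake; the path and column names are assumptions, and ts is taken to be epoch seconds.

```python
# Aggregate a key metric (weekly active users) per week and compare the last two weeks.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prism-batch").getOrCreate()

events = spark.read.parquet("s3://lens-data-lake/silver/events/")

weekly = (
    events.withColumn("week", F.weekofyear(F.to_date(F.from_unixtime("ts"))))
    .groupBy("week")
    .agg(F.countDistinct("user_id").alias("weekly_active_users"))
)

# Current week vs. previous week for the same key metric.
weekly.orderBy(F.desc("week")).show(2)
```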

Recommendation

This workflow generates recommendations and personalization. The recommendation and personalization engine uses AWS SageMaker/Databricks MLflow for ML model lifecycle management on top of the raw and semi-processed data in the data lake.
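
A minimal sketch of model lifecycle tracking with MLflow; the model, data and metric below are purely illustrative, not the platform's actual recommendation model.

```python
# Train an illustrative model and record its parameters, metrics and artifact with MLflow.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="recommendation-candidate"):
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", auc)
    mlflow.sklearn.log_model(model, "model")  # versioned artifact for later deployment
```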

Functions of Prism Processing Layer

Data Quality Checks

The data cleansing process ensures data quality and utility by catching and correcting errors before further processing. The following data cleansing processes are followed:

  • Bot Traffic Check: Malicious bots alter the data by scraping websites or misusing APIs. Bot detection methods distinguish real traffic from malicious bots.
  • Basic Data Cleansing: A few of the checks performed to ensure the data is clean (a sketch of these checks follows this list):
    • The IP address shouldn’t be null
    • Data should be bound by schema
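
A hedged PySpark sketch of these checks; the column names, the allowed event list, and the bot pattern are illustrative assumptions.

```python
# Apply basic quality filters: non-null IP, naive bot filtering, schema-bound event names.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prism-quality").getOrCreate()

raw = spark.read.parquet("s3://lens-data-lake/bronze/events/")

clean = (
    raw.filter(F.col("ip_address").isNotNull())                           # IP must be present
    .filter(~F.lower("user_agent").rlike("bot|crawler|spider"))           # naive bot filter
    .filter(F.col("event_name").isin("page_view", "click", "purchase"))   # schema-bound events
)
```
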
Data Reconciliation

Data reconciliation ensures that the business-critical conversion event data is in sync with the recorded OLTP transactions. It also recalculates aggregated results to handle late-arriving events, improving overall data accuracy.
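
A sketch, under assumed table and column names, of reconciling tracked conversion events against OLTP transactions and flagging mismatches for reprocessing.

```python
# Compare tracked revenue per order against the OLTP order total and surface mismatches.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prism-reconcile").getOrCreate()

tracked = spark.read.parquet("s3://lens-data-lake/silver/conversions/")
oltp = spark.read.parquet("s3://lens-data-lake/replicas/orders/")  # OLTP replica (assumption)

mismatches = (
    tracked.groupBy("order_id").agg(F.sum("revenue").alias("tracked_revenue"))
    .join(oltp.select("order_id", "order_total"), "order_id", "full_outer")
    .filter(
        F.col("tracked_revenue").isNull()
        | F.col("order_total").isNull()
        | (F.abs(F.col("tracked_revenue") - F.col("order_total")) > 0.01)
    )
)

mismatches.write.mode("overwrite").parquet("s3://lens-data-lake/reconciliation/mismatches/")
```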

Data Cleanser
  • De-duping data: Remove unwanted events from your data, including duplicate and irrelevant events. When you combine data sets from multiple trackers, scrape data, or receive data from clients, duplicates can easily creep in (see the sketch after this list).
  • Normalization: Ensure that the data is structured logically by flattening nested structures.
  • Client-clock synchronization issues: Ensure that client-side event timestamps are aligned with the server time.
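
A hedged PySpark sketch of the de-duplication, flattening and clock-adjustment steps; the nested context struct, the event_id key and the clock_offset_ms column are assumptions.

```python
# De-dupe on an event key, flatten a nested struct, and align client time with server time.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prism-cleanser").getOrCreate()

events = spark.read.parquet("s3://lens-data-lake/bronze/events/")

cleansed = (
    events.dropDuplicates(["event_id"])                        # remove duplicate events
    .select("*",
            F.col("context.device").alias("device"),           # flatten nested structure
            F.col("context.locale").alias("locale"))
    .drop("context")
    # adjust the client event time by the recorded offset so it matches server time
    .withColumn("ts", F.col("client_ts") + F.col("clock_offset_ms") / 1000)
)
```
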
Data Enricher

Enriched data enables you to gain valuable insights about your audience segments and alter your marketing and business strategies to suit the audience.
We employ the following methodologies to enrich your data:

  • Browser User Agent Parsing: The User-Agent string is parsed to collect information such as the web browser name, operating system and device type. This information is then used to create a user profile by linking all the extracted parameters to a User ID (see the sketch after this list).
  • Referral Agent Parsing: Similarly, Referral Agent provides valuable information about the user journey, the referring site and so on.
  • IP-based Geocoding: Prism pairs the IP address to a geographical location to identify where your web visitors are coming from.
  • User Data Enrichment/Merging: Enrich the user information with demographics and other user segmentation details. Multi-platform user information is also merged to build unified user profiles.
  • Reverse geocoding: Convert the latitude and longitude coordinates to a readable address.
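
A small sketch of User-Agent based enrichment using the third-party user-agents package (one possible choice, not necessarily the parser Prism uses).

```python
# Parse a User-Agent string into browser, OS and device attributes for the user profile.
from user_agents import parse

ua_string = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
)

ua = parse(ua_string)
profile_attrs = {
    "browser": ua.browser.family,       # e.g. "Mobile Safari"
    "os": ua.os.family,                 # e.g. "iOS"
    "device_type": "mobile" if ua.is_mobile else "desktop",
}

# These attributes would then be linked to the User ID to build the profile.
print(profile_attrs)
```
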
Refresh Data

Based on the business requirements, the application loads the cleansed and enriched data into different data stores. The enriched data is moved to the primary data store, Amazon Redshift, and to the data lake as Parquet files. Aggregation is performed to minimize the processing time required for data comparison and to improve the user experience. Cleansed data is aggregated over a given time period to provide statistics such as the key metric average for a quarter. You can analyze the aggregated data to gain insights about specific key metrics. The aggregated data is stored in Amazon Aurora.
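
A hedged PySpark sketch of the aggregation step, rolling cleansed events up per quarter before they are loaded into the serving stores; paths and column names are assumptions.

```python
# Roll cleansed events up to quarterly statistics before loading into the serving stores.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prism-refresh").getOrCreate()

cleansed = spark.read.parquet("s3://lens-data-lake/silver/events/")

quarterly = (
    cleansed.withColumn("event_date", F.to_date(F.from_unixtime("ts")))
    .withColumn("year", F.year("event_date"))
    .withColumn("quarter", F.quarter("event_date"))
    .groupBy("year", "quarter", "event_name")
    .agg(F.count("*").alias("event_count"),
         F.countDistinct("user_id").alias("unique_users"))
)

# Detailed enriched data stays in the lake as Parquet; the aggregates are then
# loaded into the serving stores (Aurora/Redshift) by the platform's loaders.
quarterly.write.mode("overwrite").parquet("s3://lens-data-lake/gold/quarterly_metrics/")
```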

Data Lake

The data lake is a storage repository that holds a vast amount of raw and enriched data until it is needed. The raw, minimally processed data from Accumulo is copied to S3, which is the data lake of choice for the Lens platform. AWS SageMaker then uses this data for ML workflows.

Data Quality Within Data Lake

Based on the level of cleansing and refinement, data is classified into the following categories in the data lake using Databricks Delta Lake (an illustrative promotion sketch follows the list):
Note: Databricks Delta Lake also supports ACID transactions.

  • Bronze: Raw data as ingested; it is later cleansed and conformed into something more consumable for analytics.
  • Silver: Enriched data with minimal processing, cleansed and conformed into usable structures (sometimes referred to as minimum-viable data products).
  • Gold: Pre-aggregated data with a high degree of data integrity and quality
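
A minimal sketch of a bronze-to-silver promotion, assuming a Delta-enabled Spark session; the paths and cleansing conditions are assumptions.

```python
# Promote raw bronze events to the silver layer as a Delta table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-promotion").getOrCreate()

bronze = spark.read.format("delta").load("s3://lens-data-lake/bronze/events/")

silver = (
    bronze.filter(F.col("ip_address").isNotNull())  # basic cleansing
    .dropDuplicates(["event_id"])                   # de-duplication
)

# Delta Lake gives this write the ACID-transaction guarantees noted above.
silver.write.format("delta").mode("append").save("s3://lens-data-lake/silver/events/")
```
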
Data Source Selection

The following data sources are used by the Lens platform:

  • Elasticsearch Service: Used for real-time streaming and running analytics queries. This datastore provides search indexing and centrally stores your data for lightning-fast search, fine-tuned relevancy and powerful analytics that scale with ease.
  • Redshift: Primary datastore, used for historical analysis such as funnel analysis, cohort analysis, customer journey and so on.
  • DynamoDB (key-value database): Used for content personalization and generating recommendations. The application fetches data from Redshift and processes it through EMR, or reads it directly from the data lake. This data is then fed to AWS SageMaker to build and tune ML models.

Lens Dashboard

Lens Dashboard is a powerful in-house data visualization UI that can be customized to your business needs.
You can use the Lens Dashboard to:

  • Track and measure your key metrics
  • Run real-time analytics and batch analytics to make strategic business decisions
  • Generate actionable insights to accelerate ROI

Lens Dashboard can be powered by any of the data sources above, depending on the business question. The user's cached pages are copied to the nearest edge location and served using Amazon CloudFront. The Lens Dashboard is also designed on distributed architecture concepts to ensure scalability and fault tolerance.
Some of the custom reports available are:

  • Funnels
  • Event Segmentation
  • Search Insights
  • Path Finder
  • Sunburst
  • Metric Comparisons
  • Custom Dashboard Studio