On-premise Hadoop-based ecosystems help enterprises process varied data sets and build actionable analytics. However, as these platforms are adopted at large scale, enterprises face challenges with cluster provisioning, rising costs, governance, and performance. Analytical and sandbox environments need on-demand compute, which is difficult to provide with an on-premise Hadoop architecture because it does not support decoupling compute and storage.

Enterprises can address these problems by migrating to a stable, secure, governed cloud platform like AWS that can scale on demand, manage costs effectively, offer pay-per-use pricing, and meet compliance requirements. Analytical users can also tap into on-demand provisioning of infrastructure and leverage a large base of prebuilt library components. Hadoop migration to AWS EMR can play a key role in data landscape modernization and can help enterprises capitalize on the opportunities provided by the data economy.

Infosys and AWS have partnered to fortify the AWS practice within our Data & Analytics capabilities, along with a Hadoop migration strategy and accelerators that help enterprises accelerate their migration journey to AWS cloud.

The Infosys Data and Analytics team has built a solution, backed by a well-defined strategy and a suite of tools, to accelerate the Hadoop migration journey to AWS EMR.

We have identified different approaches for efficient migration to AWS cloud:

  • Lift/Shift: Migrating on-premise processes to AWS cloud with no changes
  • Retrofit: Migrating objects with minimal changes, such as to storage components and functions, so that they are compatible with the new environment
  • Re-architect: Redesigning the application to achieve the benefits of modernized platforms
  • Hybrid: Migrating the applications with a combination of different patterns
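
To make the Retrofit pattern more concrete, the sketch below shows one typical storage-component change: rewriting HDFS paths to S3 paths so that jobs on EMR read from decoupled storage. It is a minimal illustration under stated assumptions, not part of any Infosys accelerator; the function name `hdfs_to_s3`, the bucket name, and the optional prefix are hypothetical.

```python
from urllib.parse import urlparse


def hdfs_to_s3(hdfs_uri: str, bucket: str, prefix: str = "") -> str:
    """Map an HDFS URI to an S3 URI, keeping the directory layout.

    Hypothetical helper: `bucket` and `prefix` are assumptions that a
    migration team would choose for its own data lake.
    """
    parsed = urlparse(hdfs_uri)
    if parsed.scheme != "hdfs":
        raise ValueError(f"not an HDFS URI: {hdfs_uri}")
    # Drop the NameNode host:port; keep only the path below the root.
    parts = [p for p in (prefix.strip("/"), parsed.path.strip("/")) if p]
    return f"s3://{bucket}/" + "/".join(parts)


if __name__ == "__main__":
    # Illustrative values only.
    print(hdfs_to_s3("hdfs://namenode:8020/warehouse/sales/orders",
                     "example-data-lake", "raw"))
```

In practice the same mapping would also have to be applied to table locations in the metastore and to paths embedded in job configurations, which this sketch does not cover.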

Of the three Hadoop migration patterns (Hadoop platform on AWS cloud, Hadoop to AWS EMR, and Hadoop to next-gen services), migration to AWS EMR provides the following advantages:

  • Provisioning of clusters in minutes
  • Easy scalability of the resources
  • Single-click high availability
  • Scaling managed by EMR itself
  • Easy reconfiguration of running clusters

We have designed accelerators and processes to help migrate on-premise data lake objects and applications using any of the above patterns, followed by an implementation strategy that helps clients achieve scaled and predictable outcomes.


[Figure: diagram labels include Apache Spark, Analytical model, Pipeline (workflow)]

Fig 2: Implementation Strategy (diagram labels: Business Capability Driven Migration Approaches: By LOB (Horizontal), By Architecture Layer, By New Capabilities / Workloads; Security Controls; Methodologies and past experience to deliver solutions; Hadoop Platform on cloud; Hadoop to AWS EMR; Hadoop to Next-gen services; Data Operations Service Offerings)

Accelerate your cloud migration with Infosys Data Wizard and AWS


AWS cloud migration journey accelerated by 50% with capabilities for:

  • Inventory Metadata collection
  • Schema conversion
  • Historical Data migration & catch-up loads
  • Data Certification

The Infosys Data Wizard can help accelerate the migration process. The solution consists of the following components:

  • Assessment: A comprehensive assessment framework that identifies usage patterns of source data stores and recommends the best-suited target data store
  • Modernization Recommendation: A decision matrix that helps identify the right approach for each type of data store
  • Database Object Migration: Solution accelerators that help migrate different types of DB object inventory classes
  • Code/Pipeline Migration: Solution accelerators that help migrate different types of data processing object inventory classes
  • Consumption Migration: Solution accelerators that help migrate different types of consumption object inventory classes
  • History Data Migration: Solution accelerators that help migrate history data to the target data platform
  • Testing and Validation: A comprehensive testing solution that accelerates validation of migrated assets
  • Partner Ecosystem: Vendor partnerships that complement the migration framework and solutions
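
As an illustration of what the Testing and Validation component has to establish, the sketch below compares a source table and a migrated table by row count and by an order-independent checksum. It is an assumed, simplified check and not the Infosys Data Wizard implementation; the names `table_fingerprint` and `certify` are hypothetical, and in-memory tuples stand in for query results from the source and target platforms.

```python
import hashlib
from typing import Iterable, Tuple

MODULUS = 1 << 256  # keep the combined checksum at a fixed width


def table_fingerprint(rows: Iterable[Tuple]) -> Tuple[int, str]:
    """Return (row_count, checksum); the checksum ignores row order."""
    count = 0
    combined = 0
    for row in rows:
        # Hash each row's text form, then add the hashes modulo 2**256.
        # Addition is order-independent and, unlike XOR, does not cancel
        # out pairs of duplicate rows.
        digest = hashlib.sha256(repr(tuple(row)).encode("utf-8")).digest()
        combined = (combined + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, f"{combined:064x}"


def certify(source_rows: Iterable[Tuple], target_rows: Iterable[Tuple]) -> bool:
    """True when both tables have the same row count and checksum."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)


if __name__ == "__main__":
    source = [(1, "alice"), (2, "bob")]
    target = [(2, "bob"), (1, "alice")]  # same rows, different order
    print(certify(source, target))
```

A real check would first normalize data types (for example decimals, timestamps, and nulls) so that the same value produces the same text form on both platforms.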

We have varied approaches, tailored to client-specific needs, for migrating workflows and code so that they are compatible with tools across different platforms.

Migration from Hadoop to AWS can be enabled in the following ways:

  • Hadoop platform on AWS cloud
  • Hadoop to AWS EMR
  • Hadoop to Next-gen services (Native+3rd party)

Challenges & Solutions

  • Establish a Value Realization Framework at the beginning, then capture and monitor it throughout the program
  • Leverage capabilities offered by the AWS platform, such as:
    • AWS managed services to simplify administration and reduce its cost
    • Spot and on-demand storage and processing clusters (ephemeral model) in place of persistent clusters
    • Storage/compute savings plans, reviewed as a regular task, to save cost
    • AWS EMR, which provides seamless decoupling of storage and compute, along with cluster capabilities such as high availability and transient clusters for cost management

  • Ensure the right AWS cloud migration approach is followed (Lift-n-Shift, Retrofit, Re-architect, etc.) by considering the benefits of the target platform tools. Depending on the workload, a combination of these approaches can be used instead of just one.
  • Start small: build a test sandbox and run POCs with smaller, non-critical data and associated jobs, and tune target product configurations
  • Identify dataflow patterns (pattern, tool, business area) and build foundational components for Data Ingestion, Data Engineering, Common Data Libraries, and Data Governance (Quality, Metadata, Lineage) in the target tools
  • Leverage migration tools from the target product vendor or its partners
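
The ephemeral cluster model mentioned in the list above can be sketched as a cluster request that mixes on-demand and Spot capacity and shuts itself down when its work is done. The example only builds the keyword arguments for the EMR RunJobFlow API (exposed in boto3 as `run_job_flow`) and does not call AWS. The helper name `build_ephemeral_cluster_request` and every concrete value (release label, instance types, node counts, log location) are assumptions for illustration.

```python
from typing import Any, Dict, List


def build_ephemeral_cluster_request(
    name: str,
    log_uri: str,
    core_nodes: int = 2,
    spot_task_nodes: int = 4,
) -> Dict[str, Any]:
    """Build keyword arguments for the EMR RunJobFlow API.

    All concrete values below are illustrative assumptions.
    """
    groups: List[Dict[str, Any]] = [
        {"Name": "primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": "m5.xlarge", "InstanceCount": core_nodes},
    ]
    if spot_task_nodes > 0:
        # Task nodes hold no HDFS data, so losing a Spot node costs
        # compute time but not data.
        groups.append(
            {"Name": "task-spot", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": "m5.xlarge", "InstanceCount": spot_task_nodes})
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.15.0",
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": groups,
            # Ephemeral model: terminate the cluster once all steps finish.
            "KeepJobFlowAliveWhenNoStepsRemain": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    request = build_ephemeral_cluster_request(
        "nightly-etl", "s3://example-bucket/emr-logs/")
    print(request["Instances"]["KeepJobFlowAliveWhenNoStepsRemain"])
```

With credentials configured, the dictionary could be passed as `boto3.client("emr").run_job_flow(**request)`. Because the data stays in decoupled storage rather than on the cluster, terminating the cluster after each run does not remove the data.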

  • Construct the right migration team with a clear RACI (Responsible, Accountable, Consulted and Informed) matrix
  • De-risk the program aptly
  • Make a comprehensive program plan cutting across Governance, Hardware, Hadoop Software, Architecture, Application (Data, Objects, Code, Workflow, Consumption), Testing & Deployment
  • Split the data domains by timestamp, business line, and workload, and convert them into an apt MVP (Minimum Viable Product) sprint plan
  • People churn is inevitable, so treat knowledge management and issue management as critical activities
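
One way to picture the split of data domains into an MVP sprint plan is sketched below: datasets are grouped by business line, year, and workload, and the groups are then packed into sprints of a fixed size. This is a hypothetical planning aid rather than a prescribed method; the `Dataset` fields, the `plan_sprints` function, and the sample values are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple

GroupKey = Tuple[str, int, str]  # (business line, year, workload)


@dataclass(frozen=True)
class Dataset:
    name: str
    business_line: str
    partition_date: date
    workload: str  # e.g. "batch" or "adhoc" (illustrative values)


def plan_sprints(datasets: List[Dataset],
                 groups_per_sprint: int) -> List[Dict[GroupKey, List[str]]]:
    """Group datasets by (business line, year, workload), then pack groups into sprints."""
    if groups_per_sprint < 1:
        raise ValueError("groups_per_sprint must be at least 1")
    groups: Dict[GroupKey, List[str]] = defaultdict(list)
    for ds in datasets:
        key = (ds.business_line, ds.partition_date.year, ds.workload)
        groups[key].append(ds.name)
    ordered = sorted(groups)  # deterministic order: business line, year, workload
    return [
        {key: groups[key] for key in ordered[i:i + groups_per_sprint]}
        for i in range(0, len(ordered), groups_per_sprint)
    ]


if __name__ == "__main__":
    inventory = [
        Dataset("orders", "retail", date(2021, 3, 1), "batch"),
        Dataset("returns", "retail", date(2021, 7, 1), "batch"),
        Dataset("claims", "insurance", date(2022, 1, 1), "batch"),
    ]
    for number, sprint in enumerate(plan_sprints(inventory, 1), start=1):
        print(number, sprint)
```

A real plan would also weigh data volume, dependencies between datasets, and business priority when ordering the groups, which this sketch leaves out.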

  • Security (authorization/access) and migration monitoring (auditing, logging) should be considered from the beginning
  • Validate each target technology component for security compliance (network, firewall, software, applications, encryption at rest and in motion) with a smaller set of data
  • Run security checks after each major task, before the migration is released for production