Building an Enterprise Document Lake on Cloud

This paper discusses the concept of an Enterprise Document Lake as a single source of truth for all digital content, and the benefits of extending traditional document management systems to cover the entire organization. It includes technology and security recommendations that could benefit organizations embarking on a similar journey.

Insights

  • Over the last decade, organizations have recognized the need for, and benefits of, managing digital content in the same manner as data. However, they were constrained by factors such as isolated repositories, a lack of capabilities for intelligent processing, and the absence of scalable platforms to handle very large volumes.
  • An Enterprise Document Lake helps organizations build enterprise-wide document management capabilities without worrying about volume or performance.
  • An Enterprise Document Lake covers the entire spectrum of an organization and provides an elastic platform that absorbs any amount of digital content, processes it, and delivers it to any number of systems or users in real time, with no performance degradation.

1. Enterprise Document Lake on Cloud

An Enterprise Document Lake provides a scalable and secure platform on public, private, or hybrid cloud that allows enterprises to ingest any digital content from any system at any speed. A document lake makes it possible to store voluminous content cost-efficiently, removes the need for repository silos, and delivers content to business applications anywhere, anytime. It offers universal accessibility and smooth integration with business applications and collaboration platforms, which can optimize business workflows and improve overall operational efficiency. An Enterprise Document Lake on Cloud encourages innovation by leveraging Cloud services for building new capabilities such as automation, analytics, and AI/ML.

Document repositories built as silos are a problem for almost every organization. A document lake, as the single repository for all of the organization's digital content, helps break down silos in an organization. It makes information available and accessible to everyone. A document lake is highly flexible in terms of both the capabilities offered and the technologies used, allowing organizations to cater to an entire spectrum of content-related or content-driven requirements. With inexpensive storage solutions, scaling up document lakes is far easier on Cloud, which also means very little upfront development time during scale-up.

2. Evolution of Document Management Systems

Document management systems have grown from small silos to enterprise lakes, growing a million times in size, from gigabytes to petabytes. The volume of digital content managed by current document management systems introduces many challenges and, with them, opportunities.

2.1 Traditional Document Management systems

Document management systems started as imaging software that could capture digital content and make it available to business applications. Over time, such systems evolved to incorporate search and retrieval capabilities. They then began to provide full lifecycle management of content and incorporated workflow tools to automate document-centric business processes. Thus, document management systems evolved into enterprise content management (ECM) platforms.

The Content Services Platform was the next generation of document management solution, supporting the digital transformation of enterprises. A Content Services Platform incorporates the capabilities of traditional ECM systems and complements them with additional capabilities such as analytics, automation, and artificial intelligence to improve overall efficiency.

The main difference between Content Services Platforms (CSP) and Enterprise Content Management (ECM) systems is that the latter focus on the storage of digital content, while Content Services Platforms focus on managing the content efficiently. Content Services Platforms connect all information sources and allow users to access content stored in older ECM systems. They focus on delivering information to the right recipients (users, legacy applications, etc.) through all digital channels.

Traditional ECM systems led to the deployment of numerous independent, disconnected repositories, resulting in a complex mix of aging platforms, technologies, products, and isolated solutions. The monolithic framework used by traditional ECM platforms resists change, which limits functionality, complicates integration with modern platforms, makes development cycles for new functionality expensive, and increases maintenance overhead.

3. Enterprise Document Lake

An Enterprise Document Lake is essentially a Content Services Platform hosted on cloud with an underlying petabyte scale repository catering to the document management needs of the entire spectrum of an organization. It combines Content Services, Content Analytics, Cloud hosting and Cognitive Services.

3.1 Content Services

A Content Services solution typically provides Intelligent Capture, a Content Repository, Security and Privacy controls, Compliance, Collaboration, and Connectors and Open APIs for accessing and managing content. These functionalities evolved over time and form the core of an Enterprise Document Lake. A key differentiator in a document lake is that the core content services are designed and built to scale to very large volumes in the future, with minimal technology or architectural changes. Cloud platforms act as enablers for building the lake.

3.1.1 Solution Building blocks for Core Content Services

With increasing cloud adoption and containerization of traditional Content Services products, either a proven Commercial Off-the-Shelf (COTS) or a bespoke application can be used as solution building blocks of an Enterprise Document Lake.

The following aspects need to be considered before finalizing the approach for building an Enterprise Document Lake:

  • Delivery Speed & Cost
  • Feature Completeness
  • Scaling & flexibility
  • Long-term maintenance
  • Customization & Control
  • Ease of Use

3.1.2 Commercial products

With increasing cloud adoption and containerization of traditional Content Services products, a Document Lake can be built using commercial off-the-shelf (COTS) products. One of the major benefits of this approach is that organizations get a mature product providing all typical content management and capture capabilities. COTS document management systems trade fit for price. They will not match the requirements or processes of an organization as closely as bespoke software, but they may be cheaper to start with and faster to implement.

Leading Content Services products that can serve as the base for building a document lake include:

  • Alfresco
  • Hyland OnBase
  • IBM FileNet Cloud Pak
  • Laserfiche
  • Newgen OmniDocs
  • Nuxeo
  • OpenText Extended ECM
  • OpenText Documentum

Although COTS software appears cheap initially, in the long run it may be more expensive due to yearly license fees and maintenance costs that can be significant. There could be a time and opportunity cost to working with software which does not fit your processes well.

3.1.3 Custom document management solutions

Bespoke enterprise software development requires major business investment in terms of cost and time. Bespoke software requires an initial investment to develop, which is often many times the initial cost of COTS software.

Low-code platforms can make bespoke software development quicker, easier, and more cost-effective. A cloud-based low-code platform provides flexibility and scalability in building document management and process automation applications. Low-code platforms can therefore reduce the higher up-front cost traditionally associated with bespoke development. Faster implementation, lower costs, larger developer resource pools, and total in-house control over the product can make low code a strong choice in the long run.

3.1.4 MACH Technology Architecture for custom solutions

MACH (Microservices-based, API-first, Cloud-native, and Headless) architecture is a set of design principles used for building flexible and scalable cloud-based applications.

Advantages

  • Flexibility: Rather than limiting to a single technology stack as in a monolithic application, the composable architecture gives the flexibility to choose the best technology stack for each service.
  • Scalability: Each component runs independently and can be scaled independently. This allows enterprises to right-size workloads on cloud.
  • Faster time-to-market: Services can be developed, tested, and run independently. Each microservice is small and focused on a single functionality, making it easier for developers to understand and modify the service, thus reducing update time.
  • Personalized user experience: Headless approach in MACH allows decoupling of services and user interface layers, allowing development of best-fit custom user interfaces for specific devices or audiences, delivering more personalized experiences.
  • Lower total cost of ownership over long term through optimized workload utilization, faster development of features, and lower maintenance costs.

The IT team's capability to handle the complexities of a MACH architecture should be considered. Developing and maintaining a solution using MACH requires a high degree of technical expertise, whether in-house or outsourced. An enterprise must carefully assess the pros and cons by comparing the MACH approach with its existing monolithic systems. MACH solutions demand a more sophisticated and mature IT structure.

3.2 Content Analytics

Content analytics is about deriving insights and metrics from data embedded in large volumes of content. Documents hold a large amount of unstructured data that can provide insights into a business, its customers, and its value chain. Once extracted, this data describes the content in concrete terms and can be used to generate actionable business insights using AI-driven analytics tools.

Typical use cases for Content Analytics in Document Lake

  • Content Analytics can help in optimizing document-intensive processes through intelligent classification and routing of content items.
  • Enterprise search and contextual discovery are also part of Content Analytics. They act as a gateway for search into enterprise content repositories, allowing data – structured and unstructured – to be enriched, searched, discovered and analyzed.
  • AI-enabled analytics and visualization applications can improve insight into content across massive enterprise repositories.
  • Identify potential fraud and risk by analyzing information contained in documents.

Intelligent enterprise search is part of Content Analytics. It uses AI technologies, such as Natural Language Processing (NLP), semantic search, and Machine Learning (ML), to automatically extract relevant information from unstructured data and to provide an engaging, relevant search experience.

An Enterprise Document Lake combined with Cloud-based analytics services could provide real-time reporting and analytics capabilities, giving insights into content and content-centric processes through interactive dashboards. This helps in better, faster decision-making. Combining an analytics engine with audit trails could help in building early warning systems and investigating incidents.

Content Analytics could help users find relevant digital content using Search by Context. Users can search for documents through a “what” approach rather than a traditional “where” approach, which offers more relevant and accurate results. Search by Context can leverage metadata extracted from documents using technologies such as NLP and AI, giving a boost to cognitive search capabilities.
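The difference between a "what" search and a "where" search can be illustrated with a minimal sketch. It assumes documents carry extracted text and metadata, and it ranks them by plain keyword overlap; this is a stand-in for illustration only, not the semantic or NLP-based search described above. The `Document` class, the `search_by_context` function, and the sample paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    path: str   # the "where": location inside a repository
    text: str   # extracted text content
    metadata: dict = field(default_factory=dict)  # e.g. values extracted during capture


def search_by_context(documents, query):
    """Rank documents by how many query terms appear in their text or metadata."""
    terms = set(query.lower().split())
    scored = []
    for doc in documents:
        haystack = doc.text.lower().split()
        haystack += [str(value).lower() for value in doc.metadata.values()]
        score = len(terms & set(haystack))
        if score > 0:
            scored.append((score, doc.path))
    # Highest score first; ties are broken by path for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored]


docs = [
    Document("/claims/2023/scan_0001.pdf", "motor insurance claim form",
             {"doc_type": "claim", "customer": "acme"}),
    Document("/hr/policies/leave.pdf", "annual leave policy",
             {"doc_type": "policy"}),
]

# The user describes what they need, not which folder it lives in.
print(search_by_context(docs, "acme claim"))
```

The caller never supplies a folder path; the path is only returned as the result of matching on content and metadata.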

3.3 Cognitive Services

A major benefit of hosting document lake on cloud is the ease with which Artificial Intelligence capabilities can be integrated with content related services. When applied along with big data systems, AI & ML offer tremendous possibilities in ‘understanding’ digital content and automating content related business processes.

A major challenge faced by traditional platforms during large-scale content ingestion is the lack of cognitive capabilities to ‘understand’ the content. This reduces accuracy and automation during content ingestion. Understanding includes identifying the document, extracting data, validating it, and interpreting the meaning of the extracted data.

Machine learning can also help in analyzing large volumes of data, including text, images, and video, in a scalable and cost-effective way. It is used for a wide range of tasks, including sentiment analysis, text summarization, and image analysis. From increased productivity to a reduction in manual errors, applying AI/ML technologies to content-centric platforms helps increase process automation.

Cognitive services can be used in a document lake to:

  • Extract information using state-of-the-art natural language processing capabilities.
  • Use multilingual models that can be trained in one language and used for multiple other languages.
  • Detect personally identifying information and redact sensitive information in documents.
  • Perform text summarization that extracts key sentences or condenses the text to capture all the important or relevant information within the document.
  • Classify documents into different types relevant for the organization using custom AI models.
  • Respond to customer queries using generative AI.
  • Detect the original language of text and classify documents using data extracted.
  • Conduct sentiment analysis by mining text for information about positive or negative sentiments contained in documents or other communications.
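As an illustration of the detect-and-redact item in the list above, the sketch below uses two regular expressions as a simplistic stand-in for a cloud AI service. The pattern set, labels, and `redact` function are assumptions made for this example; real PII detection needs far broader coverage and would normally rely on a trained model.

```python
import re

# Illustrative patterns only (assumption): an email address and a
# phone number written as NNN-NNN-NNNN.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text):
    """Replace each matched pattern with a [LABEL] placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Contact jane.doe@example.com or 555-123-4567."))
# prints: Contact [EMAIL] or [PHONE].
```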

Cloud platforms provide AI tools for natural language processing, document comprehension, image recognition, text-to-speech, integration of pre-trained Large Language Models, and process automation. These services can be brought into your virtual network on cloud, integrated with business applications, and trained using your data, thereby providing contextualized capabilities. Integrating cognitive capabilities on cloud typically requires less development effort than building such capabilities on-premise. Because your data stays within your virtual network, this approach also helps protect data confidentiality.

Organizations can also leverage AI and ML services on cloud during large-scale digitization of legacy physical documents for extracting metadata, improving the accuracy of extracted data, and validating data. These services can also help in automating compliance and governance. Organizations can build automated systems for enforcing regulatory requirements by automatically handling data in accordance with specific rules and compliance requirements. This is especially important in sectors like Government, Finance and Healthcare.

3.4 Cloud Hosting

Cloud platforms act as enablers for building Enterprise Document Lakes by providing Compute, Storage, Security and Cognitive services.

  • Compute – Cloud platforms provide virtual machines, containers, serverless computing along with auto-scaling and high availability capabilities for building a scalable document lake.
  • Security – Cloud platforms provide inherent infrastructure security for all solutions deployed. Along with best practices such as guard rails, security hubs and intelligent monitoring tools, cloud platforms make it easier to secure the document lake.
  • Cognitive services – Cloud platforms provide a large set of pre-trained AI services such as NLP, Chatbots, text & image analysis, Large Language models etc. along with development tools for integrating with business applications. This helps in faster integration of AI capabilities into Document Lake.
  • Storage – Storage services such as Amazon S3 and Azure Blob Storage are designed to support high-volume, high-performance systems such as a document lake.

Using a Cloud platform for building document lakes simplifies infrastructure management by leveraging the managed services offered by cloud service providers. With its intrinsic scalability and highly available services, Cloud is a strong platform for hosting document lakes.

Organizations may choose to host document lakes in a public cloud, a private cloud, or a hybrid model. Cloud service providers offer different levels of capability, and total cost of ownership may also vary. At the same time, simply rehosting or moving all the repositories to cloud will not remove the complexity associated with managing and delivering content across the organization.

Advantages of Cloud hosting:

  • Organizations and IT teams can focus on generating business value instead of focusing on managing the complexities associated with infrastructure and data centers.
  • Lower total cost of ownership by leveraging options such as reserved instances, economies of scale, and serverless computing.
  • Use managed auto scaling services for optimizing workload utilization.
  • Cloud services are flexible and agile, offering on-demand infrastructure provisioning.
  • Organizations can re-design, re-engineer, and re-architect business applications and supporting solutions more easily in cloud.
  • Cloud-based document lakes make technology adoption faster by using out-of-the-box development, integration, deployment and monitoring tools.
  • Reliability and availability increase on cloud. Cloud service providers use infrastructure management best practices and self-learning monitoring and recovery tools to support high availability and disaster recovery.
  • Cloud providers offer serverless services for all layers of the stack: compute, integration, and storage. Adopting a serverless architecture helps in optimizing cost during low-usage periods by automatically scaling up and down.

3.5 Key features of Enterprise Document Lakes

Enterprise Document Lakes aim to increase operational efficiency and deliver enhanced capabilities that allow:

  • Seamless content and information usage without the need to replicate it across repositories by consolidating the content in a single universal repository for the organization.
  • Diverse content type support in an enhanced multi-model and polyglot architecture by following any format, any size and any metadata methodology.
  • Governance and fine-grained data security that leverages a zero-trust security model, by implementing a rule & policy-based access and compliance system powered by rule engines.
  • Ability to fully decouple storage and compute resources and to consume only the resources needed at any point in time.
  • Better cloud economics with autoscaling that adjusts cloud infrastructure resources to match actual demand.
  • Modularity so that service use is use-case driven.
  • Interoperability with any system in the organization through Open APIs.
  • Technology-agnostic base architecture that allows organizations to choose from a variety of programming languages, tools, commercial products and hosting platforms that align with the overall Enterprise Architecture.
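The rule- and policy-based access model mentioned in the list above can be sketched as a small rule evaluator. This is a minimal illustration under stated assumptions, not a product design: the `Rule` fields, role names, and document types are hypothetical, and the evaluation order (default deny, explicit deny overrides allow) is one common convention chosen here to reflect a zero-trust stance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str      # role the rule applies to
    doc_type: str  # document type the rule applies to, "*" for any
    action: str    # e.g. "read" or "write"
    effect: str    # "allow" or "deny"


def is_allowed(rules, role, doc_type, action):
    """Default deny; an explicit deny overrides any matching allow."""
    matches = [
        rule for rule in rules
        if rule.role == role
        and rule.action == action
        and rule.doc_type in (doc_type, "*")
    ]
    if any(rule.effect == "deny" for rule in matches):
        return False
    return any(rule.effect == "allow" for rule in matches)


RULES = [
    Rule("claims_officer", "claim", "read", "allow"),
    Rule("claims_officer", "claim", "write", "allow"),
    Rule("auditor", "*", "read", "allow"),
    Rule("auditor", "medical_record", "read", "deny"),
]

print(is_allowed(RULES, "auditor", "claim", "read"))           # True
print(is_allowed(RULES, "auditor", "medical_record", "read"))  # False
print(is_allowed(RULES, "intern", "claim", "read"))            # False
```

Keeping rules as data rather than code means policies can differ per division while the evaluation logic stays the same.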

3.6 Benefits of Document Lakes on Cloud

  • Minimize total cost of ownership (TCO) compared to on-premise systems.
    Total cost of procurement and operations for a fixed period can be minimized on cloud by leveraging economy of scale and by optimizing cloud pricing models.
  • Simplify and standardize document management function.
    An Enterprise Document Lake provides the same capabilities and services across the organization. Individual divisions need not invest separately in building document capabilities, which reduces cost and speeds up implementation. It also becomes easier for organizations to enforce document management policies and standards across all divisions.
  • Build document centric capabilities supporting business functions.
    Business applications will have a standard set of document management services readily available which can be integrated on the fly.
  • Improve performance and efficiency.
    By implementing best practices at enterprise level, document lakes ensure performance and efficiency improvements are realized across all business applications.
  • Leverage artificial intelligence and machine learning capabilities offered by cloud services to improve automation.
  • Enable faster analytics with associated data lakes.
    Documents hold a huge amount of information that can be extracted using intelligent capture solutions leveraging OCR and AI/ML. Extracted data can form part of a Data Lake, enabling organizations to analyze it on demand.
  • Improve security, compliance and governance by enforcing policies and processes at enterprise level.

3.7 Typical use cases for an Enterprise Document Lake

Each use case below lists the applicable industries, the capabilities involved, and the benefits.

Use case: Digitizing paper files. Even with rapid digitalization, many organizations still retain large volumes of paper or hard copy files. One reason for this is the absence of BIG document repositories which can store and manage very large volumes of digital content securely, and without affecting overall performance.
  • Industry: Banking, Insurance, Healthcare, Manufacturing
  • Capabilities:
    • BIG content repositories.
    • Cognitive services for document capture.
  • Benefits:
    • Provides cost effective petabyte scale repositories.
    • Efficient processing of multilingual documents, data extraction, identification and classification of documents using cognitive services during digitization.
    • Helps to aggregate documents from disparate sources, make them legible, extract data with precision while continuously improving extraction accuracy.

Use case: Managing BIG content. Not every document and not every piece of information should be accessible to everyone. Controlling access based on roles, document types, and other characteristics on a system level makes it easier for organizations and divisions to share documents securely.
  • Industry: Banking, Insurance, Healthcare, Manufacturing
  • Capabilities:
    • Rule engine based robust access control capability implemented as micro services.
  • Benefits:
    • Ensures data integrity using robust access control system.
    • Content encryption at every stage, at rest and in transit, to mitigate data theft.
    • Improved data confidentiality and compliance using cloud AI based redaction services to remove or black out sensitive information from content.

Use case: Integration with enterprise applications. Document Lake provides seamless integration using standard Open APIs.
  • Industry: Banking, Insurance, Healthcare, Manufacturing
  • Capabilities:
    • Seamless integration of digital content with all business applications.
  • Benefits:
    • Acts as a single source of truth for all digital content in the organization.
    • Integrates all producer, consumer and collaboration applications of the organization.

Use case: Strengthen compliance. Leveraging compliance services provided by cloud providers and automating processes like retention and disposition schedules, a document lake can help ensure content governance for regulatory compliance.
  • Industry: Banking, Insurance, Healthcare
  • Capabilities:
    • AI driven governance and compliance.
  • Benefits:
    • Minimizes risk and maximizes efficiency.
    • Document lakes can help effectively streamline Content Lifecycle Management across the organization.
    • Helps in enabling Hybrid Records Management, thereby improving overall regulatory compliance.

Use case: Secure anywhere anytime access to documents.
  • Industry: Banking, Insurance, Healthcare
  • Capabilities:
    • Document delivery through multiple channels.
  • Benefits:
    • Provides a streamlined, organized content storage and delivery system making documents easily accessible and improving overall user experience.
    • Facilitates secure anytime-anywhere information access and real-time collaboration.
    • Provides smart search and intelligent recommendations.

4. Optimizing Enterprise Document Lake implementations

4.1 Cost Optimization

Cloud platforms allow organizations to convert capital expenditure into operational expenditure. When hosted in on-premise data centers, Enterprise Document Lakes require a large initial investment to set up infrastructure. Cloud hosting removes the need for large capital investment and allows organizations to pay only for what they use. Cloud platforms usually provide volume and reserved capacity discounts, which can considerably reduce the total cost of ownership over the long term.

Total cost of ownership in Cloud can be further optimized by moving to auto scaling and serverless computing.

  • Leverage auto-scaling of cloud workload to optimize the utilization and cost efficiencies when consuming services. When demand or traffic drops, auto-scaling will automatically terminate or deallocate excess resource capacity to reduce operational cost.
  • Serverless and event-driven architecture on Cloud can be used to design, build and run applications. Public cloud platforms offer various managed serverless services that inherently optimize resource utilization and provide automated scale-in, scale-out of workload. With Serverless applications, resource utilization and cost are automatically optimized avoiding over-provisioning.
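The scale-in and scale-out behavior described above can be illustrated with a simple proportional rule: resize the fleet so that the observed per-instance metric approaches a target value, within fixed bounds. This is a sketch under assumptions; the function name, parameters, and figures are hypothetical, and managed auto-scaling services apply their own policies, cooldowns, and safeguards.

```python
import math


def desired_capacity(current, metric_value, target_value, min_cap, max_cap):
    """Proportional scaling: size the fleet so the metric approaches the target."""
    if current <= 0 or target_value <= 0:
        raise ValueError("current and target_value must be positive")
    raw = math.ceil(current * metric_value / target_value)
    # Clamp to the configured bounds so the fleet never drops below the
    # minimum or grows beyond the maximum.
    return max(min_cap, min(max_cap, raw))


# 4 instances at 90% average CPU with a 60% target: scale out to 6.
print(desired_capacity(4, 90, 60, 2, 10))  # 6
# 4 instances at 15% average CPU: scale in, but not below the minimum of 2.
print(desired_capacity(4, 15, 60, 2, 10))  # 2
```

The second call shows where the cost saving comes from: when demand drops, excess capacity is released down to the configured floor.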

4.1.1 Workload optimization

Factor in cost when selecting every component of your workload. This includes using application-level and managed services, serverless computing, containers, or event-driven architecture to reduce overall cost. Minimize license costs by using open-source software or other alternatives that do not carry license fees.

Managed services from cloud providers remove the need for the customer to manage the underlying resources while providing functions such as running code, queuing, and message delivery. They also scale in performance and cost in line with usage, allowing efficient cost allocation and attribution.

Consider using event-driven architecture (EDA) with serverless services. Event-driven architectures are push-based: processing happens on demand as each event occurs, so no computing resources are spent on continuous polling. This means reduced consumption of network bandwidth, reduced CPU utilization, reduced idle fleet capacity, and fewer SSL/TLS handshakes.
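The push-based model can be sketched with a minimal in-process publish/subscribe bus. It is only an illustration: in a cloud deployment the bus would be a managed eventing or messaging service and the handler a serverless function. The `EventBus` class, the `document.uploaded` event name, and the handler are hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish/subscribe bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Handlers run only when an event arrives; nothing polls in between.
        return [handler(payload) for handler in self._handlers[event_type]]


def classify_document(payload):
    # Hypothetical handler standing in for a serverless function.
    return f"classified {payload['key']}"


bus = EventBus()
bus.subscribe("document.uploaded", classify_document)
print(bus.publish("document.uploaded", {"key": "invoices/0001.pdf"}))
# prints: ['classified invoices/0001.pdf']
```

No code in this sketch loops waiting for work, which is the property that removes idle polling cost.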

4.1.2 Optimizing storage cost in Cloud

This section discusses how to optimize storage costs without compromising on performance, using AWS as an example. Comparable features are available from other major cloud providers. By choosing the right Amazon S3 storage class and using the built-in automation of S3 Intelligent-Tiering, the total cost of storage for a given period can be minimized.

Amazon S3 Intelligent-Tiering

Organizations can use Amazon S3 Intelligent-Tiering to minimize storage costs by automatically moving content between access tiers when access patterns change.

S3 Intelligent-Tiering automatically moves objects between three access tiers:

  • A tier built for objects requiring frequent access. This ensures acceptable retrieval times for mission critical applications.
  • A lower-cost tier designed for infrequent access.
  • A very-low-cost content archiving tier suitable for rarely accessed objects. Archive tiers supporting instant access are also available.

S3 Intelligent-Tiering monitors object access patterns and automatically moves objects between storage tiers. Less frequently accessed objects are moved to lower-cost access tiers. The automatic archiving capabilities of S3 Intelligent-Tiering can also be used by applications that access content asynchronously. Cloud object storage services such as Amazon S3 are designed for very high durability, up to 99.999999999% (eleven nines) of objects over a given year.
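To put the durability figure in perspective, the short calculation below treats 99.999999999% as an annual per-object durability, so each object has a 1 in 100,000,000,000 chance of being lost in a year. This is a simplified reading used for illustration; the object count is an arbitrary example.

```python
from fractions import Fraction

# 99.999999999% annual durability, read as a per-object annual loss
# probability of 1e-11 (simplifying assumption).
ANNUAL_LOSS_PROBABILITY = 1 - Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count):
    """Expected number of objects lost per year, as an exact fraction."""
    return object_count * ANNUAL_LOSS_PROBABILITY


losses = expected_annual_losses(10_000_000)
print(losses)      # 1/10000
print(1 / losses)  # 10000
```

Under this reading, a repository of ten million documents would expect to lose a single object about once every 10,000 years.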

4.2 Backup strategy

Data Backup and protection from threats like Ransomware attacks

Losing access to a document store is the worst nightmare for any organization. Ransomware attacks target organizations, encrypting data and preventing legitimate users from accessing it. The best ransomware backup strategy is to have critical data, compute systems, machine images and resources backed up to an alternate location or region, preferably on a different account.

The following points need to be considered while defining a backup strategy for a document lake on cloud:

  • Geographical locations available, and proximity to the organization's primary datacenter.
  • On-site vs cloud hosted storage options.
  • Network bandwidth available and latency for large scale data movement and system recovery.
  • Criticality of the data, systems to be backed up and legal or regulatory requirements.
  • Frequency of data backups and recovery point objectives.
  • Minimum redundancy required for defined recovery point objectives.
  • Data encryption to protect data.
  • Provisions for secure access.
  • Periodic DR and BCP procedures.

The cost of backup storage resources is a major factor influencing ransomware backup strategies. Cloud storage can save money by reducing the need for physical footprints, but it needs to be designed to optimize distribution across locations, storage utilization and cost. From a threat perspective, more locations mean more access points for threat actors.
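The relationship between backup frequency and recovery point objectives listed above can be checked with simple arithmetic. The sketch assumes that each backup captures the state at its start time and becomes usable only once it completes, so the worst-case loss window is one full interval plus the backup duration. The function name and the sample figures are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval, backup_duration, rpo):
    """Return True if the worst-case data loss window fits within the RPO.

    Assumption: a failure just before a backup completes forces recovery
    from the previous backup, taken one interval earlier.
    """
    worst_case_loss = backup_interval + backup_duration
    return worst_case_loss <= rpo


# Backups every 4 hours taking 30 minutes, against a 6-hour RPO.
print(meets_rpo(timedelta(hours=4), timedelta(minutes=30), timedelta(hours=6)))  # True
# Daily backups taking 1 hour cannot satisfy a 12-hour RPO.
print(meets_rpo(timedelta(hours=24), timedelta(hours=1), timedelta(hours=12)))   # False
```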

4.3 Retention Management

Retention policies should be enforced for all content stored in the Document Lake. Policies may vary between divisions, but they should be clearly defined and enforced using automated tools.

Use retention policies and lifecycle policies to reduce storage costs for the identified resources. Define retention policies on supported resources to handle object deletion per the organization's policies. Identify and delete unnecessary or orphaned resources and objects that are no longer required.

Example in AWS:

  • Use AWS Data Lifecycle Manager lifecycle policies to automate deletion of Amazon EBS snapshots and Amazon EBS-backed AMIs.
  • Use a lifecycle configuration on an S3 bucket to define the actions Amazon S3 takes during an object's lifecycle, including deletion at the end of that lifecycle, based on your business requirements.
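As a sketch of the S3 lifecycle configuration mentioned above, the snippet below builds one rule that archives objects under a prefix and later expires them. The prefix, the day counts, the choice of the GLACIER storage class, and the bucket name in the comment are assumptions for illustration; actual retention periods must come from the organization's policies.

```python
import json


def build_lifecycle_configuration(prefix, archive_after_days, delete_after_days):
    """Build an S3 lifecycle configuration with one archive-then-expire rule."""
    if delete_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they expire")
    return {
        "Rules": [
            {
                "ID": f"retention-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": delete_after_days},
            }
        ]
    }


# Hypothetical policy: archive claims after 1 year, delete after 7 years.
config = build_lifecycle_configuration("claims/", 365, 2555)
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, the dictionary could be
# applied as follows (bucket name is hypothetical):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-document-lake",
#       LifecycleConfiguration=config,
#   )
```

Building the configuration as data keeps it reviewable and testable before it is applied to any bucket.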

4.4 Security

Security architecture and processes are very important for Document Lakes on Cloud. Security on Cloud is as strong as we make it. Public cloud is always connected to the internet, so moving to a cloud platform introduces additional risks compared to on-premise data centers. Existing security tactics alone are not sufficient. By understanding the unique perspectives and challenges of cloud security and applying security best practices, organizations can make cloud platforms as secure as on-premise data centers.

4.4.1 Security Challenges in Cloud

  • Shared Responsibility Model
    In a cloud environment, the responsibility for maintaining the environment is shared between the Cloud Service Provider (CSP) and the customer. The provider is responsible for setting up and securing the underlying infrastructure, maintaining and updating hardware, and securely disposing of storage devices. The customer using the cloud infrastructure is responsible for securing the services provisioned and deployed in cloud environments, including data and business applications. The customer is also responsible for patching operating systems, following best practices, configuring the cloud services provisioned, and controlling access to workloads and applications.
  • Workload Asset Management in Cloud
    One unique benefit of a cloud platform is the ease with which new infrastructure and services can be provisioned. It is easier to add or remove services on cloud, but unless proper monitoring and the right guardrails are implemented, this ease allows misconfigurations that make the entire workload vulnerable.
    Cloud environments may change very fast with technologies like autoscaling, managed services (e.g., Amazon RDS) and serverless computing. This means assets in a cloud may appear and disappear frequently. Traditional security measures such as scheduled vulnerability assessment and penetration testing (VAPT) scans are no longer sufficient, because a vulnerable asset may exist only for a short duration and so escape detection in a scheduled scan, yet still be exploited by an attacker in that time. Ease of deployment and frequent changes make it difficult for cloud security teams to maintain a holistic picture of the infrastructure they are responsible for.

4.4.2 Best Practices for Securing Document Lake on Cloud

  • Baseline security configuration for Document Lake
    The baseline should cover the account and the entire workload used for building the core services and storage. It should clearly describe every aspect, from configuration of assets to planning the response to security incidents. The baseline should be applied to every environment, including production, test, and pre-production. Reevaluate the baseline at periodic intervals to incorporate new threat information and changes in the environment.
  • Enforce baseline
    The security baseline should be enforced. A monitoring solution should be used to detect workloads that do not comply with the baseline, either because they were misconfigured or because they were modified after deployment. Automated tools and processes should be set up to fix misconfigurations as soon as they are detected. This reduces the human intervention required to keep the environment safe.
  • Limit access
    Access to the Document Lake workload should be granted on a need basis, and users and systems should be given only the minimum privileges required.
    • Discourage use of the root user. Use the root user only in rare circumstances where it is absolutely required. Delete access keys associated with the root user.
    • Use Federated SSO whenever possible.
    • Policies and permissions should be aligned to roles and not to individual users.
    • Enforce a strong password policy.
    • Mandate MFA for all privileged operations.
    • Delete inactive users and unused credentials.
    • Use access keys only in unavoidable scenarios and rotate keys regularly.
  • Watch for vulnerabilities
    Unpatched vulnerabilities present a threat. In cloud platforms, assets appear and disappear dynamically. Organizations should use vulnerability management tools with dynamic asset discovery to automatically detect new instances, scan them for vulnerabilities and non-compliance, and take corrective action.
  • Unify Cloud and on-premise security practices
    It is a common mistake to approach cloud and on-premise security separately. Organizations wrongly use different strategies and tools for on-premise and cloud, assuming that the potential threats are also different. The resulting gaps leave both environments vulnerable and can be exploited by malicious actors. Unifying cloud and on-premise security functions under one team is extremely important, both for ensuring compliance with the security policy and for responding during an incident.
  • Automate
    Automation is important to ensure the cloud environment continuously adheres to the organization’s security baseline. Tools should automatically flag incidents and/or immediately terminate workloads that are not in compliance.
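
As a minimal sketch of how baseline enforcement and automation could fit together, the example below expresses a few of the access practices listed above as machine-checkable rules and evaluates them against an inventory of identities. The `Identity` and `Rule` types, the rule names, the 90-day thresholds, and the "flag" and "remediate" actions are all assumptions made for illustration. A real deployment would read the inventory from the CSP's own APIs and hand the findings to its remediation tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

MAX_KEY_AGE_DAYS = 90    # assumed key rotation period
MAX_INACTIVE_DAYS = 90   # assumed inactivity threshold


@dataclass
class Identity:
    """Hypothetical inventory record for a user or system identity."""
    name: str
    is_root: bool
    privileged: bool
    mfa_enabled: bool
    last_active: date
    access_key_created: Optional[date] = None  # None means no access key


@dataclass(frozen=True)
class Rule:
    """One baseline rule: a violation predicate and the action to take."""
    rule_id: str
    violated: Callable[[Identity, date], bool]
    action: str  # "flag" raises an alert, "remediate" triggers an automated fix


BASELINE = [
    Rule("root-no-keys",
         lambda i, today: i.is_root and i.access_key_created is not None,
         "remediate"),
    Rule("mfa-privileged",
         lambda i, today: i.privileged and not i.mfa_enabled,
         "flag"),
    Rule("key-rotation",
         lambda i, today: i.access_key_created is not None
         and (today - i.access_key_created).days > MAX_KEY_AGE_DAYS,
         "flag"),
    Rule("inactive-user",
         lambda i, today: (today - i.last_active).days > MAX_INACTIVE_DAYS,
         "remediate"),
]


def evaluate(identities, today):
    """Return (identity name, rule id, action) for every baseline violation."""
    return [
        (identity.name, rule.rule_id, rule.action)
        for identity in identities
        for rule in BASELINE
        if rule.violated(identity, today)
    ]


# Hypothetical inventory used only to exercise the rules.
INVENTORY = [
    Identity("root", True, True, True, date(2024, 5, 30), date(2023, 1, 1)),
    Identity("etl-service", False, False, False, date(2024, 5, 31), date(2024, 5, 1)),
    Identity("old-admin", False, True, False, date(2023, 12, 1)),
]

if __name__ == "__main__":
    for name, rule_id, action in evaluate(INVENTORY, date(2024, 6, 1)):
        print(f"{action}: {name} violates {rule_id}")
```

Keeping the rules as data makes the baseline easy to review and reevaluate, and the same evaluation can run on a schedule or whenever a configuration change is detected.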

4.5 Logging and Monitoring

  • Collect and secure logs
    All operations in cloud should be logged. Logs are a critical source of data for detecting and responding to malicious activities. Logs are also important for ensuring compliance.
    Best practices for logging in cloud:
    • Create log trails for all regions. Nothing should be excluded.
    • Use consolidated buckets for storing logs from all accounts.
    • Protect the bucket storing consolidated log data. Logs are a key source of data for detecting, investigating, and remediating a security incident, and so are a prime target for attackers. Buckets storing log files should not be publicly accessible, and access should be restricted on a need basis. Log all operations on the bucket and enforce MFA for deleting log buckets.
    • Encrypt log files whenever feasible.
    • Use log validation to detect tampering with logs. Manipulating log files is a common technique attackers use to cover their tracks.
    • Network logs (traffic flow logs) capture data on the network traffic going to and from the virtual private network in cloud. They can help in identifying known malicious IP addresses, intra-network port scanning, and traffic anomalies.
  • Monitor, Detect and Respond
    Monitoring tools provided by CSPs, kept current with an updated threat database (potential security risks), should be used to detect and alert on security incidents or suspicious behavior. Organizations may also leverage self-learning (AI/ML) systems for detecting security incidents. No manual intervention should be required to raise alerts on suspicious activity.
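
The log validation practice listed above can be illustrated with a hash chain, in which each entry's hash covers both its own record and the hash of the previous entry, so that editing or deleting a record invalidates everything after it. This is a simplified sketch under stated assumptions: the record fields and function names are hypothetical, and the digital signing and separate storage of digests that production log validation features typically rely on are omitted.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def _digest(prev_hash, record):
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def seal(records):
    """Build a chain in which every entry stores its record and chained hash."""
    chain = []
    prev = GENESIS
    for record in records:
        prev = _digest(prev, record)
        chain.append({"record": record, "hash": prev})
    return chain


def first_tampered_index(chain):
    """Return the index of the first entry that fails validation, or None."""
    prev = GENESIS
    for index, entry in enumerate(chain):
        if _digest(prev, entry["record"]) != entry["hash"]:
            return index
        prev = entry["hash"]
    return None


if __name__ == "__main__":
    # Hypothetical log records.
    chain = seal([
        {"user": "alice", "action": "GetObject"},
        {"user": "mallory", "action": "DeleteBucket"},
        {"user": "bob", "action": "PutObject"},
    ])
    print(first_tampered_index(chain))  # chain is intact
    chain[1]["record"]["action"] = "GetObject"  # attacker rewrites a record
    print(first_tampered_index(chain))  # validation now fails at that entry
```

An attacker who can rewrite the entire chain could recompute every hash, which is why the digests themselves need to be signed or stored where the attacker cannot reach them.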

Conclusion

Enterprise Document Lake on cloud enables organizations to manage large volumes of digital content at the enterprise level and build next-generation capabilities while optimizing cost. Document Lake considerably improves automation, efficiency, and user experience by allowing secure access to content from any device, at any geo-location, at any time. It allows faster development of business solutions that are subsequently easier to maintain.

Authors

Rajesh K.S

Senior Technology Architect

Girish Pande

Principal Technology Architect