Generally, blogs on Cloud Security best practices approach the topic from a Products/Services perspective, which is valuable as it stands. But for a change, I wanted to take a different approach through this blog, focusing on the ‘GCP Project Execution teams or GCP Partners’ perspective to ensure they can perform the best cloud security posture review for their customers and/or their GCP tenancy
The intention behind this blog is to provide a guide that helps GCP Consultants and Partners deliver the best Cloud Security Posture Review offerings to their customers. As it is aimed at GCP Consultants and Partners, the assumption is that they already have a very strong understanding of the GCP Products / Services, including the thorough security knowledge required to carry out the assignment and/or assessment
What is Cloud Security Posture Review [CSPR]?
It is an engagement phase aimed at identifying improvements in the existing Security designs and Operational processes by reviewing the customer’s existing GCP tenancy
What is the Outcome of this phase?
A successful outcome addresses the Customer’s Security concerns and helps increase their confidence. In turn, it also helps increase GCP adoption with confidence as a secure solution
The CSPR report document consists of a detailed list of recommendations and the reasoning behind them, based on the assessments done through discussions/workshops among the applicable stakeholders.
Who is the Customer in this context?
- who have their applications/workloads already running in GCP under different environments like DEV, TEST, STAGE or PROD
- who have an appetite to get their GCP environment reviewed to align with the Google Cloud recommended best practices
- who have a security team that may or may not be aware of the GCP security configurations or setup at the level of a Google Cloud SME, and who want to get better visibility of the things running within their GCP environment
Before proceeding with the CSPR, the GCP Consultants or Partners are required to gather background and additional [required] information/permissions and set the expectations correctly with the customer. I’m not covering the preparation checklist in this blog as it totally depends on the mutual agreement and alignment between the customer and the GCP Consultants/Partners, and it can be as detailed as needed. The main objective of this blog is to cover the key areas of focus for the CSPR and highlight the top 3 to 5 elements under each of those areas
Let’s begin the journey by going through 7 key areas: asking the fundamental questions, understanding the reasons behind those questions [why?] and coming up with the best possible recommendation/s as a solution to each of them
Cloud Resource Management
It provides the ability to centrally manage all GCP Projects and further helps to group projects into Folders based on business units, developer environments, customer environments, or a combination of them. This hierarchical organisation enables the customers to easily manage common aspects of their resources such as Organisational Policies and IAM policies. Common questions to be asked under this area are listed below:
- How are the Projects and Folders organised in your organisation ?
[Why ?] This helps create a clear and logical structure within their organisation such as departments, groups, teams, environments etc., and gives visibility following a defined hierarchy
[Recommendation/s] → Based on the business needs, define your organisation structure, which can then be easily grouped by Cloud IAM permissions and Organisation policy inheritance. Customers can refer here to get more insights and collaborate accordingly towards a well structured resource hierarchy for their footprint within the GCP environment
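As an illustration only (the organisation ID, folder names and project/billing IDs below are placeholders, not a prescribed layout), such a hierarchy could be expressed as Terraform code:

```hcl
# Hypothetical hierarchy: org -> business-unit folder -> environment folders -> projects
resource "google_folder" "engineering" {
  display_name = "engineering"
  parent       = "organizations/123456789012" # replace with your org ID
}

resource "google_folder" "prod" {
  display_name = "prod"
  parent       = google_folder.engineering.name
}

resource "google_folder" "non_prod" {
  display_name = "non-prod"
  parent       = google_folder.engineering.name
}

# Projects are then created under the environment folders.
resource "google_project" "payments_prod" {
  name            = "payments-prod"
  project_id      = "payments-prod-123456" # must be globally unique
  folder_id       = google_folder.prod.name
  billing_account = "000000-AAAAAA-BBBBBB"  # placeholder billing account
}
```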
- Do you’ve right IAM permissions set at those levels / layers ?
[Why ?] Following a Resource Hierarchy as recommended by Google Cloud, enables to simplify the application of IAM permissions. At the same time over complicating of Resource Hierarchy leads to a messy IAM permission process and makes it difficult to apply effectively and efficiently. Striking the right balance in the hierarchy is so critical else you might face difficulties in implementing principle of least-privilege access through the layers
[Recommendation/s] → Make use of folders to apply Cloud IAM permissions and organisation policies. Depending on the folder structure defined, more restrictive permissions and policies can be applied through IAM perms and Organisation policies. For e.g., production folder/s must have more stringent restrictions than the development or test folder/s.
Caution: Avoid extensive use of folder level IAM permissions; instead, apply permissions at a project or resource level.
- Do you’ve a standard / approval process defined for new project/s creation ?
[Why ?] By default, Project Creator and Billing Account Creator is assigned to all the users in the domain at the Organisation level and if this gets compromised or if users could create projects that do not follow the defined organisation best practices, that leads to abandoned projects and creates unnecessary management overhead
[Recommendation/s] → Limit the Project creation under your organisation and implement required organisation policies to grant this only to applicable users. Another way to grant this through a service account such that it can be used through automation. Along with this close control, also define a Resource Isolation strategy which helps to have only a single application, service in a project for a specific environment to enforce full isolation
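As a minimal sketch (the org ID and service account email are placeholders; note that an authoritative binding replaces any existing members for that role), limiting project creation to an automation service account could look like this:

```hcl
# Only the project-factory automation account may create projects in the org.
resource "google_organization_iam_binding" "project_creators" {
  org_id = "123456789012"
  role   = "roles/resourcemanager.projectCreator"

  members = [
    "serviceAccount:project-factory@automation-prj.iam.gserviceaccount.com",
  ]
}
```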
- Do you’ve a defined process & review mechanism to implement new org polices ?
[Why ?] If there is a process and / or review mechanism defined already, the implementation of org policies might lead to a significant impact to an existing application/s or service/s. So, a rigorous and well-defined review process should be established to reduce risk of operational impact and to clearly identify which org policies are meant to be implemented keeping in mind the business aspects
[Recommendation/s] → GCP comes with wide range of Org Policies ready to be implemented. To name a few are like ‘prevent External IP address’, ‘skip default network creation’, ‘use trusted images’, ‘resource location restriction’ and many more useful policies which contributes to a good security posture. Partners can refer to educate the customers and collaborate effectively to strike the right balance
- How are Organisation Policies applied ? Is it an automated process ?
[Why ?] Configuring Org Policies via policy-as-code techniques using automation such as Terraform enables ease of use. It helps in reviewing each change request before implementation, provides traceability and also keeps a change history
[Recommendation] → Use Terraform to manage org policies within the organisation and configure org policies as code. More can be referenced here
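As a minimal sketch (placeholder org ID; review each constraint with the business before enforcing it), two of the policies named above could be codified like this:

```hcl
# Skip creation of the default network in new projects.
resource "google_organization_policy" "skip_default_network" {
  org_id     = "123456789012"
  constraint = "constraints/compute.skipDefaultNetworkCreation"

  boolean_policy {
    enforced = true
  }
}

# Prevent VMs from being assigned external IP addresses.
resource "google_organization_policy" "deny_vm_external_ip" {
  org_id     = "123456789012"
  constraint = "constraints/compute.vmExternalIpAccess"

  list_policy {
    deny {
      all = true
    }
  }
}
```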
Identity, Authentication and Authorization
Cloud Identity is used for managing users, groups, and domain-wide security settings. It is critical to centrally manage and secure user accounts, service accounts, and service account keys.
Key to this is being able to manage IAM permissions at scale while granting users the minimum set of permissions they need to do their job, following the principle of least privilege
- Do you have a single source of truth established already and is it synchronised ?
[Why ?] User lifecycle events such as on-boarding and off-boarding need to be propagated to the Cloud Identity platform. For a successful and risk-free administration, synchronising from a single source of truth like the HR system is the best approach. As part of this, if there are any internal and external users, they can be synchronised from their corresponding single authoritative identity system/provider
[Recommendation/s] → Use a single authoritative identity store (such as Active Directory) for internal users and a customer identity store for external users (e.g. customers), and leverage Google Cloud Directory Sync [GCDS] to sync users and groups from your directory service to Cloud Identity/GSuite. Ensure users are synchronised to Cloud Identity in near real-time so that the status and profiles match between Cloud Identity and the central identity service
- How do you disable the external or internal users and their access ? Is there any process defined ? What about Audit ?
[Why ?] Terminated or suspended users/employees should not be able to access GCP resources as soon as they are disabled in the HR system. A foolproof implementation of the audit and automation process will limit the damage and minimise the disruption to the users accessing cloud resources
[Recommendation/s] → Automate and regularly audit the process of synchronising users, groups and organisational units from your Identity Store such as Active Directory to Cloud Identity. Leverage GCDS to sync from your Active Directory in near real-time. More can be referred here
- How are the Administrative roles limited and monitored ?
[Why ?] A Super Admin account is the highest-privilege account in Cloud Identity and comes with super powers. These accounts should be limited in number and locked away to prevent unauthorised access and to reduce the risk of any compromise. Always have more than one Super Admin account as a backup in case a Super Admin account gets locked out. To reduce the potential for any compromise of this account, it should be used *only* infrequently. Remember NOT to assign this role to any accounts/email addresses that are used for your day-to-day communications
[Recommendation/s] → Do NOT configure more than 3 or 4 Super Admin accounts. Multiple accounts are needed as a backup strategy for scenarios where one account gets locked out. Create dedicated accounts for Super Admin usage with an email address that is not tied to a particular user, such as superadmin-users@examplecompany.com
- How are you handling Authentication and is it enforced ?
[Why ?] To protect Super Admin accounts and avoid those accounts being compromised
[Recommendation/s] → Enforcing 2-factor authentication with Security Keys for all the Admins is a must. All Super Admin accounts must have recovery options enabled, backed by strong authentication methods, in case of any unexpected situations.
- How do you manage your service accounts, rotation policies and enforcement etc., ?
[Why ?] Users who have the Service Account User role on a service account can indirectly access all the resources that service account has access to. It is critical to audit the keys, ensure the minimum set of permissions is assigned to the service account/keys, and rotate keys frequently to reduce the attack surface in case those keys get compromised
[Recommendation/s] → Disable default service accounts and instead create custom service accounts for each service with *only* the minimum permissions required for that service. Restrict which users can act as service accounts. Service accounts used by external systems rely on static user-managed keys that do not expire and can be used from any location, so regular audits, least-privilege permissions and scheduled key rotation are critical to minimise the attack surface in case of compromise. Define a consistent naming convention for service accounts, which will help in auditing and in easily identifying the purpose behind each service account’s usage. Ideally there should be ‘0’ users who can act as a service account
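For illustration only (the project, account ID, naming convention and role below are placeholders), a least-privilege custom service account could be defined in Terraform like this:

```hcl
# Hypothetical naming convention: <app>-<env>-sa
resource "google_service_account" "billing_app_prod" {
  project      = "payments-prod-123456"
  account_id   = "billing-app-prod-sa"
  display_name = "Billing app (prod) runtime service account"
}

# Grant only the narrow role the workload actually needs,
# never broad roles such as roles/editor or roles/owner.
resource "google_project_iam_member" "billing_app_storage_read" {
  project = "payments-prod-123456"
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_service_account.billing_app_prod.email}"
}
```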
Network Security
GCP Virtual Private Cloud (VPC) provides networking functionality to VM instances, GKE containers, and Google-Managed Services that run on GCE VMs (such as Cloud Dataproc, App Engine Flex). A VPC provides global, scalable, flexible networking for cloud-based services and has several capabilities to secure and monitor the network
- Are default networks being used ?
[Why ?] Default networks allocate a large IP range on creation and come with default firewall rules that open ports such as SSH, ICMP etc. These IP ranges may be larger than required and increase the risk of IP conflicts with on-premises and/or other connected networks. Also, the default open ports are a security risk if not properly addressed. Leaving these networks active opens up the risk of users putting GCP resources on them, leaving those resources prone to compromise
[Recommendation/s] → Delete and stop using default networks. Instead, create a new custom network with regions, IP address ranges and custom firewall rules as per your needs. Implement the Org Policy constraint constraints/compute.skipDefaultNetworkCreation. Use Terraform for project creation and build default network deletion into the Terraform module. More can be referenced here. Keep in mind that applying the constraint at the org level will only apply to future projects that are created in the organisation
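A minimal custom-mode VPC sketch in Terraform (project, names, region and CIDR range are placeholders), complementing the skipDefaultNetworkCreation constraint shown earlier:

```hcl
# Custom-mode network: no auto-created subnets and no default firewall rules.
resource "google_compute_network" "prod_vpc" {
  project                 = "payments-prod-123456"
  name                    = "prod-vpc"
  auto_create_subnetworks = false
  routing_mode            = "GLOBAL"
}

resource "google_compute_subnetwork" "prod_eu_west" {
  project       = "payments-prod-123456"
  name          = "prod-eu-west1"
  region        = "europe-west1"
  network       = google_compute_network.prod_vpc.id
  ip_cidr_range = "10.10.0.0/20" # size the range to actual needs
}
```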
- Are Shared VPCs being used ? And do you have dedicated projects as host projects ?
[Why ?] Communication between services over the public internet increases the potential attack surface as the traffic leaves the network. Shared VPC allows an organisation to connect resources from multiple projects to a common VPC network so that they can communicate with each other securely and efficiently using internal IPs from that network, and it also helps centralise network administration and implement separation of duties.
[Recommendation/s] → Use Shared VPC to enable private IP communication between services and centralise network administration. Avoid using Shared VPC host projects for anything besides Shared VPC administration. Service project admins should have access to only specific subnets within the VPCs to ensure adequate isolation and separation of duties. More details can be referred here
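For illustration (the host project, service project, subnet and group are placeholders), a Shared VPC attachment with subnet-level access could be sketched as:

```hcl
# Dedicated host project used only for Shared VPC administration.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "net-host-prod-123456"
}

# Attach a service project to the host project.
resource "google_compute_shared_vpc_service_project" "payments" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "payments-prod-123456"
}

# Grant service-project admins networkUser on one specific subnet only,
# not on the whole host project.
resource "google_compute_subnetwork_iam_member" "payments_subnet_user" {
  project    = "net-host-prod-123456"
  region     = "europe-west1"
  subnetwork = "prod-eu-west1"
  role       = "roles/compute.networkUser"
  member     = "group:payments-devs@examplecompany.com"
}
```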
- Is Private Google Access enabled for all subnets requiring access to Google APIs ? How do VMs access endpoints on the Internet ?
[Why ?] By default, communication to Google APIs and Services goes over the public internet. With Private Google Access, requests stay within the Google network, so a public IP or NAT is not required for VMs to access GCP services
[Recommendation/s] → Enable Private Google Access on a subnet-by-subnet basis. Keep in mind that the firewall rules must allow egress to Google APIs and Services, and that Private Google Access only applies to VM instances with private IPs [NO external IP]. VMs can leverage Cloud NAT to go out to the internet and access external endpoints
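A hedged sketch (names, project and region are placeholders; the VPC name assumes the custom network from the earlier sketch) enabling Private Google Access on a subnet and adding Cloud NAT for outbound access:

```hcl
resource "google_compute_subnetwork" "private_subnet" {
  project                  = "payments-prod-123456"
  name                     = "prod-private-eu-west1"
  region                   = "europe-west1"
  network                  = "prod-vpc"      # custom VPC from the earlier sketch
  ip_cidr_range            = "10.10.16.0/20"
  private_ip_google_access = true            # reach Google APIs without external IPs
}

resource "google_compute_router" "nat_router" {
  project = "payments-prod-123456"
  name    = "prod-nat-router"
  region  = "europe-west1"
  network = "prod-vpc"
}

# Cloud NAT gives private VMs outbound internet access without external IPs.
resource "google_compute_router_nat" "nat" {
  project                            = "payments-prod-123456"
  name                               = "prod-nat"
  router                             = google_compute_router.nat_router.name
  region                             = "europe-west1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"
}
```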
- Is internet egress DENIED by default ? And do Load Balancers use strict SSL policies ?
[Why ?] Creating a default deny firewall rule for all egress traffic mitigates data exfiltration risks. Ensure that the COMPATIBLE SSL load balancer policy is NOT used, as it allows the broadest set of clients, including those which only support out-of-date SSL features, to negotiate SSL with the load balancer
[Recommendation/s] → Create a firewall rule with the lowest priority that blocks all outbound/egress traffic for all protocols and ports, in combination with the default deny-all on ingress traffic. Then create higher-priority firewall rules for specific traffic in order to open the required ports and protocols. This way, ports and protocols are not exposed unnecessarily. Use of the RESTRICTED profile is recommended for load balancers fronting sensitive and regulated workloads; ciphers enabled in the RESTRICTED profile are only supported with TLS 1.2. About SSL policies, more details can be referred here
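As a minimal sketch (project and names are placeholders; the network again refers to the earlier custom VPC), a lowest-priority deny-all egress rule and a RESTRICTED SSL policy could look like this:

```hcl
# Lowest user-configurable priority, so specific higher-priority allow rules always win.
resource "google_compute_firewall" "deny_all_egress" {
  project            = "payments-prod-123456"
  name               = "deny-all-egress"
  network            = "prod-vpc"
  direction          = "EGRESS"
  priority           = 65534
  destination_ranges = ["0.0.0.0/0"]

  deny {
    protocol = "all"
  }
}

# Strict SSL policy for external HTTPS/SSL load balancers.
resource "google_compute_ssl_policy" "restricted" {
  project         = "payments-prod-123456"
  name            = "restricted-tls12"
  profile         = "RESTRICTED"
  min_tls_version = "TLS_1_2"
}
```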
- Are VPC Flow Logs enabled ? Is Firewall Rules Logging enabled, including for the default-deny firewall rules ?
[Why ?] For networks carrying highly sensitive data, it is critical to have the ability to conduct forensics if an incident were to occur, and that can be achieved by enabling VPC Flow Logs for every subnet in the VPC network. Similarly, firewall rules provide network security by controlling traffic entering and leaving the network, and Firewall Rules Logging is critical to audit them and to verify that the existing firewall rules are working as intended :)
[Recommendation/s] → Enable VPC Flow Logs for all the sensitive networks so that, in case of an incident, network forensics can be conducted. The logs can be very large and have associated costs, so adjust the sampling rate and aggregation interval based on your requirements; you can also turn them on only for the specific subnets which handle sensitive information. Similarly, ensure that Firewall Rules Logging is enabled for all the firewall rules in place
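A hedged sketch (placeholder names; sampling values should be tuned for cost) showing both VPC Flow Logs on a subnet and Firewall Rules Logging on a rule:

```hcl
resource "google_compute_subnetwork" "sensitive_subnet" {
  project       = "payments-prod-123456"
  name          = "prod-sensitive-eu-west1"
  region        = "europe-west1"
  network       = "prod-vpc"
  ip_cidr_range = "10.10.32.0/20"

  # VPC Flow Logs: tune interval and sampling to balance cost against visibility.
  log_config {
    aggregation_interval = "INTERVAL_5_MIN"
    flow_sampling        = 0.5
    metadata             = "INCLUDE_ALL_METADATA"
  }
}

resource "google_compute_firewall" "allow_https_ingress" {
  project       = "payments-prod-123456"
  name          = "allow-https-ingress"
  network       = "prod-vpc"
  direction     = "INGRESS"
  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["https-frontend"]

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  # Firewall Rules Logging for audit and forensics.
  log_config {
    metadata = "INCLUDE_ALL_METADATA"
  }
}
```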
- How do you protect your applications against DDoS and application layer attacks ? Have you configured end-to-end TLS encryption for applications behind HTTP(S) load balancers using IAP ?
[Why ?] In order to protect your internet-facing applications, filtering incoming traffic through security policies is critical. End-to-end encryption is essential so that traffic is NOT sent over the wire in plain text. This ensures transport security and protects your sensitive data
[Recommendation/s] → Global load balancers provide automatic defence against L3/L4 infrastructure DDoS attacks, and the use of Google Cloud Armor security policies protects your load-balanced services from application layer attacks such as the OWASP Top 10 vulnerabilities. Use the global HTTP(S) load balancer in combination with Cloud Armor security policies to front your internet-facing applications and services. Configure the HTTPS load balancer, IAP, backend service and application to listen on the SSL port only, and allow only HTTPS traffic on the VPC firewall.
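To illustrate (the policy name and rules are placeholders; tune the preconfigured WAF rules to your application before enforcing them), a Cloud Armor security policy could be sketched as:

```hcl
resource "google_compute_security_policy" "edge_waf" {
  project = "payments-prod-123456"
  name    = "edge-waf-policy"

  # Block requests matching a preconfigured XSS signature.
  rule {
    action      = "deny(403)"
    priority    = 1000
    description = "Deny XSS attempts"
    match {
      expr {
        expression = "evaluatePreconfiguredExpr('xss-stable')"
      }
    }
  }

  # Default rule (lowest priority) allows remaining traffic.
  rule {
    action      = "allow"
    priority    = 2147483647
    description = "Default allow"
    match {
      versioned_expr = "SRC_IPS_V1"
      config {
        src_ip_ranges = ["*"]
      }
    }
  }
}

# The policy is then referenced from the backend service used by the
# external HTTP(S) load balancer, via its security_policy field.
```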
VM Security
GCP provides a wide range of ways to configure your VM, including options on how to best secure your instance (IAM access, firewall rules, service accounts, API scopes, SSH, base images, logging, etc). It is a huge responsibility to keep the VM operating system and applications up to date with the latest security patches, from both the Customer’s and the GCP Partner’s perspective.
- Are you in need of a sole tenancy ?
[Why ?] Sole-tenant nodes ensure that your instances do not share host hardware with instances from other projects, and the use of labels can help separate your sensitive workloads from non-sensitive workloads. As an example, some payment processing workloads might require physical isolation from other workloads or virtual machines in order to meet compliance requirements
[Recommendation] → Use sole-tenant nodes to keep your instances physically separated from other instances. More details can be referred here
- Are least-privileged custom service accounts created for your VMs ?
[Why ?] By default, the Compute Engine default service account has the Project Editor role, which grants access to all the GCP resources within that project. If this service account is assigned to a Compute Engine instance, anyone with access to that instance will have access to all the resources in the project, which is not a recommended practice.
[Recommendation] → Do NOT use the Compute Engine default service account; instead, always create a custom service account and grant only the minimum IAM permissions needed for the application on the VM instance to get the required job done
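A hedged sketch (instance name, zone, image and subnet are placeholders; the service account email assumes the custom account from the earlier Identity section sketch) of attaching a least-privilege service account to a VM:

```hcl
resource "google_compute_instance" "billing_app_vm" {
  project      = "payments-prod-123456"
  name         = "billing-app-vm"
  machine_type = "e2-small"
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      # Placeholder hardened image family from an approved image project.
      image = "projects/my-hardened-images/global/images/family/hardened-debian-12"
    }
  }

  network_interface {
    subnetwork = "prod-private-eu-west1" # no access_config block, so no external IP
  }

  # Attach the custom least-privilege service account, not the Compute default one.
  service_account {
    email  = "billing-app-prod-sa@payments-prod-123456.iam.gserviceaccount.com"
    scopes = ["cloud-platform"] # rely on IAM roles rather than legacy scopes
  }
}
```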
- Do you need SSH/RDP access to your VMs from Internet ?
[Why ?] IAP’s TCP forwarding feature lets you control who can access administrative services like SSH and RDP from the public internet. It prevents these services from being openly exposed to the internet; instead, requests must pass authentication and authorisation checks at the IAP proxy before they reach the targeted resource
[Recommendation] → Configure IAP with context-aware access to restrict the IP CIDR range, client devices and users/groups who are allowed access. Users gain access *only* if they pass the authentication, authorisation and IP/device checks
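As a sketch (project and network are placeholders; 35.235.240.0/20 is Google’s documented IAP TCP forwarding range), allow SSH only from IAP so that port 22 is never exposed directly to the internet:

```hcl
resource "google_compute_firewall" "allow_ssh_from_iap" {
  project       = "payments-prod-123456"
  name          = "allow-ssh-from-iap"
  network       = "prod-vpc"
  direction     = "INGRESS"
  source_ranges = ["35.235.240.0/20"] # IAP TCP forwarding source range

  allow {
    protocol = "tcp"
    ports    = ["22"]
  }
}
```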
- How do you ensure only trusted golden images are deployed ?
[Why ?] Only images that have passed the security vulnerability tests and have been configured in an appropriate manner should be allowed to be used within the VPC and we know well enough already the reasons behind it ;)
[Recommendation] → Ensure an Organisation Policy is created to limit the allowed compute hardened images available for consumption
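A minimal sketch (org ID and image project are placeholders) of such a constraint, limiting boot images to an approved hardened-image project:

```hcl
# Only allow compute images that come from the approved hardened-image project.
resource "google_organization_policy" "trusted_images" {
  org_id     = "123456789012"
  constraint = "constraints/compute.trustedImageProjects"

  list_policy {
    allow {
      values = ["projects/my-hardened-images"]
    }
  }
}
```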
- Do you regularly patch custom images and have an image update process established for them ?
[Why ?] Having an automated plan for managing the images with the latest security patches and other security updates reduces the attack surface and helps to quickly remediate vulnerabilities in OS and installed software packages
[Recommendation/s] → Automate the process of creating and updating new images. Further, tag your images with the version, time and date to help with auditing and debugging. Enforce lifecycle policies on custom images by marking images for deletion and obsolescence, and use an org policy to ensure that only trusted images are being used.
Data Security
GCP offers fully-managed, scalable database services to support your applications and store your data. With managed services there is a wide range of controls and configurations available to store your data securely, and using these configurations and controls well helps you achieve the best possible security posture.
- What tools are being used to scan, classify and report any sensitive data ?
[Why ?] If you have sensitive data such as credit card numbers, names, social security numbers etc., this information needs to be redacted as a top priority, otherwise it runs the risk of being compromised
[Recommendation] → Consider implementing a DLP [Data Loss Prevention] practice that helps discover and classify the sensitive data, then redact the sensitive data elements. This helps preserve the utility of your data for joins or analytics while obfuscating the raw sensitive identifiers
- Are you protecting your Cloud SQL database instances from being open to the world ?
[Why ?] Minimise the attack surface on database instances; only trusted/known and required IPs should be authorised to connect to them.
[Recommendation/s] → Configure Cloud SQL with a private IP only to minimise the attack surface. Configure the root user with a very strong password. Always set up Cloud SQL instances to accept only SSL/TLS connections, i.e., enforce SSL for all connections.
Note: Connections to your instance through the Cloud SQL Proxy are encrypted whether you configure or enforce SSL/TLS or not. SSL/TLS configuration affects only connections made using IP addresses
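A hedged sketch (instance name, tier and network are placeholders; it assumes a private services access connection already exists for the VPC, and on newer provider versions the require_ssl flag is expressed via ssl_mode):

```hcl
resource "google_sql_database_instance" "payments_db" {
  project          = "payments-prod-123456"
  name             = "payments-db-prod"
  region           = "europe-west1"
  database_version = "POSTGRES_15"

  settings {
    tier = "db-custom-2-7680"

    ip_configuration {
      ipv4_enabled    = false # no public IP
      private_network = "projects/payments-prod-123456/global/networks/prod-vpc"
      require_ssl     = true  # accept SSL/TLS connections only
    }
  }
}
```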
- Are any datasets shared publicly or are there any datasets shared directly with external users ?
[Why ?] To avoid data exposure, restrict public access especially on data sets that contain restricted or sensitive data
[Recommendation] → To avoid data exposure, restrict access to internal users on a need-to-know basis. This can be done at configuration time using IAM policies and enforced using Organisation Policies
- Do you have Data Lifecycle policies configured for GCS buckets and did you define retention policies for compliance needs ?
[Why ?] Ensuring that data older than necessary is no longer retained reduces the opportunity for unintended data exposure. Without a policy in place, data in a bucket also runs the risk of being inadvertently overwritten and/or deleted, which can be a potential issue from a compliance perspective
[Recommendation/s] → Data lifecycle policy configurations on buckets apply a set of rules to current and future objects in those buckets. When a particular object meets the criteria of one of the rules, Cloud Storage automatically performs the specified action on that object, which overcomes unintended retention or loss. On the compliance side, Bucket Lock can help ensure the data is not overwritten or deleted
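A minimal sketch (bucket name, ages and retention period are placeholders to be driven by your compliance needs):

```hcl
resource "google_storage_bucket" "audit_archive" {
  project  = "payments-prod-123456"
  name     = "examplecompany-audit-archive" # bucket names are globally unique
  location = "EU"

  uniform_bucket_level_access = true

  # Delete objects once they are older than necessary.
  lifecycle_rule {
    condition {
      age = 365 # days
    }
    action {
      type = "Delete"
    }
  }

  # Bucket Lock: objects cannot be deleted or overwritten before 90 days.
  retention_policy {
    retention_period = 7776000 # seconds (90 days)
    is_locked        = false   # lock only once the period is final; locking is irreversible
  }
}
```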
- Do you have publicly accessible buckets or objects ?
[Why ?] To avoid unnecessary data exposure, restrict public access especially on buckets/objects that contain restricted or sensitive data
[Recommendation/s] → Update the IAM policies on the buckets, update the object-level ACLs, or disable the use of object-level ACLs with an organisation policy to remove the unnecessary public access. This can be further enforced using VPC Service Controls or the Forseti config validator
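As a hedged addition (the org ID is a placeholder), public access can also be blocked organisation-wide via the Public Access Prevention constraint, with per-bucket enforcement available through the bucket’s public_access_prevention setting:

```hcl
# Org-wide guardrail: buckets in the organisation cannot be made public.
resource "google_organization_policy" "public_access_prevention" {
  org_id     = "123456789012"
  constraint = "constraints/storage.publicAccessPrevention"

  boolean_policy {
    enforced = true
  }
}
```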
K8S Security
GKE offers a wide range of configuration options that allow organisations to apply security controls and policies that meet their requirements, and new and improved security features are added frequently. The overall security posture and strength of isolation for each workload, cluster, and surrounding environment depends largely on the combined set of security features, controls, and policies that are put in place.
- Are container images built from a minimal base ? and where do they come from ?
[Why ?] It helps to have only the core and necessary software available, which in turn reduces future maintenance work from a security perspective. Enforcing the latest approved base images during the image creation process also ensures the latest security fixes and updates are incorporated, which reduces the number of vulnerabilities in the resulting image
[Recommendation/s] → Use the smallest possible base image to reduce the amount of OS packages and libraries installed. Build a validation process into your CI/CD pipeline to ensure only secure base images are used. More details can be referred here
- Are you storing secrets in your containers ? if so, think again!
[Why ?] In case of any compromise, revocation and rotation of secrets would only involve changes to the security credentials themselves and would not require a rebuild and redeploy of every affected container image. This is possible only if you move all environment-specific configuration and security credentials out of the container images
[Recommendation/s] → Secrets should be stored in a centralised secrets management system, and workloads should only be combined with the necessary secrets at run time on a per-environment basis. It is critical to perform regular audits of your code repository to ensure that NO database credentials, API keys or other secrets are ‘baked in’ to the code and image build process. Watch this to refresh your thoughts about container image best practices
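For illustration (the secret name and service account are placeholders, and the replication block shown assumes a recent Google provider version), Secret Manager can hold such credentials centrally instead of baking them into images:

```hcl
resource "google_secret_manager_secret" "db_password" {
  project   = "payments-prod-123456"
  secret_id = "billing-db-password"

  replication {
    auto {}
  }
}

# Grant the workload's service account read access to this one secret only.
resource "google_secret_manager_secret_iam_member" "billing_app_access" {
  project   = "payments-prod-123456"
  secret_id = google_secret_manager_secret.db_password.secret_id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:billing-app-prod-sa@payments-prod-123456.iam.gserviceaccount.com"
}
```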
- Do you’ve Container LifeCycle Management defined and followed ?
[Why ?] It is critical to scan container images upfront and it is equally important to scan them over time to ensure their security posture is always the latest to meet the desired standards. Consider a scenario where A scan passed the vulnerability scan initially, but after discovery of a new vulnerability or CVE in the cyber world, it will no longer meet the necessary latest requirements
[Recommendation] → Ensure all images are scanned, updated, rebuilt on a recurring basis to obtain the required security fixes. Configure GCP Container Analysis or any 3rd party tool to scan the container images and notify teams when containers in use NO longer meet the desired security requirements. Please refer here for more insights
- What about ‘Resource Isolation’ and ‘Resource Management’ under K8S ?
[Why ?] Resource Isolation is very critical to avoid contention issues and also to reduce the likelihood of certain resource exhaustion failure scenarios. Resource Management is key as well to ensure all pods have requests and limits rightly configured to prevent over and / or under subscription of resources
[Recommendation/s] → Enable LimitRanges to set base CPU and RAM requests/limits on pods for better Resource Management across the workload/s. Similarly, place the critical micro-services on a dedicated node pool and isolate them from the other micro-services on another dedicated node pool. Use ‘taints’, ‘tolerations’ and ‘node affinity’ features to ensure that scheduling and deployments happen as per the specifications and configurations
- How are you managing your ‘cluster admins’ and do you use ‘cluster network policies’ ?
[Why ?] The number of users who have the ability to control your GKE clusters should be restricted to the smallest possible set, and it should be a very well known and limited list of members/users. Defence-in-depth can be achieved by enabling and properly defining Network Policies to restrict pod-to-pod communication. For example, if you have a compromised front-end service in your application, a network policy lets you control the communication links back down to the lower layers and successfully protect those layers from the compromised front-end service/s. It fundamentally helps to reduce the attack surface in case of any compromise or malfunction in the system
[Recommendation/s] → Ideally, only a small number of DevOps/SRE users and development team members should hold Cluster Admin responsibilities. Closely monitor the user base with those permissions through regular audit and review mechanisms. Add applicable rules in Network Policies to restrict communication between workloads and help avoid lateral movement within the cluster. More details can be referred here
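As a hedged sketch (cluster name, location and network names are placeholders; on GKE Dataplane V2 policy enforcement is built in, so the explicit blocks below apply to the Calico-based option), enabling NetworkPolicy enforcement on a cluster:

```hcl
resource "google_container_cluster" "prod_cluster" {
  project            = "payments-prod-123456"
  name               = "prod-cluster"
  location           = "europe-west1"
  network            = "prod-vpc"
  subnetwork         = "prod-private-eu-west1"
  initial_node_count = 1

  # Enforce Kubernetes NetworkPolicy resources (Calico-based enforcement).
  network_policy {
    enabled  = true
    provider = "CALICO"
  }

  addons_config {
    network_policy_config {
      disabled = false
    }
  }
}
```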
Security Operations
Securely managing the automated operations lifecycle of GCP resources, protecting its logging and monitoring streams, ensuring all systems conform to a secure configuration baseline, and responding to security incidents are all important capabilities for security and operations teams to master. Insufficient focus in these areas can manifest in security misconfigurations, slow development cycles, and missed or delayed responses to security incidents.
- Do you’ve a centralized and aggregated logging in place ? Is there any 3rd party SIEM integration in place ?
[Why ?] Configuring a Centralized system that aggregates the logging allows security teams to manage alerts and gives full context of incident/s.
[Recommendation] → Create a sink in a GCP organisation or folder and set that sink’s includeChildren parameter to True. This configuration inherits all the child projects under that organisation and folder which can simplify the process of exporting all logs for a given environment in to a 3rd party SIEM solution or an integration with Cloud Security Command Center
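A minimal sketch (org ID, filter and destination bucket are placeholders) of an aggregated organisation-level sink:

```hcl
resource "google_logging_organization_sink" "org_audit_sink" {
  name             = "org-audit-sink"
  org_id           = "123456789012"
  include_children = true # pull in logs from every folder and project below the org

  # Export admin activity audit logs; adjust the filter to your needs.
  filter      = "logName:\"cloudaudit.googleapis.com%2Factivity\""
  destination = "storage.googleapis.com/examplecompany-audit-archive"
}

# The sink writes with its own service account; grant it access to the destination.
resource "google_storage_bucket_iam_member" "sink_writer" {
  bucket = "examplecompany-audit-archive"
  role   = "roles/storage.objectCreator"
  member = google_logging_organization_sink.org_audit_sink.writer_identity
}
```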
- Do you restrict access to Exported Logs ? Is versioning enabled on Log sink buckets ?
[Why ?] Always limit access to these exported logs to a small number of accounts with the appropriate permissions in place, to avoid any compromise and data exposure. To support the retrieval of objects that are deleted or overwritten, the Object Versioning feature is very important
[Recommendation/s] → Follow the principle of least privilege when granting access permissions to these logs. Grant the minimum permissions required to get the job done, monitor who has what access and why, and document as necessary. For all GCS buckets which are targets of log exports, enable Object Versioning or implement the ‘bucket lock’ feature to prevent any malicious action
- How often do you conduct Risk assessments and Vulnerability Scanning ?
[Recommendation] Frequent assessments and scanning of the infrastructure, other GCP services and your source code ensure that vulnerabilities are remediated whenever they arise [the ‘why’ is very well known, hence skipped]
- Do you’ve an Incident Response [IR] plan in place ?
[Why ?] A well formed IR plan can reduce the impact of an incident, avoids the confusion but helps to remediate the issue in an organised way through systematic approach and /or actions
[Recommendation] Develop and maintain an appropriate IR Plan in accordance with the necessary compliance frameworks. The NIST 800–61 is commonly used baseline for building a response plan. Please refer here for more details
That’s enough questions and recommendations to serve as an entry point towards achieving the goal.
During this engagement process, GCP Partners can come up with a document covering all the applicable areas through ‘Technical Review and Improvements’ [a sample is shown below] and present it to the stakeholders. Based on mutual agreement and alignment, the engagement can then move into the implementation phase/s
A good way of showing the improvements is to prepare the list of items/sections and compare the “Before/Old Status” and “After/New Status”, so that the value addition is very clear and precise to the stakeholders. A sample is shown below
This is it!
Hope you find this information useful.
Thanks for taking time to read!