New White Paper: Trusted Workload Migration with EMC, RSA, Intel, and HyTrust

I’ve been working on this white paper for some time now but because we were pulling so much information from so many different companies, it took a while to get the review process wrapped up.  However, today it was finally published.

The white paper discusses the security and logistical challenges that enterprises and service providers face when migrating application workloads between data centers.  It then provides an overview of a solution jointly developed by EMC, RSA, Intel, and HyTrust that enables trusted workload migration between data centers using technology available from those companies.

The solution demonstrates two virtualized, active/active, geographically dispersed data centers managed and administered by the same service provider.  The cloud environments in both data centers are active, with servers and storage virtualized in each.  The solution meets several key goals of the project, listed below:

  • Demonstrate immediate non-disruptive workload migration within and between data centers
  • Enable hardware root of trust for cloud hosts to validate that the hosts running the virtual machines have not been compromised by attacks such as BIOS rootkit attacks that run underneath the hypervisor
  • Provide an example of active security policy enforcement using hardware security data collected from the cloud hosts
  • Implement audit and reporting capabilities so an enterprise or service provider can pull real time reports showing an overall view of their cloud host integrity status

You can find the full paper here.

EMC World Session Preview: Disaster Recovery-as-a-Service (DRaaS)

It’s that time of year.  Time to head out to Las Vegas for a greatly anticipated EMC World show.  I’ll be out there to present two technical sessions on disaster recovery for cloud providers and service providers.  I’ll also be serving as the captain for the Cloud and Big Data kiosks within the EMC Solutions Group pavilion on the solutions floor.  Don’t worry, I’ve already gotten lots of jokes thrown my way now that I have my captain title (thanks to some of our friends across the pond).

My sessions will be covering a DRaaS solution we developed for service providers and cloud providers.  It’s a multi-tenant, multi-site cloud infrastructure with data protection provided by RecoverPoint, and DR orchestration provided by VMware SRM.  Our engineers in Cork did a nice job building out this environment for us as well as all of our testing.  It’s really a preview for a DRaaS white paper that we’ll be releasing next month so stay tuned for that.  Here is a high level view of the two data center cloud environments:

[Figure: DRaaS, high-level view of the two data center cloud environments]

Make sure you stop by to see us in the Cloud and Big Data pavilion as well.  We have Dan Dunn doing his Paintjam sessions during the week (www.paintjam.com) so that should keep everyone entertained.  We’ll be covering the DRaaS solution mentioned above, but will also be covering Hadoop-as-a-Service and a solution around VMAX Cloud Edition.  We also have some of the guys from the OpenStack team if you want to see what EMC is doing with that as well.

If you’re going, I look forward to catching up.  It will be a busy week but should be a great one.

Slowdown In Spending? Or A Shift In Where The Money Is Going

Several analysts, such as the one in the story above, are hinting that F5’s results might be an indicator that IT spending is heading downward, and that this trend might affect bigger vendors like Juniper and Cisco.  I can’t tell you whether there is a slowdown.  But I can tell you there has been a definite shift in where those resources are going.

This trend has actually been in the works for a few years, and it’s part of my job to be on top of what these vendors are offering, where they are heading, etc.  I can tell you that I’ve had serious doubts about F5’s ability to survive for a few years, and I’ll tell you why.

To give a brief summary, F5 is in the position that Check Point was in for many years.  They were viewed as the premier product offering in their niche category.  They had great products that no one could touch, and they charged a high premium for them.  I think F5 is finding themselves in this situation now and frankly, my experience with them tells me that they haven’t reacted quickly enough to save that reputation.  In turn, this means they have lost the ability to protect those years of very nice revenue and profit.

In my previous role at a managed service provider, I was responsible for many of the networking product offerings the company sold.  Prior to 2008, when we released our first multi-tenant public cloud offering, the load balancing products were pretty cut and dried.  We offered some basic load balancer options based on BSD for customers that didn’t need all the functionality that F5 offered.  For the customers that needed that functionality, we offered F5, which at the time was viewed as the top of the line (and may still be for the time being).  The monthly recurring price difference between those two services was very high, but the customers could basically do anything they needed to do from the perspective of load balancing.

Enter cloud.  Early on in the cloud days, we were focused primarily on virtualizing the compute and storage.  The network resources were abstracted from the cloud infrastructure and there were no virtual firewalls, load balancers, etc.  So we pulled each tenant environment out of the compute layers and into the network, via per-tenant VLANs, and then offered layer 4 through 7 services upstream.

We needed the ability to contextualize and segment each tenant’s traffic while preserving precious resources on the physical hardware appliances running those services.  At the time, Cisco had ACE virtual contexts, which allowed us to break an ACE load balancer up into multiple tenant contexts.  We ran a pilot that was actually pretty successful.  We could carve up resources for multiple tenants.  The number of tenants was very small at the time, but that was OK because not every tenant needed load balancing services and it was viewed as a premium value add.  We could even allocate a percentage of resources that each tenant could use (memory, CPU, SSL transactions per second, etc.).  So it seemed like this was a very viable solution for what we needed.  We could have each tenant’s environment extended into the network through localized VLANs, and then we could extend those VLANs into virtual contexts upstream within the firewalls and load balancers (we were using Juniper VSYS within the ISG platform, a similar concept).

The problem for us was not technology, because that was there.  The problem was twofold.  Operationally, we did not want to introduce another load balancer vendor into an already crowded managed networking product line.  We would have had to train staff on a new platform, complete all the product work required to release a new product, train the field on how to sell it, buy spares for each data center they were deployed in, etc.  The second issue was that Cisco was viewed as a horrible vendor choice for load balancers by many of our customers.  Even though they wouldn’t be managing them, we still had to fight the perception that we were using products inferior to what F5 was offering.  The issue here is that F5 didn’t offer the ability to do what ACE could.

So we went to F5 and asked them for similar functionality.  We explained what we were looking for.  At first, we got blank stares like “why in the world would you want this?”  Cloud was new and we were one of the first providers to introduce a multi-tenant IaaS offering.  As time went on, that tone changed to “just wait, we understand the need, and we have that functionality coming.”  We waited.  We waited some more.  Then we were told that we just had to wait for “version 10” to come out.  That would solve these challenges.  It didn’t.  F5 had not listened to us or to others in the industry who were explaining the needs of these multi-tenant infrastructures.

We ended up having to settle for using the F5s in a one-arm design that sat upstream and outside of the cloud environment.  This put severe limitations on the services we could offer tenants, and we couldn’t restrict SSL TPS, which is one of the most expensive licenses you will buy for the platform.  So one tenant could come in and use up 90% of the resources, but pay the same price as the other tenants not using many at all.
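To make the noisy-neighbor problem concrete, here is a toy sketch in Python.  The capacity, tenant names, and demand figures are invented for illustration, and the `allocate` function is hypothetical; it models neither ACE nor F5 behavior.  It only shows the difference between a shared pool and per-tenant percentage caps.

```python
from typing import Dict, Optional


def allocate(capacity: int, demand: Dict[str, int],
             cap_pct: Optional[Dict[str, int]] = None) -> Dict[str, int]:
    """Grant SSL TPS to tenants in order until the shared capacity runs out.

    If cap_pct is given, each tenant is limited to that percentage of the
    total capacity, no matter how much it asks for.
    """
    remaining = capacity
    granted = {}
    for tenant, wanted in demand.items():
        limit = wanted
        if cap_pct is not None:
            limit = min(wanted, capacity * cap_pct[tenant] // 100)
        granted[tenant] = min(limit, remaining)
        remaining -= granted[tenant]
    return granted


# Made-up numbers: 1000 SSL TPS shared by three tenants.
demand = {"tenant_a": 900, "tenant_b": 300, "tenant_c": 200}

# Shared pool with no caps: the first heavy tenant starves the others.
uncapped = allocate(1000, demand)

# Per-tenant caps expressed as a percentage of the appliance.
capped = allocate(1000, demand, {"tenant_a": 40, "tenant_b": 30, "tenant_c": 30})

print(uncapped)
print(capped)
```

With the caps in place, the heavy tenant is held to its share and the other two get what they asked for.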

Today, I know F5 has adjusted some and offers some of these features.  But it came too late in my opinion, and I’ll bet many providers were in situations similar to ours.  I’m confident that F5 sales have gone down like the reports indicate, and I’m betting they continue to slide.

Now, everything is being driven towards abstraction, even the network resources.  Convergence is not just happening at compute and storage, it’s happening everywhere.  Cisco and Juniper are very much in the game (look at the Juniper Contrail acquisition).  VMware is driving towards software defined networking with the Nicira buy.  It won’t be long before all of these vendors have offerings that reside on the hypervisor and there won’t be a need to run those services on physical appliances.

I deal with this stuff every day.  I keep up with where the industry is trending and constantly research new offerings that may address some of these challenges.  In the past, F5 would have been right there in those discussions.  I can’t tell you the last time F5 came up when discussing these large cloud infrastructures.  Cisco and Juniper still come up because there is so much investment in them already within the legacy environments.  Vendors have come to trust them and many will look for their leadership going forward.  Vendors like F5 who had very niche offerings will continue to be marginalized and will ultimately be replaced.

So, back to the original point.  I don’t think F5’s performance is an indicator of the spending trends in the industry.  I think it symbolizes a shift in spending and those dollars that once fed F5 are being spent on other solutions.  To me, that means SELL and thankfully I did a few years back.

AWS Taking On CIA Cloud Contract

Last week, there were lots of articles around the Amazon Web Services announcement that they had won a contract to host some of the CIA’s cloud services.  The initial articles contained very little information, which is understandable.  They made it sound like the CIA was going to be hosting their apps on the AWS public cloud.  For those of us in this business who deal with multi-tenant cloud infrastructures and the challenges that come with them, this seemed like a bit of a stretch.  We had some lively discussion going back and forth among my EMC peers and honestly, my first thought was that AWS had convinced the CIA to host some non-critical, public facing app on the AWS cloud as a PR stunt.  We all know how those marketing guys like to take the smallest bit of information and turn it into a major, ground breaking release!

More information has now come out about it and David Linthicum over at InfoWorld made some pretty generic comments about it here.  He does briefly touch on some key points that everyone building cloud services or thinking about moving to any type of managed cloud service should consider.

It now appears that AWS is building and will be running a private cloud infrastructure for the CIA.  You can probably bet that the apps running on that infrastructure will not pose a threat to national security should they be compromised.  I imagine our great congressional oversight committees would have a few questions about that, especially since they decided to broadcast it out in press releases last week.

David makes a good point in his post about how this isn’t the core business of AWS and how it actually is getting away from their stated purpose.  He’s right.  He’s also most likely right in that AWS couldn’t turn down the money the CIA was throwing at them and took on this one-off to capture that $600M in revenue and to gain the publicity that would come with saying you run the CIA’s cloud.

This is the same type of scenario that often happens in any public ISP.  Your sales guys are out there selling your services.  Well, your services don’t exactly fit some of these large deals in their pipeline.  So what happens?  They come back to the business with these extreme one-off requests, the business looks at the numbers, and then says tell them we’ll do it.  Many times, this information never makes it back to the engineering or operations teams for vetting the feasibility of taking on such a project.

So what happens?  The deal gets done and then it just funnels down to those teams for design and implementation, after the commitment has already been made to the customer.

While the revenue numbers often look very tempting and will certainly present a nice story at the next board meeting, these deals can often lead to many more problems for the company.  

The first obvious question that comes to mind is “Can we actually build what we’ve promised?”  Lots of engineering cycles might have to be pulled off higher priority projects to figure this out, costing the business revenue that could be generated in the future from new services that are in development.  In my past, I’ve been at companies that counted lost revenue in the number of days it took to bring new customers up or launch new services.  This is a very real, and sometimes very big number.

Once you have designed that project, of course you have to implement the solution which also could significantly pull time away from engineering resources, especially on the larger sized deals.  As that is happening, your operations teams have to be figuring out the logistics of supporting this new model and customer.  Again depending on the size of the deal and customer, this actually might end up looking like dedicated resources for the project, as well as individualized run books for this particular customer.  Once again, you are costing the business more money by doing this and you are pulling your valuable resources away from your core business model.

I’m not advocating that companies turn down these opportunities.  Not at all.  In fact, many companies actually build from these types of opportunities and create entirely new business models and revenue streams from such projects.  This is just a warning that as you take on these deals, get your engineering and operations teams involved early on if possible.  There have been many such deals that have turned out to be loss leaders for an organization simply because the company didn’t truly understand their cost of goods sold before inking the deal.  The business ultimately needs to be able to ask itself if the deal is worth it, and you need at least some idea of true costs before you can make that determination.

Back to AWS and the CIA.  Will this be good for AWS?  Probably so.  Will it cause them to shift valuable engineering resources away from other projects to focus specifically on this?  Absolutely.  Could it ultimately impact their overall service?  Taking on a single $600M contract when your current revenues are around $1.5B will absolutely cause you to shift resources, and a lot of them, towards making sure that customer is happy.  Time will tell what that ultimately means for AWS.  There are a ton of companies out there just waiting for them to make one misstep so they can jump in to take revenue share.

 

Building & Deploying a DRaaS solution for Cloud Providers Using EMC RecoverPoint & VNX

If you’re going to EMC World this year, be sure to sign up for this session that I’m presenting.  We’ll be walking through building a DRaaS solution for cloud providers using EMC’s RecoverPoint and VNX.  We’ll discuss these in terms of multi-tenancy support, so if you want to build multi-tenant DR services for your cloud environments, stop in and hear about our experiences.

Secure Workload Migration Within and Between Data Centers

This was a very busy week for me at EMC.  It marked the announcement of a solution we put together with a lot of hard work by people from EMC, RSA, Intel, and HyTrust.  This was the first project I was handed after coming on board with EMC and I’ll have to admit that I’m extremely satisfied with the results and with the overall success the project has seen.

The solution was announced this week at Intel IDF 2012 in San Francisco.  We started off the week giving an overview at the Solution Provider Summit for the Open Data Center Alliance (ODCA).  We then got air time during Renee James’ keynote on Wednesday, I had a technical presentation that day, we got the press releases out, shot about two hours of video footage with the Intel film crew that will be posted later this year to Intel’s Cloud Builders site, and spent a collective 12 hours on the demo floor meeting with people to give live demos of what we had done.  Needless to say, I am ready to get on my plane tomorrow morning and come home to North Carolina!

If you want the general overview for the solution and don’t care too much about the details, you can click here.  If you care about the details and want to learn more, continue reading.

First, let’s talk about the challenges we are trying to address.  In this solution, we’re trying to address two major issues that service providers face.  While these are targeted at service providers, it’s important to realize that any enterprise building out their own private cloud will face similar challenges.

The first is workload migration, or moving your VMs from one data center to another to avoid disaster, recover from it, perform maintenance, or other activities that might require you to move your workloads.  For example, let’s say you have a data center on the east coast and have a hurricane approaching.  You may want to plan for the worst and move your VMs from that data center to a data center outside the path of the hurricane.  The challenge here is how to do this non-disruptively without causing downtime to your applications and users, thus avoiding potential lost revenues or productivity.

The second issue we’ve helped address here is guaranteeing the integrity of the underlying hosts that your workloads are running on.  As enterprises move their workloads to cloud providers or build their own, they are asking questions like:

  • How will the cloud infrastructure be verified?
  • How do I know the hosts that my VMs are running on are secure?
  • Will I be able to satisfy my audit and compliance requirements in this environment?  Ultimately an enterprise owns their audit process but when parts of their infrastructure or applications are managed by a service provider, that provider has to be a partner in the audit process.

Look at it this way.  If a single hypervisor is attacked using something like a BIOS rootkit attack, the attacker could compromise dozens of systems in an enterprise cloud or dozens of customers in a multi-tenant cloud provider environment.  Attacks at this level are designed to evade your typical runtime security software like your anti-virus.

Based on those challenges, we wanted to design a joint solution to help address the needs.  First, we wanted to show how EMC VPLEX could be used to enable non-disruptive workload migration within and between data centers.  Second, we wanted to show how Intel Trusted Execution Technology, or TXT, could be used to help add security to a service provider cloud environment by allowing security policy enforcement based on TXT trust status.  Finally, we wanted to show how the TXT control procedures could be used in overall compliance reporting, providing an end to end view of the trust status for the hosts running your cloud environments.

So here is an overview of the solution we built.  We are representing a single service provider who has a cloud infrastructure stretched across two data centers.  We have ESXi 5.1 on all but one host because we wanted that host to be out of compliance, as we’ll discuss later.  We have Intel TXT enabled on the hosts.  We’re using EMC VNX storage arrays and EMC VPLEX Metro which is a storage virtualization appliance that we’ll also discuss later.  We are using HyTrust Appliance for the active security policy enforcement and we are pulling both security logs and TXT host trust status into RSA Archer eGRC to provide overall compliance reporting.

Let me start off by giving a brief overview of EMC VPLEX.  The bottom line with VPLEX is that it is an in-band storage virtualization device that sits between your hosts and your storage and gives us the ability to export virtual volumes from the underlying storage arrays simultaneously.  To my hosts, VPLEX appears as a target and to my targets, it appears as a host.  It gives me an instant copy of my data at both locations.

At the base you have the physical storage layer.  Next is the virtual storage layer with VPLEX, which supports heterogeneous storage arrays and can create virtual volumes across these different arrays.  You then have the physical host layer with VMs on top of that.

Now the really interesting bits of this solution come up when we introduce the second site.  VPLEX’s AccessAnywhere technology allows you to export a single virtual volume from both of these VPLEX clusters simultaneously.  From the perspective of workload mobility, my data is now already at both sites.  So when it comes time to move VMs from one site to another, all I have to do is a simple vMotion.  I no longer have to worry about Storage vMotion or replicating large amounts of data from one site to another because my data is already there.  This eliminates the time needed for your data to be moved from one site to another and there are a ton of interesting solutions that can be built on top of this technology.

So we now have our baseline infrastructure.  We have our two data centers with storage, network, and compute and we are using storage virtualization to enable our non-disruptive migration between sites.  Now let’s focus on the security aspect.

For those that are not familiar with Intel TXT, let me give a brief overview of it for reference.  TXT is a hardware based security technology that is built into current Intel chipsets.  The bottom line is that it allows me to specify a known good configuration for the hosts in my cloud environment, and then measure every host in the environment against that known good configuration each time a system is booted.  During that process, parts of the BIOS and hypervisor are measured and if they match the known good values, that host is given a label of “trusted.”  If the values don’t match, that means something has changed that should not have changed and a “not trusted” label is applied to the host.  We can then take that trust status and bring it into our security applications.
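The measure-and-compare idea can be sketched in a few lines of Python.  This is only an illustration of the logic described above, not how TXT or Intel’s attestation software is implemented: real measurements are taken by the hardware at launch, whereas here `measure` is a hypothetical helper that simply hashes bytes, and the component names and byte strings are made up.

```python
import hashlib


def measure(component_bytes: bytes) -> str:
    """Hypothetical stand-in for a launch-time measurement: a SHA-256 digest."""
    return hashlib.sha256(component_bytes).hexdigest()


def trust_status(known_good: dict, measured: dict) -> str:
    """Return "trusted" only if every known good component value matches."""
    for component, good_value in known_good.items():
        if measured.get(component) != good_value:
            return "not trusted"
    return "trusted"


# Known good configuration (made-up contents standing in for BIOS and hypervisor code).
known_good = {
    "bios": measure(b"bios image v1"),
    "vmm": measure(b"hypervisor build A"),
}

# A host that booted the expected BIOS and hypervisor.
host_ok = {
    "bios": measure(b"bios image v1"),
    "vmm": measure(b"hypervisor build A"),
}

# A host whose hypervisor differs from the known good configuration.
host_changed = {
    "bios": measure(b"bios image v1"),
    "vmm": measure(b"hypervisor build B"),
}

print(trust_status(known_good, host_ok))
print(trust_status(known_good, host_changed))
```

The first host comes back “trusted” and the second “not trusted,” because any change to a measured component changes its digest.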

The first thing we are going to do with our TXT trust status is bring it into RSA’s Archer eGRC platform and specifically RSA’s Solution for Cloud Security and Compliance.  This solution is based on the Archer eGRC platform.  As of the current release, over 130 VMware-specific controls have been added to Archer to enable VMware security policy implementation and management tied directly to regulations, such as PCI and HIPAA.  This RSA solution does two things.  It discovers new virtual infrastructure devices and it interrogates those devices against the control procedures to verify VMware security controls have been implemented correctly. The results of these automated discovery and configuration checks are fed directly into Archer for continuous monitoring across the cloud environment.  For this solution, we have now brought in Intel TXT related control procedures on top of the existing controls for cloud environments.  This allows us to gain a high level view of our overall hardware compliance, in addition to all the benefits we’ve previously had with our GRC system.

In addition to simply using that trust status for overall reporting of our cloud infrastructure integrity, we can also bring the TXT trust status into HyTrust Appliance.  That solution sits in between the administrators and vCenter and gives me the ability to create administrative policies based on that trust status.  In our solution, we have set up policies that prevent an admin from moving a virtual machine, or workload, from a trusted host to an untrusted host.  If that is attempted, HTA will block it and generate real-time security events that can also be fed into RSA Archer’s Incident Management view so that actions can be taken to mitigate that risk.
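As a rough sketch of that policy logic, assuming nothing about how HyTrust Appliance actually implements it, the rule can be written as a small check that runs before a migration request is passed along.  The `Host` class, the `check_migration` function, and the host and VM names below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    trusted: bool


class MigrationBlocked(Exception):
    """Raised when a migration request violates the trust policy."""


def check_migration(vm: str, source: Host, target: Host, events: list) -> None:
    """Allow the move unless it goes from a trusted host to an untrusted one.

    Every decision is appended to `events`, standing in for the security
    events that would be forwarded to a reporting system.
    """
    if source.trusted and not target.trusted:
        events.append(f"BLOCKED: {vm} from {source.name} to {target.name}")
        raise MigrationBlocked(f"{target.name} is not trusted")
    events.append(f"ALLOWED: {vm} from {source.name} to {target.name}")


events = []
esx1 = Host("esx1", trusted=True)
esx2 = Host("esx2", trusted=True)
esx4 = Host("esx4", trusted=False)

# A move between two trusted hosts goes through.
check_migration("web01", esx1, esx2, events)

# A move from a trusted host to an untrusted host is refused and logged.
try:
    check_migration("web01", esx2, esx4, events)
except MigrationBlocked as err:
    print("Migration blocked:", err)

print(events)
```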

So now we have the complete solution.  We have our data centers with EMC storage.  We are using EMC VPLEX to export distributed virtual volumes from those data centers to present to our hosts so that we can enable non-disruptive workload migration within and between the data centers.  We have enabled Intel TXT on all of our hosts, we have created our white list server, and we are measuring each cloud host against that known good configuration.  We are then taking that TXT trust status and creating policies that restrict movement of our workloads to a host that is not trusted.  And finally, we have wrapped everything into RSA Archer for both high level compliance views/reporting as well as real time incident event management.

Now let’s take a walk through some screenshots so you can get an idea for what this looks like in the real world.  This first screenshot is the trust attestation server or verification server.  This is what polls vCenter for the hosts and stores the overall trust status of the hosts.  This is an application developed by Intel so companies can take advantage of TXT on their server platforms.  Notice that three of the four hosts in the demo environment have an overall trust status of green.  There is one that is untrusted and if you notice, that’s because the VMM status is negative.  On this host, we installed ESXi 5.0 which does not match our white list server running ESXi 5.1.

This next shot is a view of the Cloud Security and Compliance view in RSA Archer.  As you can see from the graph in the upper left corner, our overall compliance rating is 75% which represents 3 out of our 4 hosts with a label of “trusted.”
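The arithmetic behind that rating is simple enough to show.  The sketch below is not Archer’s calculation, just the same ratio worked out in Python; the host names are made up.

```python
def compliance_rating(host_status: dict) -> float:
    """Percentage of hosts whose trust status is "trusted"."""
    trusted = sum(1 for status in host_status.values() if status == "trusted")
    return 100.0 * trusted / len(host_status)


# Four demo hosts, one of which failed attestation (hypothetical names).
hosts = {
    "esx1": "trusted",
    "esx2": "trusted",
    "esx3": "trusted",
    "esx4": "not trusted",
}

print(f"{compliance_rating(hosts):.0f}%")  # prints 75%
```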

Next, I’ll show a shot of vCenter after we have attempted to migrate a VM from one of our trusted hosts to that untrusted host.  When I attempt to do that, HyTrust Appliance blocks the migration and I get an error in vCenter.

We can then view specifics of that log by going to the HyTrust tab within vCenter.  It is from here that I can do all administrative functions for the HyTrust Appliance and view my logs for the enforcement actions.

Finally, I can show those logs being brought into RSA Archer for my real time security event management.  Once I have them coming in, I can trigger other actions to help mitigate my risk.

So that’s it end to end.  Overall it was a great project and we are targeting Q1 of 2013 for the general availability of all the parts.  As part of the release, we are planning to produce an EMC Proven Solution Guide around the solution as well as an Intel Cloud Builders document.  We will also have a complete video demo of the solution available at that time.

Cisco Cloud Services Router – Brief Introduction For Service Providers

For those that didn’t hear, Cisco announced their Cloud Services Router at Cisco Live this year in San Diego.  They didn’t put much emphasis on it at all and if you weren’t paying attention, you would have missed it completely.  The last I heard, it was scheduled for GA towards the end of 2012 but there are some pre-release versions available if you want to get your feet wet.

So what is it?  Basically, it’s an IOS-XE router running on a VM in your cloud environment.  That’s the bottom line.  That alone should generate lots of grand thoughts about possible use cases for those that have been involved in deploying large SP cloud environments.  There is a lot of potential here for the SP and cloud provider markets in general and I was surprised by how little attention this got at Cisco Live.

Having the ability to set up a VM in your cloud environment as the gateway for that environment has a lot of potential for an SP cloud offering.  Think about it for a minute.  How are you bringing customer connectivity into your cloud environment today?  You might be terminating MPLS VRFs at the provider edge and then extending VLANs into the customer cloud environments or you might be terminating VPN services at a VPN gateway somewhere upstream of their cloud and then extending the environment back from that device.

CSR opens up the opportunity to have end-to-end customer connectivity all the way to the customer’s cloud environment.  It will be able to serve as the MPLS or VPN gateway for your cloud environments without the need for additional specialized upstream gear to handle these functions.  You’ll be able to do full L2 over WAN connectivity from your customer’s sites/data centers to your IaaS infrastructure.  This could be huge for the SP market, which is still struggling to figure out all the various connectivity models to get to their customers outside of the facilities hosting their cloud pods.  The simple fact of being able to move the VRF termination to the cloud edge eliminates having to use VLANs from the provider edge to the cloud edge, which can be a very big challenge in a lot of SP designs.

CSR will support the protocols you are probably running in your network now (OSPF, BGP, etc.) and CSR will also function as a LISP tunnel router, allowing layer 3 address mobility between your cloud environments in different data centers.

On top of all that, there appear to be some firewall capabilities that will be more than just the standard router ACLs.  Depending on the extent of what gets released, this could function as your cloud customer’s perimeter gateway or at least offer another layer of firewall services for added security (perhaps use it in combination with ASA1000v?).

It should be very interesting to see this when the final version makes it to market.  I’m looking forward to seeing how it will impact some of our SP architectures moving forward.
