Thursday, 31 March 2016

Office 365 performance – our Azure CDN image renditions solution

In the last post image renditions causing slow page loads in SharePoint Online, I talked about how Office 365/SharePoint Online has some sub-optimal performance around images and image renditions, at least at the present time. Numerous people got in touch to say they also see the same issue. However, we are implementers - and we bring solutions, not problems! So in this post I’ll go into a bit more detail on our way of working around this challenge, and how it can improve page load times.

To recap, the problem is related to the image renditions functionality of SharePoint. This is a useful feature which automatically creates additional versions (sizes) of images in a publishing site such as an intranet. However, when a user hits a page which has these images - often a home page or section page – and they need to be downloaded, we see a big delay of up to 3 seconds. Clearly if a page is taking say, 5 or 7 seconds to download in total, this is a big chunk of this time. Surprisingly, the delay is NOT the actual image being sent over the wire to the user. Instead, analysis shows the 3 seconds or so pause happens on the server in SharePoint Online – most likely because of “cache misses” due to the fact that the renditions framework wasn’t originally designed for architectures used in Office 365. So, performance of this bit of the platform isn’t optimal - our solution was to roll our own renditions framework, and this post describes what we did.

Using Azure to implement renditions

Before delving into the implementation, here’s how I described the process last week:

  1. An intranet author adds or changes an image in SharePoint
  2. A remote event receiver executes, and adds an item to a queue in Azure (NOTE – RERs are not failsafe, so we supplement this approach with a fall-back mechanism which ensures broken images never happen. More on this below).
  3. An Azure WebJob processes the queue item, taking the following steps:
    1. Fetches the image from SharePoint (using CSOM)
    2. Creates different rendition sizes (using the sizes defined in SharePoint)
    3. Stores all the resulting files in Azure BLOB storage
  4. The Azure CDN infrastructure then propagates the image file to different Azure data centers around the world.
  5. When images are displayed in SharePoint, the link to the Azure CDN is used (courtesy of a small tweak to our display template code). Since CDNs work by supplying one URL, the routing automatically happens so that the nearest copy of the image to the user is fetched.

For those interested, let’s go into the major elements:

The Remote Event Receiver and associated SharePoint app/add-in

There are two elements here:

  • A SharePoint add-in used for our remote code (hosted in Azure) to authenticate back to SharePoint
    • We register this add-in using AppRegNew.aspx, specifying a Client ID and client secret which will be used for SharePoint authentication
  • A remote event receiver used to detect when new images are added

The two are related because we implement the RER code as a provider-hosted add-in using the Visual Studio template (which gives 2 projects, one for the app package and one for the app web). In actual fact, this particular RER doesn’t need to communicate back to SharePoint – when it fires, it simply adds an item to a queue in Azure which we created ahead of time. The object added to the queue contains the URL of the image which was just added or modified, and we use the Azure SDK to make the call.

We apply this RER to all the image libraries in the site which needs the solution. We do this simply as a one-off setup task with a PowerShell/CSOM script that iterates through all the subsites, and for each image library it finds it binds the RER. My post Using CSOM in PowerShell scripts with Office 365 shows some similar snippets of code which we extended to do this. The script can be run on a scheduled basis if needed, so that any new image libraries automatically “inherit” the event receiver.

The Azure WebJob

The main work is done here. The job is implemented as a “continuous” job in Azure, and we use an Azure QueueTrigger to poll the queue for new items. This is a piece of infrastructure in Azure that means that a function in our WebJob code is executed as soon as a new item is added to the queue – it’s effectively a monitor. We initially looked at using a BlobTrigger instead (and having the RER itself upload the image to Azure BLOB storage to facilitate this), but we didn’t like the fact that BlobTrigger can have a bigger delay in processing – we want things to be as immediate as possible. Additionally, remote event receivers work best when they do minimal processing work – and since a quick async REST call is much more lightweight than copying file bytes around, we preferred this pattern. When a new item is detected, the core steps are:

  1. Fetch details of the default rendition sizes defined in SharePoint for this site. This tends to not change too much, so we do some caching here.
  2. Fetch details of the *specific* rendition sizes for this image, using a REST call to SharePoint. We need to do this to support the cool renditions functionality which allows an author to specifically zoom-in/crop on a portion of the image for a specific rendition – y’know, this thing:

    Image renditions - crop image

    If an author uses this feature to override the default cropping for a rendition, these co-ordinates get stored in the *file-level* property bag for the item, so that’s where we fetch them from.
  3. Fetch the actual image from the SharePoint image library. We use CSOM and SharePoint add-in authentication to make this call – our WebJob knows the Client ID and client secret to use. We obtain the file bytes i.e. effectively downloading the file from Office 365 to where our code is running (Azure), because of course we’re going to need the file to be able to create different versions of it.
  4. For each rendition size needed:
    1. Resize the image to these dimensions, respecting any X and Y start co-ordinates we found (if the author did override the cropping for this image). There are many ways to deal with image resizing, but after evaluating a couple we chose to use the popular ImageProcessor library to do this.
    2. Upload each file to Azure BLOB storage. We upload using methods in the Azure SDK, and ensure the file has a filename according to a URL convention we use – this is important, because our display templates need to align with this.

Once the files have been uploaded to Azure BLOB storage, that’s actually all we need to worry about. The use of Azure CDN comes automatically from files stored there, if you’ve configured Azure CDN in a certain way. I’ll cover this briefly later on.

Authentication for the Azure WebJob

I thought long and hard about authentication. In the end, we went with SharePoint app-only authentication, but we also considered using Office 365/Azure AD authentication for our remote code. Frankly that’s my “default” these days for any kind of remote code which talks to SharePoint (assuming we’re talking about Office 365) – as discussed in Comparing Office 365 apps with SharePoint add-ins, there are numerous advantages in most cases, including the fact that there is no “installation” of an add-in, and the authentication flow can be started outside of SharePoint.

However, one advantage of using SharePoint authentication is that we aren’t tied to using the same Azure subscription/directory as the one behind the Office 365 tenant. Our clients may not always be able to support that, and that was important for us in this case – using this approach means we don’t have that dependency.

Display templates

As mentioned previously, a big part of the solution is ensuring SharePoint display templates align with file URLs in the CDN. So if we’re using roll-up controls such as Content Search web parts around the site and these reference rendition images, these also need to “know the arrangement”. Effectively it’s a question of ensuring the thing that puts the file there and the thing that requests the file are both in on the deal (in terms of knowing the naming convention for URLs). It’s here that we also implement the fall-back mechanism (more on this later) to deal with any cases where a requested image isn’t found on the CDN. In terms of the swapping out the default behaviour of fetching images from SharePoint to fetching them from the CDN instead, it just comes down to how the value used within the <img src> attribute is obtained:

<img src="_#= imgSrc =#_" />

Simply implement a function to get that value according to your URL convention, and you’re good. Although not shown in the snippet above, it’s here that our fall-back mechanism is called, courtesy of the ‘onerror’ handler on the <img> tag.

WebAPI

Since we’re talking about architecture pieces, there’s some WebAPI thrown in there too – this is part of the fall-back mechanism, described later.

Azure CDN configuration

As mentioned earlier, the CDN part is easy with Azure. When a file gets uploaded to Azure BLOB storage, it gets a URL in the form:

https://[MyAzureContainer].blob.core.windows.net/MyImage.jpg

..but if you configure the CDN, it can also be accessed on something like:

https://[MyCDNEndPoint].azureedge.net/MyImage.jpg 

When the latter URL is used, in fact the file will be requested from the nearest Azure CDN data center to the user. If the file hasn’t propagated to that location yet, then the first user to be routed through that location will force the file to be cached there for other users in that geographical region. Our testing found this additional delay is minimal. There are a few more CDN things to consider than I’ll go into detail on here, but initial configuration is easy – simply create a CDN configuration in Azure, and then specify it is backed by Azure storage and select the container where you’re putting your files. The images below show this process:

image

Create CDN endpoint from BLOB storage

The fall-back mechanism

So I mentioned a few times what we call “the fall-back mechanism”. I was always worried about our solution causing a broken image at some point – I could just imagine this would be on some critical news article about the CEO, on a big day for one of our clients. Fortunately, we were able to implement a layer of protection which seems to work well. In short, we “intercept” a broken image using the HTML 5 ‘onError’ callback for the <img> tag. This fires if an image isn’t found on the CDN for any reason, and this kicks off our mechanism which does two things:

  1. Substitutes the original rendition image from SharePoint - this means we’re “back to the original situation”, and we haven’t made anything worse.
  2. Makes a background async call to our WebAPI service – this adds an item to our queue in Azure, meaning the image gets processed for next time. This is the same as if the RER fired against this particular file.

The image below shows what happens (click to enlarge):

CDN image renditions - fallback mechanism 2

One nice thing about this mechanism is it works for existing images in a site. So if the mechanism is implemented in an existing site with lots of images, there’s no need to go round “touching” each image to trigger the remote event receiver. Instead, all that needs to happen is for someone to browse around the site, and the images will be migrated to the CDN as they are requested.

Challenges we encountered

Along the way we faced a couple of challenges, or at least things to think about. A quick list here would include:

  • Thinking about cache headers from Azure CDN, cache expiration and so on – this relates to scenarios where an author may update an image in SharePoint but not change the filename. Clearly end-user browsers may cache this image (and an end-user can’t be expected to press CTRL F5 to do a hard refresh just because you’ve updated a file!). My colleague Paul Ryan wrote a great post on this at Azure CDN integration with SharePoint, cache control headers max-age, s-maxage
  • Parallel uploads to Azure (e.g. if we’re creating 8 different sizes for image found, may as well upload them in parallel!)
  • Ensuring we understand how to handle different environments (dev/test/production tenants with different Azure subscriptions)
  • Implementing a nice logging solution
  • Testing

Summary

As I summarized last time, it would be great if the original performance issue in Office 365 didn’t occur. But CDNs have always been useful in optimizing website performance, and in many ways all we’re doing is broadening Microsoft’s existing use of CDNs behind Office 365. The building blocks of Azure WebJobs, Azure file storage, CDN, SharePoint add-ins, remote event receivers, WebAPI and so on mean that the Office 365/SharePoint Online platform can be extended in all sorts of ways where appropriate. This was a solution developed for clients of Content and Code so it’s not something I can provide the source code for, but hopefully these couple of articles help awareness of the issue and architectural details of one way of working around it.

Tuesday, 22 March 2016

Office 365 performance – image renditions causing slow page loads in SharePoint Online

Just like any website, there are many reasons why page load times might not be amazing in SharePoint Online. Perhaps it’s a page with too many ‘heavy’ controls (e.g search web parts), a particularly slow custom control, the amount of data going over the wire (e.g. due to large images JavaScript/CSS files), use of a known performance killer such as structural navigation, or maybe things are slow from the office due to network infrastructure such as reverse proxies slowing things. If users are far away from where the Office 365 tenant is located, that can certainly exacerbate things. As always, if the site has any kind of customization, some optimization steps need to be taken - good performance won’t always happen by default. Recently however, we’ve been noticing slow page loads even in:

  • Sites we have optimized
  • Out-of-the-box publishing sites

Analysis showed that the issue was related to image renditions in SharePoint. If you’re not familiar with this feature, it does something useful which, ironically, is intended to improve site performance. For each image uploaded, multiple resized versions are automatically created in the background – the idea is that end-users don’t download a large ‘original’ image when only a tiny thumbnail is needed. A classic example is large images added to content pages, which are then shown as a list of rolled-up links on a home page e.g. “most recent news articles”.

If it wasn’t for the performance issue, the principle works well – a 4MB image is typically shrunk to around 200k for a typical size, and that’s a lot less data being downloaded to users. I’m not sure if anything has changed recently in Office 365 (since most sites I’ve been involved in use image renditions), but a couple of clients noticed the issue around the same time we did. Specifically, pages with renditions are slow on “first-time” page loads i.e. whenever the images need to be downloaded because they are not served from the local browser cache. But unfortunately it’s not just that - rendition image files are served with expiry headers of 24 hours, meaning even regular users will have at least one very slow page load every 24 hours. And of course, we might not just be talking about their home page – rather, every page they hit could be slow once every 24 hours. That’s certainly enough to damage user perceptions about Office 365.

Sidenote – first-time page loads vs. returning user page loads
So renditions are slow even outside of first-time page loads. But whilst we’re on the subject, how much do we need to care about first-time page loads anyway? I typically advise my clients not to worry too much – frankly, Office 365 will always be slow here since there are some pretty heavy JavaScript files that need to be downloaded (even if they do come from a CDN). It’s a rich, highly-functional platform after all. But this is very much part of the Office 365 design – I believe Microsoft take the view that users are forgiving, so long as their *subsequent* browsing experience is quick. Most users don’t have the same expectations of their corporate intranet/collaboration platform as they do of public consumer sites such as Facebook and Google – and since most intranet usage is *not* first-time page loads, things work out in the end. I agree with this viewpoint frankly.

But why are image renditions slow?

When further analysis is performed, we see the delay happens on the server – the big surprise is that the delay is not for the actual image to be downloaded, which is what you’d normally expect. The following image shows a real home page being loaded, and we can see a delay of multiple seconds for many images (each one being a “rendition” image), as indicated by the long green bars:

Rendition image delays_Small

When we dig deeper, we see the delay is not in the content download, but is in the “waiting” stage – this indicates the delay is with Office 365 itself, and not in the actual file being downloaded:

Rendition image delays - detail_Small

We believe this happens because there are typically “cache misses” on rendition images being served from the BLOB cache. When the image is not served from the blob cache, the SharePoint Online infrastructure is very slow to process and serve the image. It seems that cache hits are very rare for end-users – possibly due in part to BLOB cache settings in SPO (e.g. disk size allowed), but more likely due to sheer number of front-end servers in a typical SPO farm. I’m told that some of the larger farms in the service have between 100 and 200 front-end servers now – clearly this is a very different situation to an on-premises environment which SharePoint was originally designed for. So whilst the renditions architecture would be very effective in a typical on-prem farm of say, 3-6 front-end web servers, in the Office 365 world this is not the case. Of course, if you work closely with the product you sometimes see some examples of things like this that didn’t quite translate perfectly to the cloud world. That said, having worked on many deployments I’m always amazed at just how well SharePoint does work as a service (no doubt due to some hard work from talented Microsoft engineers, including some folks I know) - but there will always be “opportunities for improvement”, and the service often evolves to include these.

Our solution (based on Azure CDN)

So what can we do about it? Well, we could just avoid using SharePoint image renditions completely, but then performance would still be poor due to large files being downloaded to users. So we definitely do want to use different image sizes – and since the core work of resizing images isn’t that hard, why not do it ourselves? We could then take advantage of other things, like some automatic use of a CDN such as Azure CDN (N.B. that link explains what a CDN does if you’re not familiar). This is the direction we took to work around the performance issue in SharePoint Online. Things work pretty well, and what we implemented has the following benefits:

  • The Office 365 renditions delay does not occur
  • There is no impact to intranet end-users (except the improved performance)
  • Intranet authors do not need to do anything different or have additional training
  • The image files are hosted in Azure CDN (Content Delivery Network), which places the files in various Azure datacentres around the world, to ensure they are close to the user. This can significantly boost performance further for some users, especially those far away from the Office 365 datacentres (e.g. non-European users in the case of most of our clients)
  • All of the capabilities of the image renditions framework are supported (e.g. the ability for an author/administrator to crop an image rendition so that a certain portion of the image is used)

Technical architecture of our solution

I’ll go into more detail in the next post, but briefly our solution works as follows follows:

  1. An intranet author adds or changes an image in SharePoint
  2. A Remote Event Receiver executes, and adds an item to a queue in Azure (NOTE – RERs are not failsafe, so we supplement this approach with a fall-back mechanism which ensures broken images never happen. More on this next time).
  3. An Azure WebJob processes the queue item, taking the following steps:
    1. Fetches the image from SharePoint (using CSOM)
    2. Creates different rendition sizes (using the sizes defined in SharePoint)
    3. Stores all the resulting files in Azure BLOB storage
  4. The Azure CDN infrastructure then propagates the image file to different Azure datacentres around the world.
  5. When images are displayed in SharePoint, the link to the Azure CDN is used (courtesy of a small tweak to our display template code). Since CDNs work by supplying one URL, the routing automatically happens so that the nearest copy of the image to the user is fetched.

This is depicted below:

CDN image renditions for SPO

Fall-back mechanism

Clearly it’s critical that an intranet home page never displays a broken or missing image – that would be A Very Bad Thing for most organizations. So how can we guard against that? Also, we said that Remote Event Receivers cannot be 100% reliable (and also should not do “heavy” processing work), so…..what about that? And what about existing images in a site that was running before a solution like this is implemented? My colleague Paul Ryan and I wrestled with these challenges and more as we architected the solution and wrote the code - I’ll talk more about the fall-back mechanism (which takes care of these aspects) and go into more detail on the technical implementation in the next post. 

Summary

It would be great if this issue didn’t exist in Office 365 in the first place of course. But I write this post to show that with the right building blocks, we can certainly supplement functionality around Office 365/SharePoint with some effort. This was a solution developed for clients of Content and Code so it’s not something I can provide the source code for, but hopefully this write-up helps awareness of the issue and potential ways of working around it. More next time..