Not on Our Watch: an Introduction to Application Performance Monitoring

Photo by Carlos Muza

How do you debug an inexplicable glitch on your website? How do you find the point(s) of failure in your application if and when they occur? Where do you turn to troubleshoot these problems? In addition to the variety of performance-related modules in Drupal, the growing number of third-party products and services available to analyze and maintain the operational health of your site can be daunting.

Take the mystery out of your application’s performance and squash small problems before they balloon into bigger messes by monitoring your site’s resources and runtime with an option that fits your needs. From free, open-source tools to full-on enterprise solutions, there's something for everyone no matter what your size and budget.

Topics will include:

  • The value of gaining visibility into resource utilization.
  • Application performance monitoring options in the current landscape.
  • Best practices for monitoring.
  • Integration strategies for automating monitoring tasks and customizing metrics.

This session is for anyone who wants to explore the simple and in-depth ways to understand and ultimately optimize their website's performance.

MidCamp 2018 / March 10, 2018

Transcript

Clare: Welcome to the session Not on Our Watch, an Introduction to Application Performance Monitoring. My name is Clare Ming. Here's my contact info if anyone wants to reach out at some point after the talk. I am a developer with Chromatic. We are a fully distributed digital agency, and we are also one of the sponsors here at MidCamp, so shoutout to Chromatic. I have the slides available here. If anyone has trouble seeing this, you can just go to that link and download the PDFs for this. I'll just leave that up for a few seconds.

[pause 00:00:46]

Clare: We're all set? Okay. Again, welcome. Before we get into what application performance monitoring is and what it can do for us, I want to just take a moment here to talk about where we've come in this industry in terms of speed and delivery of our apps and sites over the years. As our technology stacks and tools evolve, one of the biggest concerns for all of us involved in building apps and sites is and remains, of course, performance.

Let's take a look at how performance expectations on the internet have changed over the years. Here are some choice quotes from a woman named Maile Ohye. She until very recently was the Developer Programs Tech Lead at Google since the mid [unintelligible 00:01:54]. Some of you might be familiar with who she is. She's done a lot of popular YouTube videos on the internet about SEO search rankings.

Anyway, she says, "Two seconds is the threshold for e-commerce website acceptability." "A fast site increases conversions." "Site performance is a factor in Google rankings." What's interesting about these quotes is that I pulled them from a YouTube video that Maile Ohye did back in the spring of 2010, so that's nearly eight years ago. Now here's a fellow named John Mueller-- Mueller, not sure how to say his name. He is a Webmaster Trends Analyst at Google.

Here he's tweeting in response to someone asking him about optimal page load limits. He says, "There's no limit per page. Make sure they load fast for your users. I often check webpagetest.org and aim for under two to three seconds." I don't know if you can see the date, but note that he's tweeting this at the end of 2016, so interestingly enough, this metric of two to three seconds for page load times hasn't really changed much over the years.

What has changed, however, is the relevance of mobile sites and mobile devices. Earlier this year in January, Google made an announcement on their Webmaster Central blog, saying that, "Google is switching to a mobile-first index this summer. Page speed will be a ranking factor for mobile searches starting this July." DoubleClick, which is Google's digital marketing platform, published some research in the fall of 2016. In that, they said, "To keep people engaged, mobile sites must be fast and relevant."

In that research, they did an analysis of more than 10,000 mobile web domains and found that most mobile sites actually don't meet this bar. In that research, they said, "The average load time for mobile sites is 19 seconds over 3G connections." To put those 19 seconds in context, that's basically as long as it takes to sing the alphabet, so quite some time. In that study, they said that "53% of mobile site visits are abandoned if pages take longer than three seconds to load." The obvious takeaway here is that slow-loading sites frustrate users and negatively impact product owners and publishers.

Think with Google is a platform where Google shares the latest marketing research, digital trends, and consumer insights. Last month, they published an article with the new industry benchmarks for mobile page speed. Here's an infographic from that study that they put out in that article talking about mobile page load times. You can see, as the page load time goes from one to three seconds, the probability of bounce increases 30%.

Tack on two more seconds, and the bounce probability triples to 90%, and so on and so forth. You can see how the numbers go. The good news is, though, that since that time that Google looked at mobile page speeds about a year ago, the average time it takes to fully load a mobile page dropped by seven seconds. Even with that gain, the bad news is that it still takes about 15 seconds according to their new analysis, and that's just way too slow when you consider that more than half (53%) of mobile site visits leave a page that takes longer than three seconds to load.

Google's data shows that while more than half of overall web traffic comes from mobile, conversions are lower than on desktop. The internet has a lot of work to do for the half of the mobile sites that are out there in the wild. Then in a similar article that Think with Google published just a month ago in February, they looked at 11 million mobile ads landing pages spanning more than 200 countries. The results of their study revealed some pretty disquieting observations.

It confirmed their thesis that even as most traffic is now occurring on 4G rather than 3G connections, the majority of mobile sites are still slow and bloated with way too many elements. All this is to say that lower is better when it comes to how quickly a mobile page should display content to users. As of February 2018, less than a week ago, the best practice is to serve mobile pages in under three seconds, along with desktop as we all know.

In an article called The need for mobile speed: How mobile latency impacts publisher revenue, DoubleClick, which again is a property owned by Google, shared that publishers whose mobile sites load in five seconds earn up to two times more mobile ad revenue than those whose sites load in 19 seconds. In that Think with Google piece from last month about mobile page speed industry benchmarks, the basic premise and the conclusion backed by their data and their research is that speed ultimately equals revenue, and faster is better, and less is more.

Just to get a read on where everyone's at, is everyone familiar with APM, or application performance monitoring? Yes? Some folks. Okay, good. Because this is an introduction, let's maybe define some terms and get into what exactly is application performance monitoring, or I should say management? The M can refer to both in the acronym. Most people consider application performance monitoring a subset of application performance management.

According to Wikipedia, APM is the monitoring and management of performance and availability of software applications. The purpose of it is to detect and diagnose complex application performance problems to maintain an expected level of service. APM ultimately is the translation of IT metrics into real business meaning or value, like, as we saw earlier, speed definitely being tied to revenue for most businesses.

What in essence can APM do, for all of us looking at this? Succinctly, it can do several key things. Measure and monitor application performance. Find the root causes of application problems, and identify ways to optimize application performance. To get a little bit more granular, here's a definition of APM, according to Gartner, which is a technology research firm. They identify five main functional dimensions for the components of APM software and/or hardware.

The first being end-user experience monitoring. APM strives to capture user-based data to gauge how well the application is performing and identify potential performance problems. Second, application topology discovery and visualization which basically means, the visual expression of the application in a flow map or graphically to establish all the different components of the application and how they interact with each other. Then third, user-defined transaction profiling. This is using the software to examine specific interactions to recreate conditions that lead to performance problems for the purposes of testing.

Then, we have application component deep dive. Collecting performance metrics pertaining to the individual parts of the application identified in the second dimension which was the visualization of performance. Lastly, IT operations analytics which is taking everything that your company has learned in all the previous four dimensions and discovering usage patterns, identifying performance problems, and anticipating potential problems in order to avoid them and prevent them before they happen.

Before we dive a little bit further into the weeds of APM, I want to take a moment here to just tell you all a story. This is the story of how and why I became an APM evangelist. Last year, I was working on a dev team for a major online content publisher. It was a D7 site and during the course of my time on that project, we encountered some strange phenomenon that was totally perplexing all of us. We couldn't figure it out. In the next series of slides, I'm just going to share some screenshots from the APM tool that we used to help us diagnose and ultimately solve a really hard problem.

I actually blogged about it. You can go to Chromatic's website and get all the gory details of what happened and what we did. In a nutshell, this client has their application wired to an enterprise APM solution called New Relic. What we're looking at here is a view of web transactions over a targeted period of time. You'll notice-- I don't know if you can see it, all these vertical lines on the graph. Those are warnings. Lots of them. Lots and lots of warnings over a period of time.

We knew something was up because we kept getting alerts from the system, from New Relic. One thing I wanted to point out here is the vertical axis, the y-axis of this graph: that's milliseconds of response time. Our response times were good. We were coming in under 700 milliseconds, which is seven-tenths of one second. Remember when we were first reviewing performance metrics and how they've changed over the years? Keep it under two to three seconds. We were fine, then. We were doing really great but we kept getting these alerts and it was driving us mad.

It motivated us to try to get to the bottom of what was going on. Based on what New Relic was telling us, we began a process of just deeply examining the application. Analyzing its queries. What were its longest transactions? Trying to eliminate inefficiencies wherever we found them. With this information, we started whittling down problems in the codebase, doing everything we could to just stave off these warnings.

Interestingly enough, with everything that we did, we just kept getting these alerts and we couldn't-- it was driving us nuts, like, "What is going on?" Then one fateful day, we got a spike in our tooling that nearly brought us down to our knees, but not for too long, thankfully, because what happened at that moment was that we got some really critical information that ultimately led us out of the darkness and into the light.

There was a silver lining in that big spike, and because of the granularity with which we could drill down into the APM software and into the interface, we could see what transactions were taking way too long. It was then that we finally got the clue that led us to the resolution of this grand mystery around these alerts. What we're looking at here now is the aftermath of all the work that we did, that we put into optimizing the code, cleaning up slow queries, all the things that we tried and did to get the application back on track.

You can see that the frequency of alerts dropped significantly, thank goodness and that was a huge relief for us. Tremendous, tremendous relief.

Speaker: It's like you disabled the external web [unintelligible 00:15:18]

Clare: Yes. That was important. There was a reason for those being there, and then we finally identified what needed to be turned off for that. What I want to highlight here is that even though early on we were getting decent page load times at 700 milliseconds previously, all the efforts that we went to, to troubleshoot all the alerts, led to this phenomenal decrease in page load times. We went from 700 milliseconds on average or just below that down to 200 milliseconds.

That was over a 70% decrease in page load times. 71.4% to be exact. Needless to say, our client was thrilled. We not only solved this mystery with the alerts but we brought significant performance gains to their application just by using the information that we were getting from the APM. That is the story of how I became a true believer in APM because in all honesty, I don't know if we would have been able to solve that problem had it not been for the information that New Relic was giving us.

Let's pivot back to the nuts and bolts of APM. The big question is, how do we measure performance over time? The answer is Application Performance Metrics. Here are some key Application Performance Metrics. The top one being user satisfaction which we'll get into in just a second. It's measured by something that's called Apdex scores. Then we have average response time, error rates, application instances counts, request rate, application and server CPU, and application availability.

Let's start with this first key metric, the Apdex score. The Apdex score is an application performance index. I would like to spend a little bit of time exploring Apdex here because I think all things being equal, I personally feel that Apdex scores are the most revealing metric for the health of an application. Again from Wikipedia, Apdex is an open standard developed by an alliance of companies. It defines a standard method for reporting and comparing the performance of software applications in computing.

The purpose of Apdex is to convert measurements into insights about user satisfaction. This is done by specifying a uniform way to analyze and report on the degree to which measured performance meets user expectations. Here is a definition according to New Relic. Apdex is an industry standard to measure users' satisfaction with the response time of web applications and services.

It's basically a simplified service-level agreement solution, an SLA, that gives application owners better insight into how satisfied their users are. Apdex is a measure of response time against a set threshold. It measures the ratio of satisfactory response times to unsatisfactory response times. Response time here is measured from the time an asset is requested to completed delivery back to the requester.

When we talk about user satisfaction in the context of APM, we know that it's measured by an Apdex score. That has become the industry standard for tracking the relative performance of an application. It works by specifying a goal for how long a specific web transaction should take. Here's some information from a company called Stackify. They're a low- to mid-range, affordable APM solution, more in the .NET and Java space, but they had some good information there too.

They talk about web requests being bucketed into a few different categories: satisfied, tolerating, too slow, and failed. All that can be represented in a math formula wherein you would determine your Apdex score going from zero to one. Zero, obviously, being, or maybe not obvious, but zero is the worst possible score you can have, where a hundred percent of response times are frustrating for the end-user, and one is the best possible score, where a hundred percent of the response times are satisfied.

Here's a visual representation of the Apdex formula. The Apdex score is a ratio of satisfied requests plus tolerating requests over the total requests made, the total number. You'll notice that, and I think this is just a convention from the industry but satisfied requests are considered one, are counted as one whole thing while each tolerating request is considered half of one satisfied request.

For example, let's take a look at what that formula would give if you had, say, a sample of 100 requests with a target delivery time of, say, 3 seconds. Of 100 samples, we can say that 60 requests came in below the threshold of 3 seconds. Again, these thresholds are somewhat arbitrary; you would set them for your own application or for your own company. 60 requests came in and satisfied the page load times within 3 seconds.

Then say 30 requests came in with response times of, say, between 3 and 10 seconds, and so we're going to split that in half. Then maybe there's like 10 remaining of the 100 samples that didn't make the cut and were giving response times in excess of 10 or 12 seconds or wherever you want to make that mark. When we plug the numbers into the Apdex formula, we end up with an Apdex score of 0.75.
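[Editor's note] The worked example above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the function names and the 3-second and 10-second cut-offs are assumptions taken from the example in the talk, and each tolerating request counts as half a satisfied one, as described.

```python
def apdex_score(satisfied: int, tolerating: int, total: int) -> float:
    """Apdex = (satisfied + tolerating / 2) / total requests."""
    if total <= 0:
        raise ValueError("total must be a positive number of requests")
    return (satisfied + tolerating / 2) / total


def apdex_from_samples(times_s, target_s=3.0, tolerable_s=10.0):
    """Bucket raw response times (in seconds) and score them.

    target_s and tolerable_s are illustrative thresholds from the talk's
    example; each application would choose its own.
    """
    satisfied = sum(1 for t in times_s if t <= target_s)
    tolerating = sum(1 for t in times_s if target_s < t <= tolerable_s)
    # Anything slower than tolerable_s is frustrated and adds nothing.
    return apdex_score(satisfied, tolerating, len(times_s))


# The talk's example: 60 satisfied, 30 tolerating, 10 frustrated.
print(apdex_score(60, 30, 100))  # 0.75
```

Plugging in the numbers from the talk gives (60 + 30 / 2) / 100 = 0.75.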

To refer back to the case study that we just went through, we were getting those alerts because we had set a threshold at a certain Apdex value, and for a period of time, we kept getting alerts over and over again because our Apdex score was coming in lower than where we had set the threshold. Hopefully, Apdex score makes sense. That's, again, a cross-industry standard that almost every APM tool uses to determine where your applications-

Speaker: Where did you set the threshold at? 0.9?

Clare: I can't remember. On that application, it might have been somewhere between 0.7, 0.8, somewhere in there. Now, let's move on to the rest of the key application performance metrics. The second one in that list was average response time which is a very traditional metric that's defined as the amount of time an application takes to return a request to a user.

In theory, an application should be tested under many different circumstances, for example, the number of concurrent users, the number of transactions requested. Typically, this metric is measured from the start of a request to the time the last byte is sent. What this does is it allows us to view the performance of our application over time. This metric ultimately enables us to understand what's normal so that we can begin to determine what's abnormal for an application.

Say, for example, you were able to capture the average response time of key web service calls over a period of a couple of days or a couple of weeks. You could compare the current response times of those web service calls to their historical response times and raise an alert if the current response time is, say, I don't know, more than two standard deviations away from the historical mean.
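[Editor's note] A rough sketch of that comparison, using only Python's standard library. The function name, the sample data, and the default of two standard deviations are illustrative assumptions; a real APM tool would apply its own baselining logic.

```python
import statistics


def is_anomalous(history_ms, current_ms, num_stdevs=2.0):
    """Return True if current_ms is more than num_stdevs sample standard
    deviations away from the mean of the historical response times.

    history_ms needs at least two data points, otherwise
    statistics.stdev raises StatisticsError.
    """
    mean = statistics.mean(history_ms)
    stdev = statistics.stdev(history_ms)
    return abs(current_ms - mean) > num_stdevs * stdev


# Hypothetical historical averages for one web service call, in milliseconds.
history = [200, 210, 190, 205, 195]

print(is_anomalous(history, 210))  # within the normal range
print(is_anomalous(history, 230))  # far enough from the mean to alert on
```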

It's important to be cautious about this as a metric in terms of its accuracy because you can think of-- The way I like to visualize average response time is like in a bell curve. There are factors like geographic location of the user or the complexity of the information that's being requested, that can all affect the average response time. All these should be considered when you're evaluating or making an evaluation of application performance. It can be skewed by just a few very long response times.
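[Editor's note] To make the skew concrete, here is a small illustration with made-up numbers: 98 fast responses and 2 very slow ones. Comparing the mean against the median is one common way to spot this effect; the talk itself does not prescribe it.

```python
import statistics

# Hypothetical sample: 98 fast responses and 2 very slow ones (milliseconds).
response_times_ms = [200.0] * 98 + [15000.0] * 2

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)

# Two slow requests out of a hundred more than double the average,
# while the median still reflects the typical request.
print(f"mean:   {mean_ms:.0f} ms")    # mean:   496 ms
print(f"median: {median_ms:.0f} ms")  # median: 200 ms
```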

Moving on. Third is error rates. There are three different ways to track application errors. HTTP error percentage which is the number of web requests that end in an error. Logged exceptions which is the number of unhandled and logged errors from your application. Thrown exceptions which is the number of all exceptions that have been thrown. Application instances count. If your application scales up and down in the cloud, it's really important to know how many server and application instances you have running.

A lot of times, in cloud hosting solutions, you have auto-scaling enabled. That's to ensure that your application scales up to meet demand and then scales back down during off-peak times. This can create a couple unique monitoring challenges. If your application automatically scales up based on, say, CPU usage, you might never see your CPU get high. Instead, you would see the number of server instances get high and potentially increased hosting costs. It's important definitely to keep an eye on that especially for cloud applications.

Then there's request rate. Understanding how much traffic your application receives will impact the success of your application. Potentially, all other performance metrics are affected by increases and decreases in traffic, which makes sense. Request rates can be useful to correlate to other application performance metrics to understand the dynamics of how your application scales. Monitoring request rate can also be really good to watch for spikes or even inactivity. For example, if you have a busy API that suddenly gets no traffic, that could be a really bad sign that something is wrong and it's something to watch out for so that could be very revealing in that way.
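[Editor's note] A minimal sketch of watching request rate for spikes and inactivity. It assumes you already have per-minute request counts from somewhere; the function name, the labels, and the spike factor of 3 are hypothetical choices for illustration.

```python
import statistics


def classify_traffic(baseline_counts, current_count, spike_factor=3.0):
    """Compare the latest per-minute request count with a recent baseline.

    Returns "inactive" when a normally busy endpoint gets no traffic,
    "spike" when traffic exceeds spike_factor times the baseline mean,
    and "normal" otherwise.
    """
    baseline = statistics.mean(baseline_counts)
    if baseline > 0 and current_count == 0:
        return "inactive"
    if current_count > spike_factor * baseline:
        return "spike"
    return "normal"


recent = [100, 120, 110]  # hypothetical requests per minute
print(classify_traffic(recent, 0))    # inactive
print(classify_traffic(recent, 500))  # spike
print(classify_traffic(recent, 115))  # normal
```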

If your CPU usage on your server is extremely high, I guarantee you, you have a problem with your application performance. Monitoring the CPU usage of your server and applications is a very basic and critical metric. Virtually, all server and application monitoring tools can track your CPU usage and provide monitoring alerts. It's important to track them per server but also as an aggregate across all your individually deployed instances of your application.

Then lastly, application availability. You definitely want to monitor if your application is online and available and so it's a key metric to be tracked. Most companies use this as a way to measure uptime for their SLAs, their service level agreements. If you have a web application, the easiest way to monitor application availability is by a simple regularly scheduled HTTP check.
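[Editor's note] A simple HTTP check can be sketched with Python's standard library. The function name and the rule that any 2xx or 3xx status counts as "up" are assumptions for illustration; the `opener` parameter exists only so the check can be exercised without a network.

```python
import urllib.error
import urllib.request


def is_available(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx or 3xx status.

    urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses,
    and URLError or another OSError for connection failures and timeouts,
    so all of those count as "not available" here.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False
```

To make this "regularly scheduled", you would call `is_available("https://example.com/")` from something like a cron job and record or alert on the result; the URL here is a placeholder.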

Now that we've wrapped up the discussion on key application performance metrics, let's talk about what comprises the components of a complete Application Performance Management solution. What should an APM have? An APM solution should allow you to analyze the performance of individual web requests or transactions. It should enable you to see the performance and the usage of all application dependencies like databases, web services, caching, all that jazz. It should also let you see detailed transaction traces to see what your code is actually doing. Optimally, it should allow for code-level performance profiling. It should have basic server metrics like CPU, memory. It should have application framework metrics like performance counters and queues.

Custom application metrics, that's an important one. An APM solution should definitely allow dev teams or anyone actually from the business side, product owners, to create and customize metrics. Application log data: it should enable you to aggregate, search, and manage your logs. It should allow you to set up robust reporting and alerting for application errors. Ultimately, it should facilitate real-user monitoring to see what your users are experiencing in real-time.

Before we move on, I just wanted to take a moment to talk about custom metrics. There's typically three ways in which custom metrics might be applied. The first one, sum or average which can be used to count how often a certain event might happen. You could count the number of times an item is hitting an API, you could set up conditional metrics and count those. Then time, monitoring how long transactions or processes take, so you could track the processing of queued messages or calculate latencies. Then there's gauge. Gauge, for example, you could track concurrent operations or the current number of connections to something or how many jobs are executing concurrently.
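[Editor's note] The three kinds of custom metric described above (count, time, gauge) can be sketched as a tiny in-memory registry. This is not any APM vendor's API; the class, method, and metric names are hypothetical, and a real agent would ship these values to a backend rather than keep them in dictionaries.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class MetricsRegistry:
    """Minimal in-memory store for counters, timings, and gauges."""

    def __init__(self):
        self.counters = defaultdict(int)   # how often an event happened
        self.timings = defaultdict(list)   # how long operations took (seconds)
        self.gauges = {}                   # current value of something

    def increment(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the timed block raises.
            self.timings[name].append(time.perf_counter() - start)

    def set_gauge(self, name, value):
        self.gauges[name] = value


metrics = MetricsRegistry()
metrics.increment("api.hits")               # count: an item hit the API
with metrics.timer("queue.process"):        # time: processing a queued message
    pass
metrics.set_gauge("connections.open", 7)    # gauge: current connection count
```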

Good enterprise APM solutions should allow customers to create and apply custom metrics. One way we did this recently on a client site was to track deployments. We could see in real-time how deployments affected end-user response times. This was, or this is, enormously helpful just to see right away, like when you introduce new code, how it affects your application's performance. It's also a great way of knowing right away: as soon as you roll out a deployment and something goes haywire, you know that the problem is in the code in that release.

Let's segue into best practices for APM. Here's a shortlist that I consolidated from when I was researching this topic. Plan and configure alerts that work for you. Remember that monitoring tools only do what you tell them to do and all these solutions are only as good as how we make them. Good monitoring tools will allow for granular alerting which is often used for escalation alerting. This means that you can set up alerts and thresholds based on the number of failures for any particular metric.
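[Editor's note] One way to picture escalation alerting based on the number of failures is sketched below. The class name, the severity labels, and the thresholds of one and three consecutive failures are assumptions for illustration; monitoring tools expose this through their own configuration.

```python
class EscalatingAlert:
    """Escalate severity as consecutive failures of one check pile up."""

    def __init__(self, warn_after=1, page_after=3):
        self.warn_after = warn_after
        self.page_after = page_after
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> str:
        """Record one check result and return the current alert level."""
        if check_passed:
            self.consecutive_failures = 0  # any success resets the escalation
            return "ok"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.page_after:
            return "page"   # e.g. wake up whoever is on call
        if self.consecutive_failures >= self.warn_after:
            return "warn"   # e.g. post to a team channel
        return "ok"


alert = EscalatingAlert(warn_after=1, page_after=3)
results = [alert.record(ok) for ok in [False, False, False, True, False]]
print(results)  # ['warn', 'warn', 'page', 'ok', 'warn']
```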

Set your priorities, classify your systems based on importance. Not all systems are critical or not as critical as others, so identify the most important systems and be sure that their alerting is a bit more sensitive than the others. Never allow a single point of failure. This is, again, referring more to enterprise solutions, but on-premise solutions are a single point of failure, so who's going to monitor the monitoring solution?

If you have a cloud application or SaaS, software as a service, monitoring tools provide more than one location so you should be using more than one location. Know your audience, know what kind of media will get your attention if you're on a dev team. This is key to a successful monitoring solution. Monitoring tools that provide a wide range of alerting methods will ensure that when an alert comes in, someone in the chain will catch it and hear it; that's definitely important. Periodically verify and test your alerting and escalation protocols.

This got me once: make sure never to set up email filters for your alerts. That's definitely something you don't want to do. Very few systems have 100% up-time, so downtime is sometimes unavoidable. Keep an eye on your monitoring tool. If you don't receive an alert for a stretch, double-check that everything's still configured correctly. Create a process for how to handle alerts; that allows for the quickest resolution and holds everyone in the chain accountable. Ask for help. Good vendors have really good support and technical staff that are there to help their customers take advantage of their product knowledge and their experiences with other customers. Again, on the enterprise level, they'll even review your setup and give you advice so that you can preempt failure and preempt faulty setups.

Lastly, our favorite, document everything. You want to document how you've set up your monitoring tools and why, and you want to make sure that documentation is readily available and accessible to your dev team. Just as a tag onto that, some key considerations when you're looking at and choosing an APM solution. Obviously, programming language support: is your stack supported by whatever tool or vendor that you're looking at? Does it offer cloud support? Does it provide support for SaaS or your on-premise application? Pricing, obviously. Then, ease of use. Some of these tools can get really, really complicated, and hard to configure. What kind of interface are you going to be working with to configure all your metrics and set your thresholds?

When I started researching this topic, I was completely floored by the immeasurable variety of APM vendors, tools, and solutions that were out there. It turns out that since the first half of 2013, APM entered a period of intense competition of technology and strategy with a multiplicity of vendors and viewpoints. This caused a huge upheaval in the marketplace. In some sense, over time, APM's become a really diluted term and it evolved into this concept where application performance management across a lot of diverse computing platforms has become the norm rather than just a single marketplace.

The interesting thing about going down the rabbit hole of comparing APM vendors is the dizzying spectrum of options in terms of complexity and hence pricing. Here's a fraction of vendors on the enterprise level. One article I came across, I think listed a hundred vendors, products, and services. Every day when I was researching APMs, I would run into new ones all the time, the ones that I had never heard of before.

Then just to note, I filtered these down to ones that apply to PHP applications, so they should be able to handle any Drupal site. Historically, APM pricing has been really prohibitive, so much so that many development teams maybe until recently couldn't afford them. The top APM vendors are still really, really expensive, which leads us to the question, does it have to be? The good thing is that innovation and competition make it such that there is a pretty wide range of pricing for an equally wide range of options.

In fact, there even turns out to be open-source APM solutions that you can look into. Here's a short list of some pretty futuristic-sounding projects and they provide either a full or partial toolset that you can piece together for integrating a custom open-source APM solution, but of course that requires a dev team at your disposal or a developer that can bring that together.

As the industry has matured, it's no surprise that there are more and more affordable APM options coming onto the market. There's a lot of mid to low-range SaaS options that are continually popping up, as I mentioned before, that are actually getting more and more sophisticated and surprisingly moderately priced. The upside to the intense competition in this space is that nearly every APM vendor that I've looked at or studied, had free trial versions and free options to test-drive, which is a good way to narrow down the choices and try to find a couple of contenders for your business or your organization.

Ultimately, whatever you end up doing, you definitely have to do your research because there is a lot out there to comb through to find the right solution for your situation. Even though the premise of this talk is to talk about the relevance and importance and the promise of APM, maybe we need to take a step back a little bit and ask this more elemental question of, do you really need APM for your site?

My recommendation is that if you have a lot of custom code, you need APM. Consider some of the following scenarios: if your company develops custom IT application solutions from scratch, you need APM. Or, say, you have lots of systems that interact with a lot of other custom IT solutions; maybe your custom IT application is a major revenue generator or it's really important to your business processes; maybe your application is out of regular vendor service, some legacy system, and you need your own internal IT team to support it.

In all those use-cases, I would definitely lobby hard for getting some kind of APM in place. If none of these use-cases apply to you, then maybe you don't need a fully automated, sophisticated, complex, and expensive APM system.

If your site isn't complicated, like I said before, with a lot of custom code, you might be able to get away with just a lot of free online tools. Also, there's actually a lot of free versions that vendors offer with, obviously, the option to upgrade. A lot of these will just take your domain and spit back statistics and insights about your site. In theory, you could even just use Google Analytics to alert you about slower page speeds. Here's a short list of website and page load speed tests and tools. I was thinking, for kicks, we should just plug in the MidCamp URL and see what happens.

Speaker: Mean.

Clare: [laughs] What was that?

Speaker: That's mean.

Clare: Is it mean? No, I actually did it, it wasn't too bad. [laughs] I'll just do it since I was right here for a second. You probably can't see that. Is that up there? Here's one of the tools that was in that previous list, the PageSpeed Insights. If we plug in midcamp.org, let's see what happens. Oh, no, maybe it's choking. Oh, there we go. I thought PageSpeed Insights was great. I wasn't even super aware of all the tools that were out there.

Here we have mobile and desktop tabs. They offer some optimization suggestions. Anyway, I was really impressed with this. It'll make suggestions about how to optimize images and it even tells you what optimizations you've already done. Looking pretty good on mobile with a score of 78. Desktop could use a little work but there we have it with PageSpeed. I think I lost the rest of my slides, how to get in there. I think at this point I'll just open it up for Q&A. Please, yes.

Speaker: I ask this because I think it'll be beneficial to the audience. You sort of teased about, I won't mention the client's name, but the publication that was having a particular issue. You talked about it being a Drupal thing. You didn't really uncover what it was. I think the audience might be interested in what the actual issue is, if it's a core issue.

Clare: It is a core issue.

Speaker: If you have a lot of files on your site, you might fuck this up like we did.

Can you talk a little bit about that?

Clare: Sure, I'll say that. Like I said, this online content publisher, this was a client of ours, had a D7 site. It is. It's a core issue with D7 where the file system, if you have more than a certain number of files in any one directory in your file system-- I should backtrack this a little bit and say that this was hosted by Acquia, an Acquia cloud hosting solution. In their best practices docs, in Acquia's, is how I finally saw this.

If you have more than 2,500 files in any one folder in your file system, it causes performance problems. What we uncovered in our client's file system was that we had single folders that had 50,000 files in it, 60,000 files in it. That was what was going on. It wasn't our code per se or any of our custom code, it was the fact that none of the files were being distributed properly for all their assets.

Again, because they were producing tons of content, their editors were just uploading images, thousands of images every day. That was the issue with that. When we found the Memcache errors and some of the transaction profiling around that, that's when we started applying solutions to basically reduce all the counts in the files so that they were below 2,000 files per directory. That was a very humbling experience.
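
The check described here, finding directories that hold more files than the 2,500 figure mentioned in the talk, can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `find_crowded_dirs` and the demo directory are hypothetical, and it is not the tooling the team actually used.

```python
import os
import tempfile

# Threshold taken from the talk (2,500 files per directory); adjust as needed.
MAX_FILES_PER_DIR = 2500


def find_crowded_dirs(root, limit=MAX_FILES_PER_DIR):
    """Return (directory, file_count) pairs for directories with more than `limit` files."""
    crowded = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if len(filenames) > limit:
            crowded.append((dirpath, len(filenames)))
    # Largest offenders first.
    return sorted(crowded, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Demo on a throwaway directory with a tiny limit so it runs quickly.
    with tempfile.TemporaryDirectory() as root:
        for i in range(5):
            open(os.path.join(root, f"image-{i}.jpg"), "w").close()
        for path, count in find_crowded_dirs(root, limit=3):
            print(f"{count} files in {path}")
```

Pointed at a site's files directory, a scan like this would list the folders that need to be split up, largest first.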

Speaker: The core issue is that Drupal out of the box will not solve for this. If you add a file field, the default is just to throw them all in one folder, one directory.

Clare: Exactly.

Speaker: You need to configure that in Drupal 7 for a field, otherwise, this can happen. D8, I believe, has this solved out of the box.

Clare: No, D8 is totally done. Out of the box D8 resolves this for you but D7, I think the core issue is still growing.

Speaker: If you have a file type and you have thousands of pieces of content with a header image you're going to have more than 2500, right?

Speaker: Yes, you really need to use tokens or something to throw it in a directory based on the date or something, or a count of some kind, such that you cannot get past this limit. It's like, "The application is fine, the application is fine," and then you will reach this tipping point and then you start to see these terrible, absolutely atrocious response times.
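
The idea of spreading uploads across date-based directories can be illustrated outside Drupal with a small Python sketch. The helper name `dated_upload_path` and the `files/images` base directory are assumptions made for the example; in Drupal itself this would be handled through the file field's directory setting and tokens, not through code like this.

```python
from datetime import date
import posixpath


def dated_upload_path(filename, uploaded_on, base="files/images"):
    """Build a destination such as files/images/2018/03/10/photo.jpg.

    `base` is a hypothetical directory name used only for illustration.
    """
    return posixpath.join(
        base,
        f"{uploaded_on:%Y}",  # year, e.g. 2018
        f"{uploaded_on:%m}",  # zero-padded month
        f"{uploaded_on:%d}",  # zero-padded day
        filename,
    )


if __name__ == "__main__":
    print(dated_upload_path("photo.jpg", date(2018, 3, 10)))
```

With one directory per day, a folder only approaches the limit if several thousand files are uploaded on a single day.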

Speaker: In a case like that, do you see the performance issue is just like when someone's trying to get a list of files in a folder or even just trying to load a file?

Clare: I'm sorry, can you repeat that?

Speaker: I was just wondering what triggers the performance issue? Is it just when an admin wants to see a list of all the files in the folder or is it just try to load a file within a folder like that?

Clare: That's a good question. When we started getting those alerts, I think it was anytime an editor was in the admin interface-- One thing that they did on that particular use case was that they were bulk uploading assets. They had this bulk uploader where their editors would, I don't know, upload 10 big images at a time. Over time I think it just became an issue in terms of accessing the files directory and locating them and all that.

Speaker: It's a performance issue too, just because the OS has to look at such a large number. I believe that it actually ends up being an OS issue but, where a bad actor pulls a bad actor in that play.

Speaker: I've definitely seen a similar issue in some sites where, if they're using the IMCE editor to upload files or something.

Clare: That starts choking.

Speaker: They're like, "It's timing out just trying to load the list of the files."

[crosstalk]

Clare: For sure. Yes?

Speaker: You mentioned about the Apdex score. You had it set between 0.7 and 0.8. Is there an industry standard in terms of what is ideal? I guess closer to one is the best point. Is there something that--

Clare: We didn't actually end up tweaking the threshold. New Relic comes with a lot of defaults. I think, probably a lot of these enterprise solutions come with defaults that are set. Again, I can't remember it off the top of my head. It was somewhere between 0.7 and 0.8 was the threshold. You can manually adjust that, you can fine-tune your application and say, you really want it to be high, you want to do it as close to one as possible. I think it just depends on the client or your organization. How you want to set it.
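
For reference, the Apdex score is computed as (satisfied + tolerating / 2) / total, where a request is "satisfied" if it finishes within a target time T and "tolerating" if it finishes within 4T. A minimal Python sketch follows; the sample response times and the 0.5-second target are made up for illustration and are not taken from the project discussed in the talk.

```python
def apdex(response_times, threshold):
    """Compute an Apdex score in [0, 1] for response times in seconds.

    satisfied:  t <= threshold
    tolerating: threshold < t <= 4 * threshold
    frustrated: everything slower (counts as zero)
    """
    if not response_times:
        raise ValueError("need at least one response time")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


if __name__ == "__main__":
    # Illustrative sample: two fast requests, two tolerable, one slow.
    samples = [0.2, 0.4, 0.9, 1.5, 3.0]
    print(apdex(samples, threshold=0.5))
```

An alert threshold such as the 0.7 to 0.8 range mentioned above would then be compared against this score.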

Speaker: I would say generally that if you're running a Drupal site and you've got any amount of content at all that you want to run [unintelligible 00:47:20], because Drupal fights nature, especially if you have logged-in users. It's going to bypass the cache and that's going to eat up a lot of server resources. Because I admin the server myself, I found that using New Relic-- I used New Relic before recently, but using New Relic was really important because otherwise, the server would be slow. It could take 10, 15 seconds to load a page and I had no idea why. That actually gave me some ideas about what to do.

One of the things I discovered too which is if you actually are [unintelligible 00:48:00] through server administration is you need to also look at your logs because most of your server traffic these days is bots, it's not request. Bots can really cause big issues. Just a thousand logging on your site and how much you get bit. Having any kind of APM [unintelligible 00:48:21] I think, I won't say it's a requirement but unless you've only got a few pages up, you need to deal with some insight of what's really going on because Drupal will not tell you from its logs. You can see errors, which is great, it's great to be able to see errors but overall Drupal logging is not going to give you a clue as to why it takes five seconds to load a page.
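
A rough way to see how much traffic comes from bots is to look at the user-agent field in the web server's access log. The Python sketch below assumes the common "combined" log format, where the user agent is the last quoted field; the marker list and the sample log lines are invented for illustration, and matching on substrings like this will miss bots that do not identify themselves.

```python
import re

# Substrings that self-identifying crawlers commonly include (an assumption, not a complete list).
BOT_MARKERS = ("bot", "crawler", "spider")

# In the combined log format a line ends with: "referer" "user-agent"
UA_PATTERN = re.compile(r'"[^"]*" "([^"]*)"\s*$')


def bot_share(lines):
    """Return the fraction of parseable log lines whose user agent looks like a bot."""
    total = 0
    bots = 0
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # skip lines that are not in the expected format
        total += 1
        user_agent = match.group(1).lower()
        if any(marker in user_agent for marker in BOT_MARKERS):
            bots += 1
    return bots / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample lines for demonstration only.
    sample = [
        '203.0.113.5 - - [10/Mar/2018:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
        '203.0.113.9 - - [10/Mar/2018:10:00:01 +0000] "GET /about HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    ]
    print(f"{bot_share(sample):.0%} of requests look like bots")
```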

Clare: You're preaching to the choir. I totally agree with you. Again, I think what's the issue for a lot of operations is cost. I think New Relic is one of the most expensive options out there on the market. [crosstalk] Yes, I know, but again, I think the competition is such that there are a gazillion options out there and it does take time to research them and try them. Even if you download 10 trial versions, you've got to spend the time working with them, so it's not easy. It's definitely a tough thing.

Speaker: New Relic has a library of graphs and charts and data that looks like it would be really useful. I've gone in there, there are times when it's like, "This looks like it probably represents a problem, but how do I find out if this really is a problem that I should spend time on," versus, "There was a change and I shouldn't have to be concerned about it." Is there a good resource for finding out, learning more about what's important and the not important [unintelligible 00:49:58]?

Clare: Again, one thing that helped us when we were working with New Relic, in particular, was again tying it to deployments, to be able to see when we rolled out new code or when we put out a release, we could see the results right away if something was going haywire or something was going wrong. At that moment, we'd be like, "Okay, revert that. Get that out of there." It's tricky. I think it's sort of getting to know the tools and getting to use it--

Even in the time that I spent on that project, I felt like I was barely touching the surface of what New Relic could do. Their support is great and so I would definitely ask the vendors and particularly if it's New Relic and some kind of other enterprise solution. Ask their support, what else can I do to figure out what's going on? For us, tying deployments and tracking them on the graph was enormously helpful in pinpointing problems.

You can actually do that from your version control. You can do that from GitHub, tie-- There's a couple of ways you can integrate that, but you can do it within the New Relic interface or you can do it every time you kick off a release from GitHub or Bitbucket or whatever SVN you're using. New Relic will capture that and mark it on that graph and show you, "Oh, you just deployed at this time," and then see the graph accordingly of how performance responds. Anything else?
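
The underlying idea, comparing performance before and after a recorded deployment time, can be sketched without any vendor tooling. The Python below is a generic illustration and does not use New Relic's API; the function name `compare_around_deploy`, the timestamps, and the response times are all hypothetical.

```python
from statistics import mean


def compare_around_deploy(samples, deployed_at):
    """Return (mean_before, mean_after) response times around a deployment.

    `samples` is a list of (timestamp, response_time_ms) pairs;
    `deployed_at` uses the same timestamp units as the samples.
    """
    before = [ms for ts, ms in samples if ts < deployed_at]
    after = [ms for ts, ms in samples if ts >= deployed_at]
    if not before or not after:
        raise ValueError("need samples on both sides of the deployment")
    return mean(before), mean(after)


if __name__ == "__main__":
    # Made-up measurements: response times jump after the deploy at t=3.
    samples = [(1, 100), (2, 120), (3, 300), (4, 320)]
    before, after = compare_around_deploy(samples, deployed_at=3)
    print(f"before: {before} ms, after: {after} ms")
```

A large jump in the "after" number right at a deployment marker is the kind of signal that would prompt a revert.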

[applause]

[00:51:39] [END OF AUDIO]
