«

»

Aug 09

SOLVED: “There have been <N> misaligned log IOs which required falling back to synchronous IO. The current IO is on file <logfilename>.” – Trace Flag 1800

Do you have an Availability Group set up that is seeing some strange latencies or behaviors? Have you looked at the error logs on your replicas and seen this?

There have been <N> misaligned log IOs which required falling back to synchronous IO. The current IO is on file <LogFileName>.

What the heck does this mean?

Generally, it means something is different between your primary and your replica(s) in terms of the disks that the t-logs are on. The replica that is showing this error in the log is the one that is having trouble. You should see this demonstrated in a larger Log Send Queue. So what’s the problem? If I google the error, I get a couple good hits.

TF1800_1

I’m not going to include the individual links in this post… Let me google that for you.

The first link takes you to a blog “Running SAP Applications on the Microsoft Platform”, but don’t let that deter you! There is a ton of important information there. What you’ll find from any of these is that there are different sector sizes between different vendors and types of disks. Let me start by saying that I am not a storage expert. I may say things that are technically incorrect from this point forward and I will update as I get corrected. Nevertheless, it makes sense to me and my fellow DBAs.

You have 3 basic configurations. For the sake of this post, I’m going to abbreviate: Bytes Per Sector(LBPS because this is really logical) and Bytes Per Physical Sector(BPPS):

1. LBPS = 512 and BPPS = 512: 512-byte native
2. LBPS = 512 and BPPS = 4096: 512E
3. LBPS = 4096 and BPPS = 4096: 4K native

To put this simply:
#1 used to be the norm
#3 is the new ‘norm’ (not necessarily more performant… Thanks @DBArgenis)
#2 was designed to take the new norm and present it to systems that can’t handle the new norm

You can see how your disk is configured by typing the following into an elevated command prompt “fsutil fsinfo ntfsinfo “. You will get something like this:

TF1800_2

All of these articles tell you that when SQL Server starts up, it checks to see what the “sector size” is and writes accordingly. If we have a primary with configuration #1 and a replica with configuration #3, there is clearly a mismatch and something needs to be done. However, what if our primary is configuration #1 and our replica is configuration #2 (as was my case)? One would think that SQL Server starts up and recognizes 512-byte sector for both of these and that everything is fine. What these articles DO NOT tell you, is that the LBPS don’t matter in this case… Microsoft repeats “sector size” on its KB article 3009974 and confused the heck out of me. They should have said Bytes per Physical Sector. Even though it treats it as 512-byte in configuration #2, it is still writing to 4K sectors and leaving them misaligned! That’s why you are seeing the error. So how do you fix it?

The first is to get disks that match the BPPS. It is my understanding that if they are spinning disks, you cannot reconfigure them. This is how they are made… Sorry. However, if you are using local SSDs or an all flash array, you “might” be able to change the way that these drives are configured. It depends on the vendor.

If you cannot do the above, Trace Flag 1800 is here to save the day! This is a startup trace flag, it is not something you can just enable. You have to add it as a startup trace flag through SQL Server Configuration Manager and restart the SQL service. Basically, this trace flag overrides the detection and forces it to write in 4K native format whether the BPPS is 512 or 4K. Personally, I recommend to enable this on ALL replicas. (NOTE: THIS AFFECTS ALL DATABASES ON THE SERVER! If you have strange configurations… be sure that this is understood.) I started by enabling it on my remote asynchronous replica and restarting it. Then my synchronous secondary and restarting. Finally, my primary after failing over. You will know that it worked because you will see this entry in your error log on startup:

There have been X misaligned log IOs which required falling back to synchronous IO. The current IO is on file ….

Following the above procedure fixed our latency issues as well as those pesky error messages in the log. That being said, if you are setting up a new set of servers, take the time to check the disk configurations. This situation has lead me to add another step to my pre-build checklist.

EDIT on 10/15/2016: Thanks @DBArgenis for giving me some education and mentioning some things should be mentioned as important notes: “4Kn isn’t necessarily better for perf, and once you enable TF1800 it affects every single database on the instance. If you have a mix of AGs/Mirrors/LS from 512 and 4Kn things get messy.”

References:
Message misaligned log IOs which required falling back to synchronous IO in SQL Server Error Log
SQL Server–Storage Spaces/VHDx and 4K Sector Size
FIX: Slow synchronization when disks have different sector sizes for primary and secondary replica log files in SQL Server AG and Logshipping environments
@DBArgenis educated me that 4k-Native doesn’t necessarily mean better performance and once you enable TF1800 it affects every single database on the instance. If you have a mix of AGs/Mirrors/LS from 512 and 4Kn things get messy.

Leave a Reply