HPE Storage Users Group

A Storage Administrator Community




 Post subject: iscsi performance
PostPosted: Tue Dec 09, 2014 9:46 pm 

Joined: Tue Dec 09, 2014 9:26 pm
Posts: 2
Hi,
I have a new 4-node 7450 (3.2.1 MU1) connected to 32 blades running vSphere 5.5 U2 over iSCSI via 4x 6120XG switches. Each vSphere host has 8 paths to each datastore, round robin with iops=1.
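
For reference, this is the usual per-device esxcli round-robin setting (the naa device ID below is only a placeholder):

Code:
# Check the current path selection policy for the datastore device (placeholder naa ID)
esxcli storage nmp device list -d naa.60002ac0000000000000000000000001

# Use round robin and switch paths after every I/O instead of the default 1000 IOPS
esxcli storage nmp device set -d naa.60002ac0000000000000000000000001 --psp=VMW_PSP_RR
esxcli storage nmp psp roundrobin deviceconfig set -d naa.60002ac0000000000000000000000001 --type=iops --iops=1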

When I migrate VMs from our P4900 to the new 3PAR (1 TB thin VV), the average I/O latency inside the Linux guest increases by quite some margin.

I've spent the last two weeks debugging and have run out of ideas (besides ordering FC hardware), so perhaps you guys can help me.

Benchmarks with fio also show the increased latency for single I/O requests:

Code:
root@3par:/tmp# fio --rw=randwrite --refill_buffers --name=test --size=100M --direct=1 --bs=4k --ioengine=libaio --iodepth=1
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 1 process
Jobs: 1 (f=1): [w] [100.0% done] [0K/3136K /s] [0 /784  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=5096
  write: io=102400KB, bw=3151.7KB/s, iops=787 , runt= 32497msec
    slat (usec): min=28 , max=2825 , avg=43.49, stdev=28.71
    clat (usec): min=967 , max=6845 , avg=1219.50, stdev=156.39
     lat (usec): min=1004 , max=6892 , avg=1263.63, stdev=160.23
    clat percentiles (usec):
     |  1.00th=[ 1012],  5.00th=[ 1048], 10.00th=[ 1064], 20.00th=[ 1112],
     | 30.00th=[ 1160], 40.00th=[ 1192], 50.00th=[ 1224], 60.00th=[ 1240],
     | 70.00th=[ 1272], 80.00th=[ 1288], 90.00th=[ 1336], 95.00th=[ 1384],
     | 99.00th=[ 1608], 99.50th=[ 1816], 99.90th=[ 3184], 99.95th=[ 3376],
     | 99.99th=[ 5792]
    bw (KB/s)  : min= 3009, max= 3272, per=100.00%, avg=3153.97, stdev=51.20
    lat (usec) : 1000=0.42%
    lat (msec) : 2=99.21%, 4=0.34%, 10=0.02%
  cpu          : usr=0.82%, sys=3.95%, ctx=25627, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=25600/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=3151KB/s, minb=3151KB/s, maxb=3151KB/s, mint=32497msec, maxt=32497msec

Disk stats (read/write):
    dm-0: ios=0/26098, merge=0/0, ticks=0/49716, in_queue=49716, util=92.15%, aggrios=0/25652, aggrmerge=0/504, aggrticks=0/33932, aggrin_queue=33888, aggrutil=91.79%
  sda: ios=0/25652, merge=0/504, ticks=0/33932, in_queue=33888, util=91.79%


Code:
root@p4900:/tmp# fio --rw=randwrite --refill_buffers --name=test --size=100M --direct=1 --bs=4k --ioengine=libaio --iodepth=1
test: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 1 process
test: Laying out IO file(s) (1 file(s) / 100MB)
Jobs: 1 (f=1): [w] [100.0% done] [0K/7892K /s] [0 /1973  iops] [eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=3193
  write: io=102400KB, bw=7464.2KB/s, iops=1866 , runt= 13719msec
    slat (usec): min=21 , max=1721 , avg=29.71, stdev=14.23
    clat (usec): min=359 , max=22106 , avg=502.07, stdev=222.14
     lat (usec): min=388 , max=22138 , avg=532.20, stdev=222.72
    clat percentiles (usec):
     |  1.00th=[  386],  5.00th=[  402], 10.00th=[  410], 20.00th=[  426],
     | 30.00th=[  438], 40.00th=[  454], 50.00th=[  470], 60.00th=[  490],
     | 70.00th=[  516], 80.00th=[  548], 90.00th=[  596], 95.00th=[  660],
     | 99.00th=[ 1032], 99.50th=[ 1192], 99.90th=[ 2672], 99.95th=[ 4192],
     | 99.99th=[ 8032]
    bw (KB/s)  : min= 6784, max= 8008, per=100.00%, avg=7464.89, stdev=339.23
    lat (usec) : 500=64.13%, 750=32.82%, 1000=1.85%
    lat (msec) : 2=1.04%, 4=0.10%, 10=0.05%, 50=0.01%
  cpu          : usr=2.01%, sys=5.86%, ctx=25635, majf=0, minf=20
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=25600/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=7464KB/s, minb=7464KB/s, maxb=7464KB/s, mint=13719msec, maxt=13719msec

Disk stats (read/write):
    dm-0: ios=0/25647, merge=0/0, ticks=0/12712, in_queue=12712, util=87.10%, aggrios=0/25617, aggrmerge=0/166, aggrticks=0/12748, aggrin_queue=12736, aggrutil=86.71%
  sda: ios=0/25617, merge=0/166, ticks=0/12748, in_queue=12736, util=86.71%


 Post subject: Re: iscsi performance
PostPosted: Tue Dec 09, 2014 11:18 pm 

Joined: Tue May 07, 2013 1:45 pm
Posts: 216
Is that during the move, or after the VM is done moving?


 Post subject: Re: iscsi performance
PostPosted: Tue Dec 09, 2014 11:22 pm 

Joined: Tue Dec 09, 2014 9:26 pm
Posts: 2
This benchmark is from one old and one new test VM.


 Post subject: Re: iscsi performance
PostPosted: Wed Dec 10, 2014 1:51 pm 

Joined: Wed Nov 19, 2014 5:14 am
Posts: 505
Since you have the front-end host view, I would start by looking at the back-end storage view, e.g. see what the 3PAR VLUN is doing in the IMC under Reporting > Charts. You'll probably find it's sat idle waiting for data, with the occasional spike.
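
If you prefer the CLI to the IMC, you can get a similar per-VLUN view with statvlun; something along these lines (option names from memory, so check the CLI reference for your InForm OS version; the hostname is a placeholder):

Code:
# Per-VLUN IOPS and service times for one ESX host, non-idle VLUNs only,
# reads and writes reported separately, refreshed for 5 iterations
statvlun -ni -rw -host esx-host01 -iter 5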

3PAR uses interrupt coalescing on the controller's host-facing HBAs to reduce CPU load in a multi-tenant environment, so if you only have a single-threaded app, or you only test with a very low queue depth in a benchmark, you'll see higher latencies. The reason is that with write coalescing the HBA holds these I/Os until its buffer fills before sending an interrupt to the controller CPU to process them, so you incur a wait state.

If you really do have a single-threaded app, then turn off "intcoal" on the HBA port and it will issue an interrupt for every I/O posted; you will probably also want to adjust the host HBA queue depth.
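
For illustration, that is a per-port setting in the 3PAR CLI; something like the following (syntax from memory, so verify against the CLI reference for your InForm OS version):

Code:
# Show current port parameters, including the IntCoal setting
showport -par

# Disable interrupt coalescing on port node:slot:port, e.g. 0:2:1
controlport intcoal disable 0:2:1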

If you don't have a single-threaded app, then you would be better off testing with a much higher queue depth to simulate multiple hosts or multi-threaded apps on the same HBA port. That in turn fills the buffer quickly and gets the system moving, so you won't have to wait before the interrupt kicks in. Never test with a low queue depth, as you just aren't stressing the system.
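
In fio terms that simply means raising iodepth well above 1, for example (32 is just an arbitrary deeper queue, not a recommendation for your setup):

Code:
# Same 4k random-write test as above, but with 32 outstanding I/Os instead of 1
fio --rw=randwrite --refill_buffers --name=test --size=100M --direct=1 --bs=4k --ioengine=libaio --iodepth=32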

See this post viewtopic.php?f=18&t=883&p=4246&hilit=interrupt#p4246


 Post subject: Re: iscsi performance
PostPosted: Wed Dec 10, 2014 6:25 pm 

Joined: Tue May 07, 2013 1:45 pm
Posts: 216
Oh, and a 100 MB test file isn't going to tell you anything; you need to be several times the size of the storage cache to get any value out of a synthetic benchmark.
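
In fio terms that means bumping --size well past the array's write cache, ideally together with a deeper queue; for example (64G is an arbitrary placeholder, not the actual cache size):

Code:
# Larger working set plus a deeper queue, so the test isn't served entirely from cache
fio --rw=randwrite --refill_buffers --name=test --size=64G --direct=1 --bs=4k --ioengine=libaio --iodepth=32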


 Post subject: Re: iscsi performance
PostPosted: Wed Dec 10, 2014 8:07 pm 

Joined: Wed Oct 30, 2013 2:30 pm
Posts: 242
JohnMH wrote:
3PAR uses interrupt coalescing on the controller's host-facing HBAs to reduce CPU load in a multi-tenant environment, so if you only have a single-threaded app, or you only test with a very low queue depth in a benchmark, you'll see higher latencies. [...]


+100 to what John and afidel said. I see this particular issue coming up relatively often here. These systems are designed to function under load in a multi-threaded, multi-tenant environment. If your benchmarking isn't stressing the system, the performance numbers will be lackluster.

It's a little counterintuitive, because conventional wisdom would say lower load = higher performance. But in the case of extremely small workloads that is not necessarily true. A workload that isn't big enough to hit the buffer queues, write cache, etc. won't get very high performance numbers on a system that is otherwise idle.

