VMware & ATS

Discussion in 'Automated Trading' started by WinstonTJ, Feb 14, 2012.

  1. Anyone running or using ESXi 5.0?

    Do you have two machines? If so could you ping between the two?

    I'm running at <0.5 ms but need to cut that in half, quickly.

    0.370 ms
    0.353 ms
    0.404 ms
    0.447 ms
    0.365 ms
    0.404 ms
    0.446 ms
    0.361 ms
    0.402 ms
    0.443 ms

    Times - Hardware might be a better place to post this. Mods, feel free to dump it wherever.

    Looking for ANYONE who might have a VM on ESXi 5.0 (Enterprise Plus) who has ping times less than 0.25 ms between machines (same HW or HW to HW).

  2. Can you describe your hardware?

    I am more a Hyper-V type, but here we go - the same applies to VMware to my knowledge.

    It may depend on the network card. There are special cards / chips by Intel that have hardware queues. Basically the incoming Ethernet packets are not handled by the Hyper-V switch but are pushed into a queue based on target MAC address, and the driver reads them out from inside the VM. This takes nearly all packet processing away from the hypervisor (except the driver configuring the card) and significantly improves performance under network load.
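    A toy sketch of the queueing idea described above, purely for illustration. The class and method names (`ToyQueueingNic`, `register_vm`, `receive`, `drain`) and the MAC addresses are made up; they are not an Intel driver or Hyper-V API, and real cards do this sorting in hardware.

```python
from collections import defaultdict, deque


class ToyQueueingNic:
    """Toy model of a NIC that sorts frames into per-VM queues by destination MAC."""

    def __init__(self):
        self.vm_queues = defaultdict(deque)  # registered MAC -> frames for that VM
        self.default_queue = deque()         # everything else goes to the software switch
        self.registered = set()

    def register_vm(self, mac):
        """Hypervisor-side setup: dedicate a queue to this VM's MAC address."""
        self.registered.add(mac.lower())

    def receive(self, dst_mac, payload):
        """Return True if the frame bypassed the software switch."""
        mac = dst_mac.lower()
        if mac in self.registered:
            self.vm_queues[mac].append(payload)
            return True
        self.default_queue.append((mac, payload))
        return False

    def drain(self, mac):
        """Guest-driver side: read out everything queued for this VM."""
        queue = self.vm_queues[mac.lower()]
        frames = list(queue)
        queue.clear()
        return frames


nic = ToyQueueingNic()
nic.register_vm("00:50:56:aa:bb:01")  # hypothetical MAC address
fast_path = nic.receive("00:50:56:AA:BB:01", b"tick")   # registered VM
slow_path = nic.receive("00:50:56:aa:bb:02", b"other")  # unknown MAC
frames = nic.drain("00:50:56:aa:bb:01")
```

    In this model only frames for unregistered MACs end up in `default_queue`, which stands in for the software switch path.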

    THAT SAID: if you really need to get extremely fast, then basically skip the hypervisor. You always introduce latency of one sort or another and you do not have deterministic behavior. Modern hypervisors outside the mainframe so far do not allow locking cores to VMs, so you always run the risk of having to wait for a time slice, even with higher priority than the other processes. If all cores are busy, it is a small delay.

    Plus you have additional processing - and I am quite sure super fast trading is not on the radar of companies like VMware or Microsoft when optimizing their hypervisors.

    But your best first step would be a hardware review.
  3. Running into latency issues in a vDC environment:

    Using VMware via vCloud director:
    Testing Excel Market Data Feed Processor on various VM configurations:

    #1: Single core, 3 GHz, 2 GB RAM
    #2: Dual core, 3 GHz, 2 GB RAM
    #3: Quad core, 3 GHz, 4 GB RAM
    #4: 8 core, 3 GHz, 16 GB RAM

    All VMs are running the same Excel app / market feed.
    Measuring precision (data cycle processing interval)... Really just the frequency at which Excel can process new market data.

    Our development server is a quad-core Xeon 2.4 GHz with 2 GB allocated, no VMware, and our processing precision is <10 ms.

    The VMware environment allocates many more resources, but Excel 2010 with multithreading enabled for all cores runs much slower, by a factor of 5 to 10 compared to the dev server. I don't understand why the dual-core VM outperforms the quad and 8-core VMs. I was expecting sub-10 ms precision. Stumped and looking for suggestions and ideas on how to further tweak performance.

    #1: 179 ms
    #2: 55.65 ms
    #3: 82.8 ms
    #4: 94.56ms
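
    For reference, a quick sketch of how these four results compare against the dev server. The 10 ms baseline is only the upper bound quoted for the dev server ("<10ms"), so the real slowdown factors are at least as large as the ones computed here. The dictionary labels and the `slowdown` helper are made up for this example.

```python
# Upper bound quoted for the bare-metal dev server ("<10ms"); an assumption,
# since the exact figure is not given.
DEV_SERVER_MS = 10.0

# Processing intervals reported for the four VM configurations.
vm_results_ms = {
    "#1 single core": 179.0,
    "#2 dual core": 55.65,
    "#3 quad core": 82.8,
    "#4 8 core": 94.56,
}


def slowdown(vm_ms, baseline_ms=DEV_SERVER_MS):
    """How many times slower a VM result is than the baseline."""
    return vm_ms / baseline_ms


factors = {name: slowdown(ms) for name, ms in vm_results_ms.items()}
fastest = min(vm_results_ms, key=vm_results_ms.get)

for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x slower than the dev server bound")
print(f"fastest VM: {fastest}")
```

    Against that 10 ms bound the dual-core VM is about 5.6x slower and the single-core VM about 17.9x slower, so only configurations #2 to #4 fall in the 5 to 10 range.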


    I suspect the issue is specific to VMware and hypervisor settings. Unfortunately, visibility of these settings is limited to our service provider. If VMware is allocating cores from different physical machines the delay may be related to marshalling. I am obtaining performance increases in non-VMware environments by adding more cores and RAM.

    This particular Excel app is set to use all available cores and iterative calcs are disabled. We do use VBA and understand its limits regarding multithreading. VBA provides the iPhone-style modeless form and scoreboard for stats.

    We do use volatile microtimers and a proprietary DDE pump to coerce updates much faster than Excel's 100ms static schedule. We're sustaining 10ms precision on 2500 instruments processing price feeds, order triggers and writing records out to SQL DB.

    Most of the issues we are working through are just normal dev-to-production issues. However, the performance decrease was not expected at all and kind of kills the traditional "throw more hardware at it" band-aid.

    I think we may test the application on VMware running on AMD processors to verify or eliminate any Xeon front-side bus contention issue. http://blogs.amd.com/work/2010/03/30/intel-hyper-threading-vs-amd-true-core-scalability/

    Attached is a whitepaper: VMW-Tuning-Latency-Sensitive-Workloads
  4. > If VMware is allocating cores from different physical machines

    Except that VMware cannot have multi-machine VMs. Very few can (I know of two) and they are very special, so no.

    Can it be the old VMware bug of needing ALL cores available? This was fixed a long time ago - not sure how current your version is.

    Basically this one needed 8 cores available for an 8-core VM, which means the more cores a VM had, the longer it waited for a time slice. Hyper-V and newer VMware versions schedule every core separately.
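
    A rough toy model of why that behavior would hurt wide VMs. It assumes (unrealistically) that each physical core is idle independently with the same probability, and it is not how any VMware scheduler is actually implemented. `HOST_CORES`, `P_IDLE` and the function name are hypothetical values picked for the example.

```python
from math import comb


def prob_at_least_idle(host_cores, needed, p_idle):
    """P(at least `needed` of `host_cores` cores are idle at the same instant),
    assuming each core is idle independently with probability p_idle."""
    return sum(
        comb(host_cores, k) * p_idle**k * (1 - p_idle) ** (host_cores - k)
        for k in range(needed, host_cores + 1)
    )


HOST_CORES = 8  # hypothetical host size
P_IDLE = 0.5    # hypothetical chance that any one core is free

chances = {}
for vcpus in (1, 2, 4, 8):
    p = prob_at_least_idle(HOST_CORES, vcpus, P_IDLE)
    chances[vcpus] = p
    # With a per-attempt success chance p, the mean number of attempts is 1/p.
    print(f"{vcpus} vCPUs: chance {p:.4f}, about {1 / p:.1f} attempts on average")
```

    Under these assumptions an 8 vCPU VM on an 8-core host finds all cores free only 1 time in 256, while a scheduler that places each vCPU separately only ever needs one free core at a time.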

    5 to 10 times slower is a NO - this should not happen unless the physical platform is overloaded (CPU maxes out, memory bandwidth maxes out, so the VM runs into switching problems). The overhead normally is below 5%. That said, the machines look quite pathetic in my eyes - I am using 6-8 core machines at the moment with 16-64 GB memory. I do NOT like my physical layer to run into problems.
  5. GTS


    I'm running ESXi 5.0 and my ping time from one VM to another on the same VLAN is less than 0.25 ms. Not sure if that is what you meant by same HW or not:

    Attachment: ping.jpg (71.6 KB)
  6. We found that for lowest latency, VMs should not be allocated more CPUs than are physically in the server, i.e. on a Xeon 6 CPU x 4 core server, VMs should not be allocated more than 6 CPUs, otherwise performance drops off significantly.

  7. Wow! I wasn't expecting responses like this - this is great! Thanks so much... Hopefully this turns into a great thread for all involved.

    The reason I ask is that I run some Intel Atom (Supermicro 2U Twin^3) servers and they are in pretty high demand. These are fine to keep running as OS-on-bare-metal solutions but not that great for redundancy. I'm looking to get fast enough that I can tell someone there is minimal difference between running the OS on bare metal vs. on a hypervisor.

    Production hardware is all Supermicro stuff. I have two "MicroCloud" machines (3U, 8-Node) ( http://www.supermicro.com/products/nfo/MicroCloud.cfm ) and I have a few various 2U Twin solutions (2U 4 Nodes, 2x PSU) that run either Xeon 5400 or 5500 or 5600 CPUs.

    I also have two 2U Twin^3 servers ( http://www.supermicro.com/products/nfo/2UTwin3.cfm ) with Atom CPUs (2U, 8-Node). These do not run VMware and are used exclusively for bare-metal installs of some Linux/Unix flavor server OS to run an ATS only (just execution, no development or testing). They have Atom D525 CPUs (1.8 GHz, overclocked to ~2.1 GHz) and 8 GB RAM. They are dual-core with HT, so 4 threads (like an i3). In my opinion a perfect platform for an execution-only system that needs ultra-low latency. They all run installs of Ubuntu Server LTS - though one guy has some other flavor (Debian I think).

    NICs are all Intel Pro 1000 server-grade (PT model) NICs. They all have plenty of onboard buffering.

    By Hyper-V do you mean Xen or Citrix XenServer vs. VMware's ESXi? At this point we all run some flavor of Xen hypervisor; Citrix and VMware have just made it idiot-proof whereas Xen.org has kept it CLI and raw.

    For the most part I'm running Intel Pro 1000 PT quad NICs. I have a couple of Intel 10G dual-port NICs. Nothing crazy like Fibre Channel, but I'd put myself in the "no expense spared" category vs. "money is no object".

    Also, with regards to NICs, these are "running raw", meaning no firewall and only using the vSwitches. I've actually noticed it's slower to assign each VM dedicated hardware (or a dedicated vSwitch per port on a quad NIC) than it is to just let the HV have the whole NIC and put them all on a single or separate vSwitch.

    I'm not over-booking (allocating more resources than you have available like cores, RAM, GPU, etc.). I rent cheap VMs out for guys to use as a 24/7/365 internet (or we joke and call it a porn VM) machine and I overbook those - but trading machines are never overbooked.

    I mentioned the Atom servers above so I get it and agree 100% that running on bare metal is faster - I just want to improve my HV speeds.

    All of the CPUs are 3.0 GHz or higher (so an X5680 or a W5580 vs. something in the 2.2 or 2.5 GHz range). Nothing is overclocked (except the Atom boxes) and just about everything has max RAM as allowed by the BIOS.

    My test boxes are the same (Dell T5500 workstations, because they are quiet for the home & office): Xeon 5500 or 5600 CPUs, same NICs, etc. Same setup, just not racked. I've never had a VM or configuration that worked on the test boxes and then had issues on the production machines.

    I don't think it's a hardware issue - but I could be wrong. All of the hardware is very new.

    PocketChange - are you overbooking? I have all of my test boxes being used right now on this issue (and I'm building 15 workstations for a hedge fund this week) but send me a PM if you want - I'm happy to set up a test machine based on Xeon x5680 or 5690 CPUs if you want to test on something else.

    I don't think you should overbook given your resources. There were a lot of changes in Excel between 2003, 2007 & 2010. This isn't by the book but it's how I explain it to people - Excel 2003 was limited. It worked but had hard-coded limitations. 2007 was allowed to take (had access to but didn't just take) additional memory and resources, whereas Excel 2010 feels like it tries to take any/all available resources (up to its limits) even if it doesn't need them. By overbooking resources, Excel 2010 is probably slowing itself down trying to multi-thread on virtual (non-existent) threads. As NetTecture said, you are probably causing lag/latency just by queuing up thread processing.

    Do your ms times reflect latency to your broker, latency to call the data from your broker, or is this 100% internal to this machine/network? Is your latency solely down to the machine or is there networking involved too?

    At times you are running into memory issues as well. Are you overbooking memory too?

    I have a dual-Xeon 5400 series (2x 5482 CPUs) with 32GB RAM running ESXi that you could try sooner if that helps. Those CPUs don't have hyper threading and I'm happy to shut down the other machines and give you a few XP or W7 machines if it helps. I'm trying to sell it so it could be a short window that it's available.

    Even in an HA Cluster (high availability) you can't share resources across multiple Hosts. The limits today are heterogeneous high-availability clusters vs. pulling CPU resources from another machine into yours.

    PocketChange - what are you running? A lot of people sell semi knock-off "cloud computing" or even Linux "DIY supercomputer" software and that could be slowing you down as well.

    (Didn't know there was an old bug - I guess that's how new I am to this space. I've only ever run ESXi 5.0)

    Have to agree. On the production machines I book to about 75-80% of physical layer. I wasn't running VMs back during the flash crash but I know a few guys who were monitoring machine loads and if we had another flash crash my 75-80% wouldn't be enough but it would be much better than if I was already overbooked.
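
    The 75-80% rule of thumb above can be written down as a small check. This is only a sketch: the host size and vCPU counts are invented for the example, and `booking_ratio` / `is_within_budget` are hypothetical helper names, not part of any VMware tooling.

```python
def booking_ratio(physical_cores, vm_vcpus):
    """Fraction of the host's physical cores that has been promised to VMs."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vm_vcpus) / physical_cores


def is_within_budget(physical_cores, vm_vcpus, max_ratio=0.80):
    """True if the allocation stays at or below the chosen booking ceiling."""
    return booking_ratio(physical_cores, vm_vcpus) <= max_ratio


HOST_CORES = 12                   # hypothetical dual six-core host
TRADING_VMS = [2, 2, 2, 2, 1]     # hypothetical vCPU counts, 9 in total
OVERBOOKED_VMS = [4, 4, 4, 2]     # 14 vCPUs promised on 12 cores

trading_ratio = booking_ratio(HOST_CORES, TRADING_VMS)
print(f"trading host booked to {trading_ratio:.0%}")
print("within budget:", is_within_budget(HOST_CORES, TRADING_VMS))
print("overbooked example ok:", is_within_budget(HOST_CORES, OVERBOOKED_VMS))
```

    The same check could be extended to RAM by passing memory sizes instead of core counts.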
  8. UPDATE:

    Did a little bit of QoS work and played around with the vSwitch, and now things are consistent and slightly better - but I really need to cut this in half.

    Is this possible - well I know it's possible but can anyone here give me a few pointers given the info I posted earlier about my hardware and specs?

    I feel like a few clicks of the mouse and being able to shave off a tenth of a millisecond is pretty good - but we always go after the tall poles first. How do I shave off another tenth this easily?

    Also would like to be CONSISTENT. In 2 out of 3 ping sets I'm getting something in the 0.450+ ms range. If I can consolidate and keep my highs (and lows) even, I can deliver cleaner data. So I guess I'm asking both how to cut latency in VMware ESXi 5.0 and how to make things more consistent.

    Going to try changing machine config so that each has one CPU vs. two and allow the HV to decide when to invoke the 2nd HW CPU.


    PING ( from 56 data bytes
    64 bytes from icmp_seq=0 ttl=128 time=0.348 ms
    64 bytes from icmp_seq=1 ttl=128 time=0.299 ms
    64 bytes from icmp_seq=2 ttl=128 time=0.303 ms
    64 bytes from icmp_seq=3 ttl=128 time=0.311 ms
    64 bytes from icmp_seq=4 ttl=128 time=0.287 ms
    64 bytes from icmp_seq=5 ttl=128 time=0.318 ms
    64 bytes from icmp_seq=6 ttl=128 time=0.276 ms
    64 bytes from icmp_seq=7 ttl=128 time=0.295 ms
    64 bytes from icmp_seq=8 ttl=128 time=0.264 ms
    64 bytes from icmp_seq=9 ttl=128 time=0.302 ms

    PING ( from 56 data bytes
    64 bytes from icmp_seq=0 ttl=128 time=0.442 ms
    64 bytes from icmp_seq=1 ttl=128 time=0.254 ms
    64 bytes from icmp_seq=2 ttl=128 time=0.165 ms
    64 bytes from icmp_seq=3 ttl=128 time=0.261 ms
    64 bytes from icmp_seq=4 ttl=128 time=0.294 ms
    64 bytes from icmp_seq=5 ttl=128 time=0.270 ms
    64 bytes from icmp_seq=6 ttl=128 time=0.328 ms
    64 bytes from icmp_seq=7 ttl=128 time=0.291 ms
    64 bytes from icmp_seq=8 ttl=128 time=0.324 ms
    64 bytes from icmp_seq=9 ttl=128 time=0.285 ms

    PING ( from 56 data bytes
    64 bytes from icmp_seq=0 ttl=128 time=0.355 ms
    64 bytes from icmp_seq=1 ttl=128 time=0.194 ms
    64 bytes from icmp_seq=2 ttl=128 time=0.286 ms
    64 bytes from icmp_seq=3 ttl=128 time=0.479 ms
    64 bytes from icmp_seq=4 ttl=128 time=0.278 ms
    64 bytes from icmp_seq=5 ttl=128 time=0.324 ms
    64 bytes from icmp_seq=6 ttl=128 time=0.263 ms
    64 bytes from icmp_seq=7 ttl=128 time=0.295 ms
    64 bytes from icmp_seq=8 ttl=128 time=0.339 ms
    64 bytes from icmp_seq=9 ttl=128 time=0.387 ms
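
    To put a number on "consistent", a small sketch that pulls the times out of pasted ping output and reports the spread. It uses the third set above exactly as posted; the 0.45 ms spike threshold comes from the "0.450+ ms" remark, and the function names are made up for this example.

```python
import re

# Third ping set from the post, pasted as-is (the addresses were already missing).
RAW_PING = """\
PING ( from 56 data bytes
64 bytes from icmp_seq=0 ttl=128 time=0.355 ms
64 bytes from icmp_seq=1 ttl=128 time=0.194 ms
64 bytes from icmp_seq=2 ttl=128 time=0.286 ms
64 bytes from icmp_seq=3 ttl=128 time=0.479 ms
64 bytes from icmp_seq=4 ttl=128 time=0.278 ms
64 bytes from icmp_seq=5 ttl=128 time=0.324 ms
64 bytes from icmp_seq=6 ttl=128 time=0.263 ms
64 bytes from icmp_seq=7 ttl=128 time=0.295 ms
64 bytes from icmp_seq=8 ttl=128 time=0.339 ms
64 bytes from icmp_seq=9 ttl=128 time=0.387 ms
"""

TIME_RE = re.compile(r"time=(\d+(?:\.\d+)?) ms")


def parse_ping_times(text):
    """Extract every 'time=<value> ms' reading from ping output, in order."""
    return [float(value) for value in TIME_RE.findall(text)]


def consistency_report(times_ms, spike_threshold_ms):
    """Summarize how tightly grouped a set of round-trip times is."""
    return {
        "count": len(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
        "spread": max(times_ms) - min(times_ms),
        "spikes": sum(1 for t in times_ms if t > spike_threshold_ms),
    }


times = parse_ping_times(RAW_PING)
report = consistency_report(times, spike_threshold_ms=0.45)
for key, value in report.items():
    print(f"{key}: {value}")
```

    Running the same report on each set before and after a config change gives a like-for-like view of both the average and the spread.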
  9. Just a short answer.

    Hyper-V is Hyper-V ... the Microsoft hypervisor, built on the Xen model (and sources, from what I hear). So, sorry, my VMware knowledge is limited. I just know Hyper-V allows Intel to bypass the whole virtual switch layer and push directly into the VM's buffer memory for faster times. The Intel NICs are the ones supporting it, but only a little (2 queues per port). 10G cards are better (10 queues per port).

    This can be critical, but it also requires special drivers from Intel (the cards work without them, just not in fast mode) as well as configuration (enabling it). ;) If VMware does not have something similar, you eat CPU switching Ethernet packets, which is not good.

    Btw., there are 2 companies offering multi-machine VMs. The VM image spans multiple computers. Makes LITTLE sense - you need high-end machines to start with, plus InfiniBand in the backend for shared memory. I fail to see many use cases where I am not better off clustering the machines on the application layer. Especially with the power I can get today from a mid-range machine (dual socket).

    I personally have 3+ machines at work now: a dual Opteron (2x4 cores, 64 GB), 2 Phenoms with 16 GB each, and I am just setting up a 3930 with 32 GB for running backtests. 2 locations (a data center to be retired and my office). Nothing exchange-close YET - one reason being I lack a decent provider. That shall come ;) I am, though, totally MS based - including my hypervisor - so I have not much idea about VMware, or Linux. It is just that there are great features at MS IF (!) the card supports them, and you are pretty much stuck with Intel (not bad) for advanced features.

    http://www.vmware.com/files/pdf/perf_comparison_virtual_network_devices_wp.pdf may help you a little here.

    Just make sure you don't overbook and keep the physical layer under control ;) It is quite hard to see anything from within a VM.
  10. Have you tried the Hyper-V inside Windows 8 or Server 8 yet? It's supposed to be pretty good, except I haven't installed it on bare metal yet so I can't test a VM from inside a VM (or don't know how to).
    #10     Feb 15, 2012