Monday, April 12, 2010

iPhone Templates in Google Docs

I have created a set of iPhone UI templates using Google Docs Drawings for sharing and collaborating on layout ideas.  You can access the template here.  If you are logged in to your Google account, you can create your own copy from the File menu.

Sunday, January 24, 2010

Micro-benchmarking iOS devices

Update: Added iPhone 5.
Update: Added iPhone 4s, iPad 3rd gen.
Update: Added iPhone 4, iPad 1st gen.

I follow the excellent weekly posts by Mike Ash, and entered a brief discussion in the comments about toll-free bridging, in particular the difference between calling a method via Objective-C (objc_msgSend) and its equivalent CoreFoundation C call.  Mike suggested adding it to his original suite of tests, which led to the following results.
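As a rough illustration of the kinds of call being compared, here is a C++ analogy, not the Objective-C runtime itself. `Counter`, `counter_bump`, `run_virtual`, `run_cached` and `check_dispatch_agrees` are hypothetical names made up for this sketch. The virtual call stands in for dynamic dispatch, and the call through a stored function pointer stands in for an IMP-cached send or a plain C API call.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical object with a dynamically dispatched method.
struct Counter {
    virtual ~Counter() = default;
    virtual void bump() { ++count; }
    std::uint64_t count = 0;
};

// Plain function taking the object explicitly, in the spirit of a C API call.
inline void counter_bump(Counter* c) { ++c->count; }

// Dynamic dispatch: the target is found through the vtable on each call
// (unless the compiler manages to devirtualize it).
inline std::uint64_t run_virtual(Counter& c, std::uint64_t iterations) {
    for (std::uint64_t i = 0; i < iterations; ++i) {
        c.bump();
    }
    return c.count;
}

// Cached dispatch: resolve the target once, then call through the pointer.
// This is only loosely comparable to caching an IMP in Objective-C.
inline std::uint64_t run_cached(Counter& c, std::uint64_t iterations) {
    void (*fn)(Counter*) = &counter_bump;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        fn(&c);
    }
    return c.count;
}

// Sanity check: both paths perform the same number of increments.
inline void check_dispatch_agrees() {
    Counter a, b;
    assert(run_virtual(a, 1000) == run_cached(b, 1000));
}
```

Timing each loop separately in an optimized build would give a comparison in the same spirit as the "C++ virtual method call" and "IMP-cached message send" rows, though the absolute numbers would not be comparable to the tables here.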

iPhone 5 (-mno-thumb)

Custom Apple A6 ARM Cortex A15, up to 1.2GHz

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 0.3 3.1
C++ virtual method call 100000000 0.3 3.3
Integer division 100000000 1.1 10.9
Objective-C message send 100000000 1.4 13.7
Float division with int conversion 10000000 0.2 24.7
Floating-point division 100000000 2.5 24.8
Objective-C objectAtIndex: 10000000 0.4 35.9
CF CFArrayGetValueAtIndex 10000000 0.5 50.9
16 byte memcpy 10000000 0.7 65.8
16 byte malloc/free 10000000 4.8 482.7
NSAutoreleasePool alloc/init/release 100000 0.1 533.4
NSObject alloc/init/release 100000 0.1 1169.0
NSInvocation message send 100000 0.1 1391.8
16MB malloc/free 1000 0.0 13331.8
Zero-second delayed perform 1000 0.1 99329.1
pthread create/join 100 0.0 120390.0
1MB memcpy 100 0.0 421517.1

iPhone 5 (thumb)

Custom Apple A6 ARM Cortex A15, up to 1.2GHz

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 0.3 3.1
C++ virtual method call 100000000 0.4 4.0
Integer division 100000000 1.1 10.8
Objective-C message send 100000000 1.4 13.6
Float division with int conversion 10000000 0.2 24.9
Floating-point division 100000000 2.6 26.4
Objective-C objectAtIndex: 10000000 0.4 35.6
CF CFArrayGetValueAtIndex 10000000 0.5 51.0
16 byte memcpy 10000000 0.7 65.8
16 byte malloc/free 10000000 4.7 474.3
NSAutoreleasePool alloc/init/release 100000 0.1 513.2
NSObject alloc/init/release 100000 0.1 1183.1
NSInvocation message send 100000 0.1 1241.5
16MB malloc/free 1000 0.0 12979.7
Zero-second delayed perform 1000 0.1 83574.5
pthread create/join 100 0.0 121289.2
1MB memcpy 100 0.0 426971.7

iPad 3 (thumb)

Apple A5x ARM Cortex A9 1000MHz

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 1.1 11.5
C++ virtual method call 100000000 1.3 12.7
Floating-point division 100000000 2.6 26.1
16 byte memcpy 10000000 0.3 26.2
Float division with int conversion 10000000 0.3 26.3
Integer division 100000000 2.9 28.7
Objective-C message send 100000000 3.6 35.5
Objective-C objectAtIndex: 10000000 0.7 69.0
CF CFArrayGetValueAtIndex 10000000 1.3 131.7
16 byte malloc/free 10000000 4.3 433.9
NSAutoreleasePool alloc/init/release 100000 0.1 600.2
NSObject alloc/init/release 100000 0.1 1235.4
NSInvocation message send 100000 0.3 2966.6
16MB malloc/free 1000 0.0 11633.0
Zero-second delayed perform 1000 0.1 121336.0
pthread create/join 100 0.0 130293.3
1MB memcpy 100 0.2 1662780.4

iPad 3 (-mno-thumb)

Apple A5x ARM Cortex A9 1000MHz

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 0.9 9.0
C++ virtual method call 100000000 1.1 11.1
Integer division 100000000 2.4 24.1
Floating-point division 100000000 2.6 26.1
Float division with int conversion 10000000 0.3 26.1
16 byte memcpy 10000000 0.3 27.1
Objective-C message send 100000000 2.7 27.2
Objective-C objectAtIndex: 10000000 0.7 68.4
CF CFArrayGetValueAtIndex 10000000 1.0 103.3
16 byte malloc/free 10000000 4.3 432.5
NSAutoreleasePool alloc/init/release 100000 0.1 570.4
NSObject alloc/init/release 100000 0.1 1209.9
NSInvocation message send 100000 0.2 1682.2
16MB malloc/free 1000 0.0 10251.2
pthread create/join 100 0.0 118494.2
Zero-second delayed perform 1000 0.1 121578.2
1MB memcpy 100 0.2 1635983.3

iPhone 4s (thumb)

Apple A5 ARM Cortex A9 ~800MHz

Name Iterations Total time (sec) Time per iteration (ns)
C++ virtual method call 100000000 1.1 11.3
IMP-cached message send 100000000 1.3 12.6
Integer division 100000000 3.1 31.4
Float division with int conversion 10000000 0.3 32.6
Floating-point division 100000000 3.3 32.6
16 byte memcpy 10000000 0.3 32.6
Objective-C message send 100000000 3.4 33.8
Objective-C objectAtIndex: 10000000 0.9 85.9
CF CFArrayGetValueAtIndex 10000000 1.6 165.0
16 byte malloc/free 10000000 5.4 542.2
NSAutoreleasePool alloc/init/release 100000 0.1 753.2
NSObject alloc/init/release 100000 0.2 1511.7
NSInvocation message send 100000 0.2 2111.9
16MB malloc/free 1000 0.0 19033.7
pthread create/join 100 0.0 142817.5
Zero-second delayed perform 1000 0.1 146302.7
1MB memcpy 100 0.2 1787482.1

iPhone 4 (-mthumb)

Apple A4 ARM Cortex A8 ~800MHz

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 0.9 9.0
C++ virtual method call 100000000 1.0 10.2
16 byte memcpy 10000000 0.4 36.2
Integer division 100000000 4.1 40.6
Objective-C message send 100000000 4.1 40.8
Floating-point division 10000000 0.9 89.4
Objective-C objectAtIndex: 10000000 1.1 105.8
Float division with int conversion 10000000 1.1 105.8
CF CFArrayGetValueAtIndex 10000000 1.7 168.1
NSInvocation message send 100000 0.1 550.8
16 byte malloc/free 10000000 6.6 656.3
NSAutoreleasePool alloc/init/release 100000 0.1 979.5
NSObject alloc/init/release 100000 0.4 4277.9
16MB malloc/free 1000 0.0 20406.7
pthread create/join 100 0.0 139971.2
Zero-second delayed perform 1000 0.2 243883.3
1MB memcpy 100 0.1 1150657.9

iPhone 3GS (-mthumb)

ARM Cortex A8 ~600MHz / 1.66 ns per cycle

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 1.2 11.7
C++ virtual method call 100000000 1.3 13.5
16 byte memcpy 10000000 0.5 46.0
Objective-C message send 100000000 5.4 53.9
Integer division 100000000 6.3 62.9
Floating-point division 10000000 1.2 117.4
Float division with int conversion 10000000 1.4 138.2
Objective-C objectAtIndex: 10000000 1.4 140.1
CF CFArrayGetValueAtIndex 10000000 2.2 220.0
16 byte malloc/free 10000000 6.4 642.6
NSInvocation message send 100000 0.1 723.0
NSAutoreleasePool alloc/init/release 100000 0.1 1305.9
NSObject alloc/init/release 100000 0.6 5743.7
16MB malloc/free 1000 0.0 16104.0
pthread create/join 100 0.0 185759.2
Zero-second delayed perform 1000 0.4 353519.4
1MB memcpy 100 0.2 2170179.2

iPhone 3GS (no thumb)

ARM Cortex A8 ~600MHz / 1.66 ns per cycle

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 1.2 11.8
C++ virtual method call 100000000 4.3 42.9
Objective-C message send 100000000 5.9 59.2
CF CFArrayGetValueAtIndex 10000000 1.0 97.9
Integer division 100000000 9.8 98.4
16 byte memcpy 10000000 1.1 109.3
Floating-point division 10000000 1.2 118.5
Objective-C objectAtIndex: 10000000 1.3 129.0
Float division with int conversion 10000000 1.4 142.6
16 byte malloc/free 10000000 7.5 748.6
NSInvocation message send 100000 0.1 806.0
NSObject alloc/init/release 100000 0.5 4793.1
NSAutoreleasePool alloc/init/release 100000 0.5 4953.1
16MB malloc/free 1000 0.0 17969.2
Zero-second delayed perform 1000 0.2 211840.4
pthread create/join 100 0.0 214742.5
1MB memcpy 100 0.3 3162774.6

iPhone 3G

ARM1176 ~412MHz / 2.4ns per cycle

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 3.9 38.6
C++ virtual method call 100000000 5.0 49.9
Floating-point division 10000000 0.8 81.3
Float division with int conversion 10000000 0.8 81.4
16 byte memcpy 10000000 1.4 136.0
Objective-C message send 100000000 14.9 148.6
Integer division 100000000 16.2 162.2
CF CFArrayGetValueAtIndex 10000000 2.0 201.7
Objective-C objectAtIndex: 10000000 4.2 418.3
NSInvocation message send 100000 0.2 1833.2
16 byte malloc/free 10000000 27.3 2729.8
NSObject alloc/init/release 100000 1.4 14179.1
NSAutoreleasePool alloc/init/release 100000 1.9 18956.7
16MB malloc/free 1000 0.0 47811.3
Zero-second delayed perform 1000 0.8 803419.3
pthread create/join 100 0.1 1085830.0
1MB memcpy 100 1.0 9902796.7

iPad (-mthumb)

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 0.7 7.1
C++ virtual method call 100000000 0.8 8.1
16 byte memcpy 10000000 0.3 27.7
Objective-C message send 100000000 3.2 32.3
Integer division 100000000 3.4 33.7
CF CFArrayGetValueAtIndex 10000000 0.6 58.8
Floating-point division 10000000 0.7 70.5
Objective-C objectAtIndex: 10000000 0.8 81.6
Float division with int conversion 10000000 0.8 83.1
16 byte malloc/free 10000000 3.6 357.8
NSInvocation message send 100000 0.0 470.8
NSAutoreleasePool alloc/init/release 100000 0.3 2957.0
NSObject alloc/init/release 100000 0.3 3080.2
16MB malloc/free 1000 0.0 14824.2
pthread create/join 100 0.0 127386.2
Zero-second delayed perform 1000 0.2 225271.3
1MB memcpy 100 0.1 1064566.2

iPad (-mno-thumb)

Apple A4 ARM Cortex A8 ~1GHz / 1 ns per cycle

Name Iterations Total time (sec) Time per iteration (ns)
IMP-cached message send 100000000 0.8 8.1
C++ virtual method call 100000000 2.2 21.8
16 byte memcpy 10000000 0.3 28.2
Objective-C message send 100000000 3.2 32.5
Integer division 100000000 3.4 33.9
CF CFArrayGetValueAtIndex 10000000 0.6 55.8
Floating-point division 10000000 0.7 70.9
Objective-C objectAtIndex: 10000000 0.8 81.6
Float division with int conversion 10000000 0.8 82.8
16 byte malloc/free 10000000 3.6 358.3
NSInvocation message send 100000 0.0 473.4
NSAutoreleasePool alloc/init/release 100000 0.3 3017.6
NSObject alloc/init/release 100000 0.3 3071.8
16MB malloc/free 1000 0.0 14623.6
pthread create/join 100 0.0 128674.6
Zero-second delayed perform 1000 0.3 255627.5
1MB memcpy 100 0.1 1063407.5

Note that I reduced the iteration counts from the original tests, so whilst the total times are significantly lower, the per-iteration times still reflect overall performance.  Compared to Mike's results, these show that the IMP-cached send is indeed faster, as expected, but only after I changed to a release build.  I also compiled these with Thumb disabled unless otherwise specified.

I've recently watched some iTunes U videos released by Apple on optimizing OpenGL ES 2.0, and a key takeaway was that code for the Cortex A8 architecture should always be compiled with Thumb enabled.  The Cortex CPU uses the newer Thumb-2 instruction set, which has native instructions for floating point.  The benefit of Thumb is reduced code size and potentially better performance through better use of the I-cache.
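The per-iteration column in the tables is simply the total time divided by the iteration count, which is why reducing the iterations leaves it comparable. A minimal sketch of such a timing loop follows. It is not Mike Ash's harness or the code used for these results; `BenchResult` and `run_benchmark` are hypothetical names, and `std::chrono::steady_clock` is an assumed choice of timer.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// One benchmark result, mirroring the table columns.
struct BenchResult {
    std::uint64_t iterations;
    double total_seconds;
    double ns_per_iteration;
};

// Hypothetical helper: calls body(i) `iterations` times and reports the
// total wall-clock time and the average time per iteration.
template <typename Body>
BenchResult run_benchmark(std::uint64_t iterations, Body body) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        body(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return BenchResult{iterations, total_ns / 1e9,
                       total_ns / static_cast<double>(iterations)};
}
```

The measured time includes the loop overhead itself, and an optimizing compiler may remove a body with no observable effect, so the body should write to something like a `volatile` variable. As noted above, the results only became meaningful in a release build.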

Observations

  • I've estimated the iPhone 4 CPU to be running at 800MHz.  Looking at the increased speed over the 3GS for a number of benchmarks, the average increase is 1.333x.  Multiplying 600MHz x 1.333 yields roughly 800MHz as the clock speed.
  • The IMP-cached message send is significantly faster on the newer Cortex CPU.  I have read of improvements in the branch prediction logic, which is particularly important due to the greater penalty of a misprediction in the longer A8 pipeline.  The code for executing the call is
      blx r8
    r8 contains the target address of the function, and remains so for the duration of the test.
  • For the 3GS, the Objective-C message send is very close to the C++ virtual method call.  I ran this test several times, and the behaviour didn't change.  The virtual method call is an indirect load of the pc register
      ldr pc, [r3]
    Without being able to access the PMC registers, I can't be sure of mispredictions; however, I know that 9 instructions are executed every iteration in the C++ test.  That suggests around 15ns / iteration, but we're at 42.9.  Adding an additional 13 cycles every iteration (21.58ns) for a misprediction would get us to 37ns / iteration, which is much closer.  Stepping in to the objc_msgSend function finds the cached method on the first pass, totaling 28 instructions per iteration.  Given there are significantly more instructions for the Objective-C call, we're probably seeing the benefits of the dual-issue architecture.
  • Memory performance of the 3GS is significantly higher than that of the 3G.  I've done some other micro-benchmarks, showing the 2nd gen (3G) at around 200MB/s and the 3rd gen (3GS) at around 800MB/s.  With some very well placed cache-preloads, I've actually pushed the ARM1176 to almost 300MB/s.
  • Calling the CoreFoundation equivalent of objectAtIndex: (CFArrayGetValueAtIndex) is 2x faster on older devices; however, the gap is less significant with the newer hardware.  We've seen significant improvements to the objc_msgSend performance on the 3GS, which undoubtedly is making up much of the gap.
  • Floating point performance for scalar operations is slightly slower on the newer device.
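The back-of-envelope arithmetic in the observations above can be written out as a small sketch. The function names are hypothetical, and the model assumes one instruction per cycle plus a fixed stall, which is the simplification implied by the 9-instruction estimate rather than a real pipeline model. It uses 1000/600 ns per cycle where the text rounds to 1.66, so the results differ slightly in the last digits.

```cpp
#include <cassert>

// Nanoseconds per clock cycle for a clock given in MHz (600MHz -> ~1.67ns).
inline double ns_per_cycle(double mhz) {
    assert(mhz > 0.0);
    return 1000.0 / mhz;
}

// Estimated time per iteration, assuming one instruction per cycle plus a
// fixed number of stall cycles (for example a branch misprediction penalty).
inline double estimate_ns(int instructions, int penalty_cycles, double mhz) {
    return (instructions + penalty_cycles) * ns_per_cycle(mhz);
}

// Clock estimate from an average speed-up over a device with a known clock.
inline double estimate_clock_mhz(double base_mhz, double speedup) {
    return base_mhz * speedup;
}
```

With these, `estimate_ns(9, 0, 600.0)` gives about 15ns, `estimate_ns(9, 13, 600.0)` gives roughly 37ns against the measured 42.9ns, and `estimate_clock_mhz(600.0, 1.333)` gives roughly 800MHz for the iPhone 4.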

Source code for this test is available here.