Why are slots so slow?
One of the points in Armin Ronacher's The Python I Would Like To See is that slots are slow. That is, A() + A() is slower than A().__add__(A()) in the context of the following:
class A(object):
    def __add__(self, other):
        return 42
I'd like to investigate this claim for myself. To begin, let us repeat the experiment and see whether we get the same result:
$ cat x.py
class A(object):
    def __add__(self, other):
        return 42
$ ./python.exe
Python 3.5.0a4+ (default, Apr 25 2015, 21:57:28)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.49)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from x import A
>>> a = A()
>>> b = A()
>>> a + b
42
>>> quit()
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
1000000 loops, best of 3: 0.215 usec per loop
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a.__add__(b)'
10000000 loops, best of 3: 0.113 usec per loop
It would seem that Armin's claim stands up; a + b is indeed considerably slower than a.__add__(b).
First of all, an implicit assumption of Armin's claim is that a + b should be equivalent to a.__add__(b). Let us check this assumption by asking what a + b actually means in Python. The documentation for + is probably a good place to start:
The + (addition) operator yields the sum of its arguments. The arguments must either both be numbers or both be sequences of the same type. In the former case, the numbers are converted to a common type and then added together. In the latter case, the sequences are concatenated.
Well, uh, that doesn't explain the observed behaviour of a + b giving 42. Perhaps the documentation for __add__ will shed some light on the situation:
These methods are called to implement the binary arithmetic operations (+, [...]). For instance, to evaluate the expression x + y, where x is an instance of a class that has an __add__() method, x.__add__(y) is called. [...] If one of those methods does not support the operation with the supplied arguments, it should return NotImplemented.
Well, that explains the observed behaviour, and seems to pretty much straight up say that a + b means a.__add__(b). However, let's not get ahead of ourselves. On the off chance that it is relevant, let's consider the documentation for __radd__:
These methods are called to implement the binary arithmetic operations (+, [...]) with reflected (swapped) operands. These functions are only called if the left operand does not support the corresponding operation and the operands are of different types. For instance, to evaluate the expression x - y, where y is an instance of a class that has an __rsub__() method, y.__rsub__(x) is called if x.__sub__(y) returns NotImplemented.
Well, whad'ya know, it was relevant. With this extra bit of information, it seems like a + b is equivalent to something like:
if __add__ in a:
    tmp = a.__add__(b)
else:
    tmp = NotImplemented
if tmp is NotImplemented and type(a) != type(b):
    return b.__radd__(a)
else:
    return tmp
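To see that reflected path in action, here is a small example of my own (the class names Left and Right are purely illustrative):

class Left(object):
    def __add__(self, other):
        return NotImplemented            # decline the operation

class Right(object):
    def __radd__(self, other):
        return 'handled by Right.__radd__'

print(Left() + Right())                  # Left.__add__ declines, so Right.__radd__ is called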
Of course, the story doesn't end there; immediately after the piece of documentation quoted above is the following gem:
Note: If the right operand’s type is a subclass of the left operand’s type and that subclass provides the reflected method for the operation, this method will be called before the left operand’s non-reflected method. This behavior allows subclasses to override their ancestors’ operations.
Bearing this in mind, maybe a + b is equivalent to something like:
if issubclass(type(b), type(a)) and __radd__ in b:
    tmp = b.__radd__(a)
    if tmp is not NotImplemented:
        return tmp
if __add__ in a:
    tmp = a.__add__(b)
else:
    tmp = NotImplemented
if tmp is NotImplemented and type(a) != type(b):
    return b.__radd__(a)
else:
    return tmp
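Here is a small illustrative example of that rule (Base and Derived are names of my own choosing):

class Base(object):
    def __add__(self, other):
        return 'Base.__add__'

class Derived(Base):
    def __radd__(self, other):
        return 'Derived.__radd__'

# The right operand's type is a subclass of the left operand's type and
# provides the reflected method, so Derived.__radd__ wins even though
# Base.__add__ would have happily returned a result.
print(Base() + Derived())                # -> Derived.__radd__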
I wish that the above were the full story, but alas it is not. Let us pluck another link out of thin air, this time to the documentation on special method lookup:
For custom classes, implicit invocations of special methods are only guaranteed to work correctly if defined on an object’s type, not in the object’s instance dictionary. [...] In addition to bypassing any instance attributes in the interest of correctness, implicit special method lookup generally also bypasses the __getattribute__() method even of the object’s metaclass.
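A quick illustrative demonstration of that rule, reusing the A from x.py (the lambda is just a stand-in to show the difference):

from x import A

a, b = A(), A()
a.__add__ = lambda other: 'instance attribute'   # shadow __add__ on the instance only

print(a + b)          # 42: implicit special method lookup goes through type(a) only
print(a.__add__(b))   # 'instance attribute': explicit attribute access finds the
                      # instance attribute first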
I like to interpret that paragraph as saying: bugger it, a + b means whatever the CPython interpreter does for a + b. Having studied the interpreter, the meaning of a + b is equivalent to something along the lines of the following:
def get(x, field):
    try:
        return getattr(type(x), field)  # Doesn't call __getattribute__
    except AttributeError:
        return None

def has(x, field):
    return get(x, field) is not None

# From now on, `x.__yzw__` means `get(x, '__yzw__')`
# and `__abc__ in d` means `has(d, '__abc__')`

def tp_add_slot(x):
    if x is a builtin type or a type from a C extension:
        return ?
    elif __add__ in x or __radd__ in x:
        return slot_nb_add
    else:
        return None

def sq_concat_slot(x):
    return ?

def slot_nb_add(x, y):
    do_other = type(x) != type(y) and tp_add_slot(y) == slot_nb_add and __radd__ in y
    if tp_add_slot(x) == slot_nb_add:
        if do_other and issubclass(type(y), type(x)) and (__radd__ not in x or x.__radd__ != y.__radd__):
            tmp = y.__radd__(x)
            if tmp is NotImplemented:
                do_other = False
            else:
                return tmp
        if __add__ in x:
            tmp = x.__add__(y)
            if tmp is not NotImplemented:
                return tmp
    if do_other:
        return y.__radd__(x)
    return NotImplemented

slota = tp_add_slot(a)
slotb = tp_add_slot(b)
slotc = sq_concat_slot(a)
if slota == slotb:
    return slota(a, b)
if slota is not None and slotb is not None and issubclass(type(b), type(a)):
    tmp = slotb(a, b)
    if tmp is NotImplemented:
        slotb = None
    else:
        return tmp
if slota is not None:
    tmp = slota(a, b)
    if tmp is not NotImplemented:
        return tmp
if slotb is not None:
    tmp = slotb(a, b)
    if tmp is not NotImplemented:
        return tmp
if slotc is None:
    raise error
else:
    return slotc(a, b)
The conclusion of the above exploration is that a + b has a rather more nuanced meaning than just a.__add__(b). If we accept this conclusion, then perhaps it shouldn't be surprising that a + b is slower than a.__add__(b). However, in our case a and b are the same type, so the above pseudo-code should pretty quickly conclude that a + b means just a.__add__(b).
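Before moving on, here is a small example of my own (the class name Weird is made up) showing that the two spellings really can diverge: when __add__ declines, the operator machinery keeps going and eventually raises, whereas the explicit call just hands the sentinel back.

class Weird(object):
    def __add__(self, other):
        return NotImplemented

w = Weird()

print(w.__add__(w))      # the explicit call simply returns NotImplemented
try:
    w + w                # the operator finds no other handler...
except TypeError as e:
    print(e)             # ...and raises TypeError instead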
Let us consider an alternative conclusion: the people behind the CPython interpreter have spent more time optimising a.__add__(b) than they have spent optimising a + b. To test this hypothesis, we need to dig into the bytecode of these two expressions. If we ignore the bytecode which is common to both expressions, then we can say that a.__add__(b) consists of two bytecode instructions (LOAD_ATTR and CALL_FUNCTION), while a + b consists of just a single bytecode instruction (BINARY_ADD).
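One way to see this for ourselves is to disassemble the two expressions. Here is a small sketch (the exact instruction names and offsets depend on the interpreter version; much later versions, for example, fold BINARY_ADD into a generic BINARY_OP):

import dis

# `a + b`: just the loads of `a` and `b`, followed by a single BINARY_ADD.
dis.dis(compile('a + b', '<demo>', 'eval'))

# `a.__add__(b)`: the same loads, plus a LOAD_ATTR for `__add__` and a CALL_FUNCTION.
dis.dis(compile('a.__add__(b)', '<demo>', 'eval'))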
Let's begin with a.__add__(b) and look at what happens when the bytecode is executed:
1. Begin LOAD_ATTR instruction.
2. Call PyObject_GetAttr.
3. Call PyObject_GenericGetAttr (via the tp_getattro slot in type(a)).
4. Call _PyObject_GenericGetAttrWithDict.
5. Call _PyType_Lookup.
6. Successfully find __add__ in the method cache.
7. Return a function object to _PyObject_GenericGetAttrWithDict.
8. Call func_descr_get (via the tp_descr_get slot in the type of the function object).
9. Call PyMethod_New (to bind a to the first argument of the function).
10. Return a method object from PyObject_GetAttr.
11. Push the method object onto the stack.
12. End LOAD_ATTR instruction.
13. Begin CALL_FUNCTION instruction.
14. Call call_function.
15. Realise that we have a method object.
16. Replace the stack entry underneath b with the bound argument from the method object.
17. Call fast_function with the function from the method object.
18. Call PyFrame_New to create a new stack frame.
19. Call PyEval_EvalFrameEx to actually evaluate our __add__ code.
20. Return 42 from call_function.
21. Free the method object.
22. End CALL_FUNCTION instruction.
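Steps 8 and 9 above are the descriptor protocol at work: the plain function stored on the class is wrapped in a fresh bound method object every time the attribute is fetched. A small sketch of how this looks from Python, reusing the A from x.py:

from x import A

a = A()
bound = a.__add__                    # triggers the tp_descr_get slot (func_descr_get / PyMethod_New)
print(type(bound))                   # a method object
print(bound.__func__ is A.__add__)   # True: the underlying function stored on the class
print(bound.__self__ is a)           # True: `a` is pre-bound as the first argument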
On the other hand, for a + b, we have:
1. Begin BINARY_ADD instruction.
2. Call PyNumber_Add.
3. Call binary_op1.
4. Call slot_nb_add (via the tp_add slot in type(a)) [slot_nb_add is defined via a SLOT1BIN macro].
5. Call call_maybe(a, &Id__add__, "(O)", b) [call_maybe is variadic after the string parameter].
6. Call lookup_maybe [lookup_maybe is like PyObject_GetAttr, but only looks in the type, and doesn't invoke __getattribute__].
7. Call _PyType_LookupId.
8. Call _PyUnicode_FromId [this converts Id__add__ into a PyObject representing "__add__", but this conversion is cached, and therefore effectively free].
9. Call _PyType_Lookup [as in step 5 above].
10. Successfully find __add__ in the method cache [as in step 6 above].
11. Return a function object from _PyType_LookupId.
12. Call func_descr_get (via the tp_descr_get slot in the type of the function object) [as in step 8 above].
13. Call PyMethod_New (to bind a to the first argument of the function) [as in step 9 above].
14. Return a method object from lookup_maybe.
15. Call Py_VaBuildValue, passing the string literal "(O)" and a reference to call_maybe's variadic arguments.
16. Do a tonne of string literal parsing and variadic argument fetching and tuple construction, resulting in Py_VaBuildValue eventually returning a singleton tuple containing b.
17. Call method_call (via the tp_call slot in the type of the method object).
18. Construct a new two-element tuple, filling it with the bound argument from the method object, and the contents of the previously constructed singleton tuple. In other words, we now have the tuple (a, b).
19. Call function_call (via the tp_call slot in the type of the function from the method object).
20. Call PyEval_EvalCodeEx.
21. Call _PyEval_EvalCodeWithName.
22. Call PyFrame_New to create a new stack frame [as in step 18 above].
23. Call PyEval_EvalFrameEx to actually evaluate our __add__ code [as in step 19 above].
24. Return 42 from function_call.
25. Free the two-element tuple.
26. Return 42 from method_call.
27. Free the singleton tuple.
28. Free the method object.
29. Return 42 from PyNumber_Add.
30. End BINARY_ADD instruction.
One obvious difference is that a + b does far more manipulation of tuples and of variadic arguments. Given that call_maybe is always called with a format of "(O)", let's acknowledge this by changing its signature to be fixed-arg rather than vararg, and also construct the argument tuple via PyTuple_New / PyTuple_SET_ITEM rather than Py_VaBuildValue:
diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 4b99287..d27cc07 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1465,29 +1465,22 @@ call_method(PyObject *o, _Py_Identifier *nameid, char *format, ...)
/* Clone of call_method() that returns NotImplemented when the lookup fails. */
static PyObject *
-call_maybe(PyObject *o, _Py_Identifier *nameid, char *format, ...)
+call_maybe(PyObject *o, _Py_Identifier *nameid, PyObject* p)
{
- va_list va;
PyObject *args, *func = 0, *retval;
- va_start(va, format);
func = lookup_maybe(o, nameid);
if (func == NULL) {
- va_end(va);
if (!PyErr_Occurred())
Py_RETURN_NOTIMPLEMENTED;
return NULL;
}
- if (format && *format)
- args = Py_VaBuildValue(format, va);
- else
- args = PyTuple_New(0);
-
- va_end(va);
-
+ args = PyTuple_New(1);
if (args == NULL)
return NULL;
+ PyTuple_SET_ITEM(args, 0, p);
+ Py_XINCREF(p);
assert(PyTuple_Check(args));
retval = PyObject_Call(func, args, NULL);
@@ -5624,20 +5617,20 @@ FUNCNAME(PyObject *self, PyObject *other) \
if (do_other && \
PyType_IsSubtype(Py_TYPE(other), Py_TYPE(self)) && \
method_is_overloaded(self, other, &rop_id)) { \
- r = call_maybe(other, &rop_id, "(O)", self); \
+ r = call_maybe(other, &rop_id, self); \
if (r != Py_NotImplemented) \
return r; \
Py_DECREF(r); \
do_other = 0; \
} \
- r = call_maybe(self, &op_id, "(O)", other); \
+ r = call_maybe(self, &op_id, other); \
if (r != Py_NotImplemented || \
Py_TYPE(other) == Py_TYPE(self)) \
return r; \
Py_DECREF(r); \
} \
if (do_other) { \
- return call_maybe(other, &rop_id, "(O)", self); \
+ return call_maybe(other, &rop_id, self); \
} \
Py_RETURN_NOTIMPLEMENTED; \
}
This gives a nice little speedup; we're down from 0.215 usec to 0.176 usec:
$ make python.exe
...
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
10000000 loops, best of 3: 0.176 usec per loop
We're still falling somewhat short of the 0.113 usec time set by a.__add__(b), so let's copy step 15 of a.__add__(b) and special-case method objects:
diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 4b99287..2cd8e23 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1465,36 +1465,43 @@ call_method(PyObject *o, _Py_Identifier *nameid, char *format, ...)
/* Clone of call_method() that returns NotImplemented when the lookup fails. */
static PyObject *
-call_maybe(PyObject *o, _Py_Identifier *nameid, char *format, ...)
+call_maybe(PyObject *o, _Py_Identifier *nameid, PyObject* p)
{
- va_list va;
- PyObject *args, *func = 0, *retval;
- va_start(va, format);
+ PyObject *args[2], *func = 0, *retval, *tuple;
+ int na = 1;
func = lookup_maybe(o, nameid);
if (func == NULL) {
- va_end(va);
if (!PyErr_Occurred())
Py_RETURN_NOTIMPLEMENTED;
return NULL;
}
- if (format && *format)
- args = Py_VaBuildValue(format, va);
- else
- args = PyTuple_New(0);
-
- va_end(va);
-
- if (args == NULL)
- return NULL;
+ args[1] = p;
+ if (PyMethod_Check(func) && PyMethod_GET_SELF(func) != NULL) {
+ PyObject *mself = PyMethod_GET_SELF(func);
+ PyObject *mfunc = PyMethod_GET_FUNCTION(func);
+ args[0] = mself;
+ na = 2;
+ Py_INCREF(mfunc);
+ Py_DECREF(func);
+ func = mfunc;
+ } else {
+ args[0] = NULL;
+ }
- assert(PyTuple_Check(args));
- retval = PyObject_Call(func, args, NULL);
+ tuple = PyTuple_New(na);
+ if (tuple == NULL) {
+ retval = NULL;
+ } else {
+ memcpy(((PyTupleObject *)tuple)->ob_item, args, sizeof(PyObject*) * na);
+ Py_XINCREF(args[0]);
+ Py_XINCREF(args[1]);
+ retval = PyObject_Call(func, tuple, NULL);
+ Py_DECREF(tuple);
+ }
- Py_DECREF(args);
Py_DECREF(func);
-
return retval;
}
@@ -5624,20 +5631,20 @@ FUNCNAME(PyObject *self, PyObject *other) \
if (do_other && \
PyType_IsSubtype(Py_TYPE(other), Py_TYPE(self)) && \
method_is_overloaded(self, other, &rop_id)) { \
- r = call_maybe(other, &rop_id, "(O)", self); \
+ r = call_maybe(other, &rop_id, self); \
if (r != Py_NotImplemented) \
return r; \
Py_DECREF(r); \
do_other = 0; \
} \
- r = call_maybe(self, &op_id, "(O)", other); \
+ r = call_maybe(self, &op_id, other); \
if (r != Py_NotImplemented || \
Py_TYPE(other) == Py_TYPE(self)) \
return r; \
Py_DECREF(r); \
} \
if (do_other) { \
- return call_maybe(other, &rop_id, "(O)", self); \
+ return call_maybe(other, &rop_id, self); \
} \
Py_RETURN_NOTIMPLEMENTED; \
}
This gives another nice little speedup; we're down from 0.176 usec to 0.155 usec:
$ make python.exe
...
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
10000000 loops, best of 3: 0.155 usec per loop
Even better would be to also pull the fast_function trick that the interpreter does at step 17 in order to call a function without creating any argument tuples at all:
diff --git a/Include/ceval.h b/Include/ceval.h
index 6811367..f0997ac 100644
--- a/Include/ceval.h
+++ b/Include/ceval.h
@@ -10,6 +10,9 @@ extern "C" {
PyAPI_FUNC(PyObject *) PyEval_CallObjectWithKeywords(
PyObject *, PyObject *, PyObject *);
+PyAPI_FUNC(PyObject *)
+PyEval_FastFunction(PyObject *func, PyObject **stack, int n);
+
/* Inline this */
#define PyEval_CallObject(func,arg) \
PyEval_CallObjectWithKeywords(func, arg, (PyObject *)NULL)
diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 4b99287..6419ea2 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1465,36 +1465,47 @@ call_method(PyObject *o, _Py_Identifier *nameid, char *format, ...)
/* Clone of call_method() that returns NotImplemented when the lookup fails. */
static PyObject *
-call_maybe(PyObject *o, _Py_Identifier *nameid, char *format, ...)
+call_maybe(PyObject *o, _Py_Identifier *nameid, PyObject* p)
{
- va_list va;
- PyObject *args, *func = 0, *retval;
- va_start(va, format);
+ PyObject *args[2], *func = 0, *retval;
+ int na = 1;
func = lookup_maybe(o, nameid);
if (func == NULL) {
- va_end(va);
if (!PyErr_Occurred())
Py_RETURN_NOTIMPLEMENTED;
return NULL;
}
- if (format && *format)
- args = Py_VaBuildValue(format, va);
- else
- args = PyTuple_New(0);
-
- va_end(va);
-
- if (args == NULL)
- return NULL;
+ args[1] = p;
+ if (PyMethod_Check(func) && PyMethod_GET_SELF(func) != NULL) {
+ PyObject *mself = PyMethod_GET_SELF(func);
+ PyObject *mfunc = PyMethod_GET_FUNCTION(func);
+ args[0] = mself;
+ na = 2;
+ Py_INCREF(mfunc);
+ Py_DECREF(func);
+ func = mfunc;
+ } else {
+ args[0] = NULL;
+ }
- assert(PyTuple_Check(args));
- retval = PyObject_Call(func, args, NULL);
+ if (PyFunction_Check(func)) {
+ retval = PyEval_FastFunction(func, &args[2], na);
+ } else {
+ PyObject* tuple = PyTuple_New(na);
+ if (tuple == NULL) {
+ retval = NULL;
+ } else {
+ memcpy(((PyTupleObject *)tuple)->ob_item, args, sizeof(PyObject*) * na);
+ Py_XINCREF(args[0]);
+ Py_XINCREF(args[1]);
+ retval = PyObject_Call(func, tuple, NULL);
+ Py_DECREF(tuple);
+ }
+ }
- Py_DECREF(args);
Py_DECREF(func);
-
return retval;
}
@@ -5624,20 +5635,20 @@ FUNCNAME(PyObject *self, PyObject *other) \
if (do_other && \
PyType_IsSubtype(Py_TYPE(other), Py_TYPE(self)) && \
method_is_overloaded(self, other, &rop_id)) { \
- r = call_maybe(other, &rop_id, "(O)", self); \
+ r = call_maybe(other, &rop_id, self); \
if (r != Py_NotImplemented) \
return r; \
Py_DECREF(r); \
do_other = 0; \
} \
- r = call_maybe(self, &op_id, "(O)", other); \
+ r = call_maybe(self, &op_id, other); \
if (r != Py_NotImplemented || \
Py_TYPE(other) == Py_TYPE(self)) \
return r; \
Py_DECREF(r); \
} \
if (do_other) { \
- return call_maybe(other, &rop_id, "(O)", self); \
+ return call_maybe(other, &rop_id, self); \
} \
Py_RETURN_NOTIMPLEMENTED; \
}
diff --git a/Python/ceval.c b/Python/ceval.c
index 2f3d3ad..bf6aedc 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -4329,6 +4329,12 @@ call_function(PyObject ***pp_stack, int oparg
return x;
}
+PyAPI_FUNC(PyObject *)
+PyEval_FastFunction(PyObject *func, PyObject **stack, int n)
+{
+ return fast_function(func, &stack, n, n, 0);
+}
+
/* The fast_function() function optimize calls for which no argument
tuple is necessary; the objects are passed directly from the stack.
For the simplest case -- a function that takes only positional
And with that, we're down from 0.155 usec to 0.113 usec:
$ make python.exe
...
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
10000000 loops, best of 3: 0.113 usec per loop
So, it seems that slots aren't intrinsically slow. Provided that the implementation of slots in typeobject.c is taught the exact same tricks that the interpreter uses, they run at exactly the same speed as the explicit method call. We could even go further and elide construction of the method object entirely:
diff --git a/Include/ceval.h b/Include/ceval.h
index 6811367..f0997ac 100644
--- a/Include/ceval.h
+++ b/Include/ceval.h
@@ -10,6 +10,9 @@ extern "C" {
PyAPI_FUNC(PyObject *) PyEval_CallObjectWithKeywords(
PyObject *, PyObject *, PyObject *);
+PyAPI_FUNC(PyObject *)
+PyEval_FastFunction(PyObject *func, PyObject **stack, int n);
+
/* Inline this */
#define PyEval_CallObject(func,arg) \
PyEval_CallObjectWithKeywords(func, arg, (PyObject *)NULL)
diff --git a/Objects/typeobject.c b/Objects/typeobject.c
index 4b99287..c4ffa70 100644
--- a/Objects/typeobject.c
+++ b/Objects/typeobject.c
@@ -1465,36 +1465,64 @@ call_method(PyObject *o, _Py_Identifier *nameid, char *format, ...)
/* Clone of call_method() that returns NotImplemented when the lookup fails. */
static PyObject *
-call_maybe(PyObject *o, _Py_Identifier *nameid, char *format, ...)
+call_maybe(PyObject *o, _Py_Identifier *nameid, PyObject* p)
{
- va_list va;
- PyObject *args, *func = 0, *retval;
- va_start(va, format);
+ PyObject *args[2], *func = 0, *retval;
+ int na = 2;
- func = lookup_maybe(o, nameid);
+ args[1] = p;
+ func = _PyType_LookupId(Py_TYPE(o), nameid);
if (func == NULL) {
- va_end(va);
if (!PyErr_Occurred())
Py_RETURN_NOTIMPLEMENTED;
return NULL;
}
+ if (PyFunction_Check(func)) {
+ Py_INCREF(func);
+ args[0] = o;
+ retval = PyEval_FastFunction(func, &args[2], na);
+ } else {
+ descrgetfunc f = Py_TYPE(func)->tp_descr_get;
+ if (f == NULL) {
+ Py_INCREF(func);
+ } else {
+ func = f(func, o, (PyObject *)(Py_TYPE(o)));
+ if (func == NULL) {
+ if (!PyErr_Occurred())
+ Py_RETURN_NOTIMPLEMENTED;
+ return NULL;
+ }
+ }
- if (format && *format)
- args = Py_VaBuildValue(format, va);
- else
- args = PyTuple_New(0);
-
- va_end(va);
-
- if (args == NULL)
- return NULL;
-
- assert(PyTuple_Check(args));
- retval = PyObject_Call(func, args, NULL);
+ if (PyMethod_Check(func) && PyMethod_GET_SELF(func) != NULL) {
+ PyObject *mself = PyMethod_GET_SELF(func);
+ PyObject *mfunc = PyMethod_GET_FUNCTION(func);
+ args[0] = mself;
+ Py_INCREF(mfunc);
+ Py_DECREF(func);
+ func = mfunc;
+ } else {
+ args[0] = NULL;
+ na = 1;
+ }
+
+ if (PyFunction_Check(func)) {
+ retval = PyEval_FastFunction(func, &args[2], na);
+ } else {
+ PyObject* tuple = PyTuple_New(na);
+ if (tuple == NULL) {
+ retval = NULL;
+ } else {
+ memcpy(((PyTupleObject *)tuple)->ob_item, args, sizeof(PyObject*) * na);
+ Py_XINCREF(args[0]);
+ Py_XINCREF(args[1]);
+ retval = PyObject_Call(func, tuple, NULL);
+ Py_DECREF(tuple);
+ }
+ }
+ }
- Py_DECREF(args);
Py_DECREF(func);
-
return retval;
}
@@ -5624,20 +5652,20 @@ FUNCNAME(PyObject *self, PyObject *other) \
if (do_other && \
PyType_IsSubtype(Py_TYPE(other), Py_TYPE(self)) && \
method_is_overloaded(self, other, &rop_id)) { \
- r = call_maybe(other, &rop_id, "(O)", self); \
+ r = call_maybe(other, &rop_id, self); \
if (r != Py_NotImplemented) \
return r; \
Py_DECREF(r); \
do_other = 0; \
} \
- r = call_maybe(self, &op_id, "(O)", other); \
+ r = call_maybe(self, &op_id, other); \
if (r != Py_NotImplemented || \
Py_TYPE(other) == Py_TYPE(self)) \
return r; \
Py_DECREF(r); \
} \
if (do_other) { \
- return call_maybe(other, &rop_id, "(O)", self); \
+ return call_maybe(other, &rop_id, self); \
} \
Py_RETURN_NOTIMPLEMENTED; \
}
diff --git a/Python/ceval.c b/Python/ceval.c
index 2f3d3ad..bf6aedc 100644
--- a/Python/ceval.c
+++ b/Python/ceval.c
@@ -4329,6 +4329,12 @@ call_function(PyObject ***pp_stack, int oparg
return x;
}
+PyAPI_FUNC(PyObject *)
+PyEval_FastFunction(PyObject *func, PyObject **stack, int n)
+{
+ return fast_function(func, &stack, n, n, 0);
+}
+
/* The fast_function() function optimize calls for which no argument
tuple is necessary; the objects are passed directly from the stack.
For the simplest case -- a function that takes only positional
With this extra optimisation, we're down from 0.113 usec to 0.0972 usec:
$ make python.exe
...
$ ./python.exe -mtimeit -s 'from x import A; a = A(); b = A()' 'a + b'
10000000 loops, best of 3: 0.0972 usec per loop
In conclusion, slots don't need to be slow: the above diff makes them fast (at least for some binary operators; applying similar transformations to other slots is left as an exercise for the reader).