Discussion:
archer c7 v2, policing, hostapd, test openwrt build
Dave Taht
2015-03-23 00:24:05 UTC
so I had discarded policing for inbound traffic management a long
while back due to it not
handling varying RTTs very well, the burst parameter being hard, maybe
even impossible to tune, etc.

And I'd been encouraging other people to try it for a while, with no
luck. So anyway...

1) A piece of good news is - using the current versions of cake and
cake2, that I can get on linux 3.18/chaos calmer, on the archer c7v2
shaping 115mbit download with 12mbit upload... on a cable modem...
with 5% cpu to spare. I haven't tried a wndr3800 yet...

htb + fq_codel ran out of cpu at 94mbit...

2) On the same test rig I went back to try policing. With a 10k burst
parameter, it cut download rates in half...

However, with a 100k burst parameter, on the rrul and tcp_download
tests, at a very short RTT (ethernet) I did get full throughput and
lower latency.

How to try it:

run sqm with whatever settings you want. Then plunk in the right rate
below for your downlink.

tc qdisc del dev eth0 handle ffff: ingress
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip prio 50 u32 match ip
src 0.0.0.0/0 police rate 115000kbit burst 100k drop flowid :1
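
A quick way to check that the policer is actually matching and dropping (assuming the stock iproute2 tc) is to watch the filter statistics while a test runs:

tc -s filter show dev eth0 parent ffff:

The police action's sent/dropped counters there should climb during a download.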

I don't know how to have it match all traffic, including ipv6
traffic(anyone??), but that was encouraging.

However, the core problem with policing is that it doesn't handle
different RTTs very well, and the exact same settings on a 16ms
path.... cut download throughput by a factor of 10. - set to
115000kbit I got 16mbits on rrul. :(
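
For what it's worth, the usual policer-tuning rule of thumb (an assumption here, not something verified on this rig) is that burst needs to cover roughly rate x RTT, which a bit of shell arithmetic makes concrete for the 16ms path:

# kbit/s * ms / 8 = bytes of burst needed to cover one RTT at the policed rate
echo $(( 115000 * 16 / 8 ))    # -> 230000, i.e. ~230k

That is more than double the 100k burst that worked on the ethernet-RTT path, which is consistent with the collapse - and, as noted above, no single burst value can fit a mix of RTTs.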

Still...

I have long maintained it was possible to build a better fq_codel-like
policer without doing htb rate shaping, ("bobbie"), and I am tempted
to give it a go in the coming months. However I tend to think
backporting the FIB patches and making cake run faster might be more
fruitful. (or finding faster hardware)

3) There may be some low hanging fruit in how hostapd operates. Right
now I see it chewing up cpu, and when running, costing 50mbit of
throughput at higher rates, doing something odd, over and over again.

clock_gettime(CLOCK_MONOTONIC, {1240, 843487389}) = 0
recvmsg(12, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"\0\0\1\20\0\25\0\0\0\0\0\0\0\0\0\0;\1\0\0\0\10\0\1\0\0\0\1\0\10\0&"...,
16384}], msg_controllen=0, msg_flags=0}, 0) = 272
clock_gettime(CLOCK_MONOTONIC, {1240, 845060156}) = 0
clock_gettime(CLOCK_MONOTONIC, {1240, 845757477}) = 0
_newselect(19, [3 5 8 12 15 16 17 18], [], [], {3, 928211}) = 1 (in
[12], left {3, 920973})
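
For anyone wanting to dig into this, a per-syscall time summary is probably the quickest way to see what dominates (assuming strace is available on the router):

strace -c -p "$(pidof hostapd)"
# let it run for ~30 seconds during a throughput test, then ^C to print the summary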

I swear I'd poked into this and fixed it in cerowrt 3.10, but I guess
I'll have to go poking through the patch set. Something involving
random number obtaining, as best as I recall.

4) I got a huge improvement in p2p wifi tcp throughput between linux
3.18 and linux 3.18 + the minstrel-blues and andrew's minimum variance
patches - a jump of over 30% on the ubnt nanostation m5.

5) Aside from that, so far the archer hasn't crashed on me, but I
haven't tested the wireless much yet on that platform. My weekend's
test build:

http://snapon.lab.bufferbloat.net/~cero3/ubnt/ar71xx/
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
Jonathan Morton
2015-03-23 00:31:44 UTC
Post by Dave Taht
I don't know how to have it match all traffic, including ipv6
traffic(anyone??), but that was encouraging.
I use "protocol all u32 match u32 0 0”.

- Jonathan Morton
Jonathan Morton
2015-03-23 01:10:26 UTC
Post by Dave Taht
I swear I'd poked into this and fixed it in cerowrt 3.10, but I guess
I'll have to go poking through the patch set. Something involving
random number obtaining, as best as I recall.
If it’s reseeding an RNG using the current time, that’s fairly bad practice, especially if it’s for any sort of cryptographic purpose. For general purposes, seed a good RNG once before first use, using /dev/urandom, then just keep pulling values from it as needed. Or, if cryptographic quality is required, use an actual crypto library’s RNG.

- Jonathan Morton
Dave Taht
2015-03-23 01:18:04 UTC
I don't remember what I did. I remember sticking something in to
improve entropy, and ripping out a patch to hostapd that tried to
manufacture it.

right now I'm merely trying to stabilize the new bits - dnssec for dnsmasq 2.73rc1, babel-1.6 (not in this build) with procd and ipv6_subtrees and atomic updates, get someone F/T on the minstrel stuff, and get heads down on per-station queuing by mid-April. I was not expecting to make chaos calmer with the last at all and am still not, and next up is getting some profiling infrastructure in place that actually works....
Post by Jonathan Morton
Post by Dave Taht
I swear I'd poked into this and fixed it in cerowrt 3.10, but I guess
I'll have to go poking through the patch set. Something involving
random number obtaining, as best as I recall.
If it’s reseeding an RNG using the current time, that’s fairly bad practice, especially if it’s for any sort of cryptographic purpose. For general purposes, seed a good RNG once before first use, using /dev/urandom, then just keep pulling values from it as needed. Or, if cryptographic quality is required, use an actual crypto library’s RNG.
- Jonathan Morton
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
Jonathan Morton
2015-03-23 01:34:48 UTC
Post by Dave Taht
I have long maintained it was possible to build a better fq_codel-like
policer without doing htb rate shaping, ("bobbie"), and I am tempted
to give it a go in the coming months.
I have a hazy picture in my mind, now, of how it could be made to work.

A policer doesn’t actually maintain a queue, but it is possible to calculate when the currently-arriving packet would be scheduled for sending if a shaped FIFO was present, in much the same way that cake actually performs such scheduling at the head of a real queue. The difference between that time and the current time is a virtual sojourn time which can be fed into the Codel algorithm. Then, when Codel says to drop a packet, you do so.

Because there’s no queue management, timer interrupts nor flow segregation, the overhead should be significantly lower than an actual queue. And there’s a reasonable hope that involving Codel will give better results than either a brick-wall or a token bucket.

- Jonathan Morton
David Lang
2015-03-23 01:45:51 UTC
Post by Jonathan Morton
Post by Dave Taht
I have long maintained it was possible to build a better fq_codel-like
policer without doing htb rate shaping, ("bobbie"), and I am tempted
to give it a go in the coming months.
I have a hazy picture in my mind, now, of how it could be made to work.
A policer doesn’t actually maintain a queue, but it is possible to calculate when the currently-arriving packet would be scheduled for sending if a shaped FIFO was present, in much the same way that cake actually performs such scheduling at the head of a real queue. The difference between that time and the current time is a virtual sojourn time which can be fed into the Codel algorithm. Then, when Codel says to drop a packet, you do so.
Because there’s no queue management, timer interrupts nor flow segregation, the overhead should be significantly lower than an actual queue. And there’s a reasonable hope that involving Codel will give better results than either a brick-wall or a token bucket.
are we running into performance issues with fq_codel? I thought all the problems
were with HTB or ingress shaping.

David Lang
Dave Taht
2015-03-23 02:00:23 UTC
htb and ingress shaping are the biggest problems yes.
Post by David Lang
Post by Jonathan Morton
Post by Dave Taht
I have long maintained it was possible to build a better fq_codel-like
policer without doing htb rate shaping, ("bobbie"), and I am tempted
to give it a go in the coming months.
I have a hazy picture in my mind, now, of how it could be made to work.
A policer doesn’t actually maintain a queue, but it is possible to
calculate when the currently-arriving packet would be scheduled for sending
if a shaped FIFO was present, in much the same way that cake actually
performs such scheduling at the head of a real queue. The difference
between that time and the current time is a virtual sojourn time which can
be fed into the Codel algorithm. Then, when Codel says to drop a packet,
you do so.
1) Sorta. See act_police for how it is presently done. The code is
chock full of locks that
I don't think are needed, and very crufty and old.

From a control theory perspective, we are aiming for a target *rate* rather than a target *delay*, but we can use the same codel-like methods to try to achieve that rate - hit flows once over a calculated interval, and decrease the interval if it doesn't work.

on top of that you don't want to brick-wall things, but you do want to identify individual flows and only hit each of them once: so once you go into drop (or mark) mode, hash the 5-tuple, store the tuple + time in a small ring buffer, and hit a bunch of flows, but never 2 packets from the same flow in a row.

keep a longer term rate (bytes/200ms) and a short term rate (say
bytes/20ms) to bracket what you need,
keep track of how much you exceeded the set rate for how long...

and it seemed doable to do this without inducing delays, "bobbing" up and down around the set rate at a level that would not induce much extra latency on the download on reasonable timescales.

Still seemed like a lot of work for potentially no gain, at the time I was thinking about it hard. Couldn't figure out how to stably grow and shrink the size of the needed ring buffer to hit flows with...

Decided I'd rather go profile where we were going wrong, profiling was
broken, got burned
out on it all.

Thought my first step would be to add an ecn mode to act_police and
see what happened.

2) I'd really like to get rid of the needed act_mirred stuff and be
able to attach a true shaping
qdisc directly to the ingress portion of the qdisc.
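
For reference, the act_mirred arrangement in question is roughly the following (a simplified sketch of what sqm-style ingress shaping sets up today; device names and the rate are only illustrative):

modprobe ifb
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol all prio 10 u32 \
    match u32 0 0 action mirred egress redirect dev ifb0
tc qdisc add dev ifb0 root handle 1: htb default 11
tc class add dev ifb0 parent 1: classid 1:11 htb rate 115mbit
tc qdisc add dev ifb0 parent 1:11 fq_codel

Attaching a real shaping qdisc directly to the ingress hook would eliminate the redirect and the extra device entirely.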

3) /me has more brain cells working than he did 9 months ago.
Post by David Lang
Post by Jonathan Morton
Because there’s no queue management, timer interrupts nor flow
segregation, the overhead should be significantly lower than an actual
queue. And there’s a reasonable hope that involving Codel will give better
results than either a brick-wall or a token bucket.
are we running into performance issues with fq_codel? I thought all the
problems were with HTB or ingress shaping.
David Lang
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
Jonathan Morton
2015-03-23 02:10:52 UTC
are we running into performance issues with fq_codel? I thought all the problems were with HTB or ingress shaping.
Cake is, in part, a response to the HTB problem; it is a few percent more efficient so far than an equivalent HTB+fq_codel combination. It will have a few other novel features, too.

Bobbie is a response to the ingress-shaping problem. A policer (with no queue) can be run without involving an IFB device, which we believe has a large overhead.

- Jonathan Morton
Dave Taht
2015-03-23 02:15:30 UTC
Post by Jonathan Morton
are we running into performance issues with fq_codel? I thought all the problems were with HTB or ingress shaping.
Cake is, in part, a response to the HTB problem; it is a few percent more efficient so far than an equivalent HTB+fq_codel combination. It will have a few other novel features, too.
Bobbie is a response to the ingress-shaping problem. A policer (with no queue) can be run without involving an IFB device, which we believe has a large overhead.
yep.

3rd option is to go make better hardware. Not getting very far on this
https://www.kickstarter.com/projects/onetswitch/onetswitch-open-source-hardware-for-networking

going to have to try harder. Honestly had hoped that broadcom, qca,
cisco, or somebody making chips would have got it right by now.
Post by Jonathan Morton
- Jonathan Morton
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
Dave Taht
2015-03-23 02:18:58 UTC
Post by Dave Taht
Post by Jonathan Morton
are we running into performance issues with fq_codel? I thought all the problems were with HTB or ingress shaping.
Cake is, in part, a response to the HTB problem; it is a few percent more efficient so far than an equivalent HTB+fq_codel combination. It will have a few other novel features, too.
And we have a much more efficient version of codel, which may help a
bit also. Still, really need to get back to profiling *the entire
path* on these itty bitty platforms. Look forward to getting the FIB
patches working on them, but that's a hairy lot of patches to
backport. Might be easier

wasn't really planning on making chaos calmer's freeze
Post by Dave Taht
Post by Jonathan Morton
Bobbie is a response to the ingress-shaping problem. A policer (with no queue) can be run without involving an IFB device, which we believe has a large overhead.
yep.
3rd option is to go make better hardware. Not getting very far on this
https://www.kickstarter.com/projects/onetswitch/onetswitch-open-source-hardware-for-networking
going to have to try harder. Honestly had hoped that broadcom, qca,
cisco, or somebody making chips would have got it right by now.
Post by Jonathan Morton
- Jonathan Morton
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!
https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
Sebastian Moeller
2015-03-23 06:09:04 UTC
Hi Jonathan, hi Dave,
Post by David Lang
are we running into performance issues with fq_codel? I thought all
the problems were with HTB or ingress shaping.
Cake is, in part, a response to the HTB problem; it is a few percent
more efficient so far than an equivalent HTB+fq_codel combination. It
will have a few other novel features, too.
Bobbie is a response to the ingress-shaping problem. A policer (with
no queue) can be run without involving an IFB device, which we believe
has a large overhead.
This is testable, if nobody beats me to it I will try this week. The main idea is to replace the ingress shaping on ge00 with egress shaping on the interface between the client and the router, so most likely se00 in cerowrt. This should effectively behave as the current sqm setup with ingress shaping, though only for hosts on se00. If IFB truly is costly, this setup should show better bandwidth use in rrul tests than the default. It obviously degrades local performance of se00 and hence is not a true solution unless one is happy to fully dedicate a box as shaper ;)
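
A minimal sketch of that se00 egress shaper (assuming the same htb + fq_codel structure sqm's simplest.qos uses; the rate below is just a placeholder for the usual downlink setting, and the normal sqm egress shaper stays on ge00):

DOWNLINK=98407   # kbit/s, substitute the rate normally used for ingress shaping
tc qdisc del dev se00 root 2>/dev/null
tc qdisc add dev se00 root handle 1: htb default 11
tc class add dev se00 parent 1: classid 1:11 htb rate ${DOWNLINK}kbit
tc qdisc add dev se00 parent 1:11 fq_codel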

Best Regards
Sebastian
- Jonathan Morton
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Jonathan Morton
2015-03-23 13:43:34 UTC
Post by Sebastian Moeller
It obviously degrades local performance of se00 and hence is not a true solution unless one is happy to fully dedicate a box as shaper ;)
Dedicating a box as a router/shaper isn’t so much of a problem, but shaping traffic between wired and wireless - and sharing the incoming WAN bandwidth between them, too - is. It’s a valid test, though, for this particular purpose.

- Jonathan Morton
Sebastian Moeller
2015-03-23 16:09:41 UTC
Hi Jonathan,
Post by Sebastian Moeller
It obviously degrades local performance of se00 and hence is not a true solution unless one is happy to fully dedicate a box as shaper ;)
Dedicating a box as a router/shaper isn’t so much of a problem, but shaping traffic between wired and wireless - and sharing the incoming WAN bandwidth between them, too - is.
Exactly the sentiment I had, but less terse and actually understandable ;)
It’s a valid test, though, for this particular purpose.
Once I get around to testing it, I should be able to share some numbers…

Best Regards
Sebastian
- Jonathan Morton
Sebastian Moeller
2015-03-24 00:00:44 UTC
Hi Jonathan, hi List,


So I got around to a bit of rrul testing of the dual egress idea to assess the cost of IFB, but the results are complicated (so most likely I screwed up). On a wndr3700v2 on a 100Mbps/40Mbps link I get the following (excuse the two images; either the plot is intelligible or the legend...):
Dave Taht
2015-03-24 00:05:21 UTC
this is with cero or last weekend's build?
Post by Sebastian Moeller
Hi Jonathan, hi List,
Only in the case of shaping the total bandwidth to the ~70Mbps this router can barely do can I see an effect of the dual egress instead of the IFB-based ingress shaper. So column 7 (ipv4) and column 8 (ipv6) are larger than columns 9 (ipv4) and 10 (ipv6), showing that with dual egress instead of egress plus ingress the effective upload increases by < 10 Mbps (while download and latency stay unaffected). That is not bad, but it also does not look like the IFB is the cost driver in sqm-scripts, or does it? Also, as a corollary of the data, I would say my old interpretation that we hit a limit at ~70Mbps combined traffic might not be correct, in that ingress and egress might carry slightly different costs, but then this difference is not going to make a wndr punch way above its weight…
Best Regards
Sebastian
Post by Sebastian Moeller
Hi Jonathan,
Post by Sebastian Moeller
It obviously degrades local performance of se00 and hence is not a true solution unless one is happy to fully dedicate a box as shaper ;)
Dedicating a box as a router/shaper isn’t so much of a problem, but shaping traffic between wired and wireless - and sharing the incoming WAN bandwidth between them, too - is.
Exactly the sentiment I had, but less terse and actually understandable ;)
It’s a valid test, though, for this particular purpose.
Once I get around to testing it, I should be able to share some numbers…
Best Regards
Sebastian
- Jonathan Morton
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
Sebastian Moeller
2015-03-24 00:07:41 UTC
Hi Dave,
Post by Dave Taht
this is with cero or last weekend's build?
Oops, forgot to mention, this is with cerowrt 3.10.50-1, I only have one router and have not dared to switch to the new shiny (unstable?) thing yet. The test was going over se00 from a machine that should be able to deliver >= 100Mbps symmetric.

Best Regards
Sebastian
Post by Dave Taht
Post by Sebastian Moeller
Hi Jonathan, hi List,
Only in the case of shaping the total bandwidth to the ~70Mbps this router can barely do can I see an effect of the dual egress instead of the IFB-based ingress shaper. So column 7 (ipv4) and column 8 (ipv6) are larger than columns 9 (ipv4) and 10 (ipv6), showing that with dual egress instead of egress plus ingress the effective upload increases by < 10 Mbps (while download and latency stay unaffected). That is not bad, but it also does not look like the IFB is the cost driver in sqm-scripts, or does it? Also, as a corollary of the data, I would say my old interpretation that we hit a limit at ~70Mbps combined traffic might not be correct, in that ingress and egress might carry slightly different costs, but then this difference is not going to make a wndr punch way above its weight…
Best Regards
Sebastian
Post by Sebastian Moeller
Hi Jonathan,
Post by Sebastian Moeller
It obviously degrades local performance of se00 and hence is not a true solution unless one is happy to fully dedicate a box as shaper ;)
Dedicating a box as a router/shaper isn’t so much of a problem, but shaping traffic between wired and wireless - and sharing the incoming WAN bandwidth between them, too - is.
Exactly the sentiment I had, but less terse and actually understandable ;)
It’s a valid test, though, for this particular purpose.
Once I get around to testing it, I should be able to share some numbers…
Best Regards
Sebastian
- Jonathan Morton
--
Dave Täht
Let's make wifi fast, less jittery and reliable again!
https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
Jonathan Morton
2015-03-24 03:16:19 UTC
Post by Sebastian Moeller
So I got around to a bit of rrul testing of the dual egress idea to assess the cost of IFB, but the results are complicated (so most likely I screwed up).
IFB is normally used on the download direction (as a substitute for a lack of AQM at the ISP), so that’s the one which matters. Can you try a unidirectional test which exercises only the download direction? This should get the clearest signal - without CPU-load interference from the upload direction.

- Jonathan Morton
Sebastian Moeller
2015-03-24 07:47:48 UTC
Hi Jonathan,
Post by Jonathan Morton
Post by Sebastian Moeller
So I got around to a bit of rrul testing of the dual egress idea to assess the cost of IFB, but the results are complicated (so most likely I screwed up).
IFB is normally used on the download direction (as a substitute for a lack of AQM at the ISP), so that’s the one which matters. Can you try a unidirectional test which exercises only the download direction?
I will try to get around to this later this week, not sure whether I manage though.
Post by Jonathan Morton
This should get the clearest signal - without CPU-load interference from the upload direction.
I agree, but if IFB redirection truly is costly enough to bother fixing/avoiding, it should also cause a noticeable effect on the full ingress-egress stress test, I would assume. But at least for my limited tests it did not… Or to put it differently, if avoiding the IFB does not increase bandwidth use under full load, it is not going to help improve a router’s combined shaping performance, or do I see something wrong? Now maybe it is a critical building block for better performance that is masked at full load by something else; that is why I tried the reduced bandwidth loads (35000 bidirectional), but even there the effect was rather mild… That said, I will retry with download shaping only (via se00 egress) and simplest.qos (instead of simple.qos) to move the heavy filtering out of the way. I wonder whether anybody has a good idea of how to measure the router’s cpu usage during a rrul test (maybe the main effect of avoiding IFB is not to increase bandwidth usage, but to free up cpu cycles for performing other tasks, which still would be quite valuable, I guess)
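
One low-tech way to get CPU numbers during a run (assuming the stock busybox tools on cerowrt) is to sample top in batch mode for the length of the test and look at the idle/sirq columns afterwards:

top -b -d 5 -n 60 > /tmp/cpu_during_rrul.log &
# 60 samples at 5s intervals covers a 300s rrul run; inspect the CPU: lines afterwards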

Best Regards
Sebastian
Post by Jonathan Morton
- Jonathan Morton
Jonathan Morton
2015-03-24 08:13:43 UTC
What I'm seeing on your first tests is that double egress gives you
slightly more download at the expense of slightly less upload throughput.
The aggregate is higher.

Your second set of tests tells me almost nothing, because it exercises the
upload more and the download less. Hence why I'm asking for effectively the
opposite test. The aggregate is still significantly higher with double
egress, though.

The ping numbers also tell me that there's no significant latency penalty
either way. Even when CPU saturated, it's still effectively controlling the
latency better than leaving the pipe open.

- Jonathan Morton
Sebastian Moeller
2015-03-24 08:46:34 UTC
Hi Jonathan,
What I'm seeing on your first tests is that double egress gives you slightly more download at the expense of slightly less upload throughput. The aggregate is higher.
But it is only slightly higher, and the uplink is not really saturated, it should be at around 30Mbps not 10 Mbps. Just to be clear we are talking about the same data: columns 5 and 6 are without sqm running, showing the download approaching the theoretical maximum, but pummeling the upload while doing so, cerowrt 3.10.50-1 might hit other limits besides shaping that limit the performance in this situation; the initial double egress columns are 3 and 4.
Your second set of tests tells me almost nothing, because it exercises the upload more and the download less.
Why nothing? By reducing the total load on the shaper we can now see that avoiding the IFB overhead increases the upload by roughly 10Mbps, almost doubling it. IMHO this is the only real proof that ingress shaping via IFB has some cost associated with it. I realize that the IFB directly only affects the download direction, but in this case the recovered CPU cycles seem to be used to increase the upload performance (which, due to using simple.qos instead of simplest.qos, was costlier than it needed to be).
Hence why I'm asking for effectively the opposite test. The aggregate is still significantly higher with double egress, though.
I have not performed any statistical test, so I am talking about subjective significance here, but I bet the significance of the upload increase in the dual_symmetric_egress case (columns 7 and 8) is going to be higher than the significance of the download increase in the double_egress situation (columns 3 and 4: larger mean difference, similar variance, from quick visual inspection ;) )
The ping numbers also tell me that there's no significant latency penalty either way.
Yes, the ping numbers are pretty nice, but even with SQM disabled the worst case latency is not too bad (not sure why though, I remember it being much worse). Also the IPv6 RTT to Sweden seems consistently better than the IPv4 RTT; but then again this is a DTAG link, with DTAG’s new network being designed in Sweden, so maybe this is piggy-backing on their testing set-up or so ;)
Even when CPU saturated, it's still effectively controlling the latency better than leaving the pipe open.
Many thanks for the discussion & Best Regards
Sebastian
- Jonathan Morton
Sebastian Moeller
2015-03-29 01:14:58 UTC
Hi Jonathan,

TL;DR: I do not think my measurements show that ingress handling via IFB is so costly (< 5% bandwidth) that avoiding it will help much. I do not think that conclusion will change much if more data is acquired (and I do not intend to collect more ;) ) Also, the current diffserv implementation costs around 5% bandwidth. Bandwidth cost here means that the total bandwidth at full saturation is reduced by that amount (with the total under load being ~90Mbps up+down, while the sum of max bandwidth up+down without load is ~115Mbps); latency under load seems not to suffer significantly once the router runs out of CPU though, which is nice ;)
What I'm seeing on your first tests is that double egress gives you slightly more download at the expense of slightly less upload throughput. The aggregate is higher.
Your second set of tests tells me almost nothing, because it exercises the upload more and the download less. Hence why I'm asking for effectively the opposite test.
Since netperf-wrapper currently does not have rrul_down_only and rrul_up_only tests (and I have not enough time/skill to code these tests) I opted for using Rich Brown’s nice wrapper scripts from the Ceroscripts repository. There still is a decent report of the “fate” of the concurrent ICMP probe, but no fancy graphs or sparse UDP streams; for our question this should be sufficient...
Here is the result for dual egress with simplest.qos from a client connected to cerowrt’s se00:

simplest.qos: IPv6, download and upload sequentially:
***@happy-horse:~/CODE/CeroWrtScripts> ./betterspeedtest.sh -6 -H netperf-eu.bufferbloat.net -t 150 -p netperf-eu.bufferbloat.net -n 4
2015-03-29 00:23:27 Testing against netperf-eu.bufferbloat.net (ipv6) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 85.35 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 37.900
10pct: 38.600
Median: 40.100
Avg: 40.589
90pct: 43.600
Max: 47.500
......................................................................................................................................................
Upload: 32.73 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 37.300
10pct: 37.800
Median: 38.200
Avg: 38.513
90pct: 39.400
Max: 47.400

simplest.qos: IPv6, download and upload simultaneously:
***@happy-horse:~/CODE/CeroWrtScripts> ./netperfrunner.sh -6 -H netperf-eu.bufferbloat.net -t 150 -p netperf-eu.bufferbloat.net -n 4
2015-03-29 00:30:40 Testing netperf-eu.bufferbloat.net (ipv6) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 81.42 Mbps
Upload: 9.33 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 37.500
10pct: 38.700
Median: 39.500
Avg: 39.997
90pct: 42.000
Max: 45.500

simplest.qos: IPv4, download and upload sequentially:
***@happy-horse:~/CODE/CeroWrtScripts> ./betterspeedtest.sh -4 -H netperf-eu.bufferbloat.net -t 150 -p netperf-eu.bufferbloat.net -n 4 ; ./netperfrunner.sh -4 -H netperf-eu.bufferbloat.net -t 150 -p netperf-eu.bufferbloat.net -n 4
2015-03-29 00:33:52 Testing against netperf-eu.bufferbloat.net (ipv4) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 86.52 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 49.300
10pct: 50.300
Median: 51.400
Avg: 51.463
90pct: 52.700
Max: 54.500
......................................................................................................................................................
Upload: 33.45 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 49.300
10pct: 49.800
Median: 50.100
Avg: 50.161
90pct: 50.600
Max: 52.400

simplest.qos: IPv4, download and upload simultaneously:
2015-03-29 00:38:53 Testing netperf-eu.bufferbloat.net (ipv4) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 84.21 Mbps
Upload: 6.45 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 49.300
10pct: 50.000
Median: 51.100
Avg: 51.302
90pct: 52.700
Max: 56.100
The IPv6 route to Sweden is still 10ms shorter than the IPv4 one, no idea why, but I do not complain ;)


And again the same with simple.qos (in both directions) to assess the cost of our diffserv implementation:

simple.qos: IPv6, download and upload sequentially:
2015-03-29 00:44:06 Testing against netperf-eu.bufferbloat.net (ipv6) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 82.8 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 37.600
10pct: 38.500
Median: 39.600
Avg: 40.068
90pct: 42.600
Max: 47.900
......................................................................................................................................................
Upload: 32.8 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 37.300
10pct: 37.700
Median: 38.100
Avg: 38.256
90pct: 38.700
Max: 43.400
Compared to simplest.qos we lose about 2.5Mbps in downlink, not too bad, and nothing in the uplink tests, but there the wndr3700v2 is not yet out of oomph...

simple.qos: IPv6, download and upload simultaneously:
2015-03-29 00:49:07 Testing netperf-eu.bufferbloat.net (ipv6) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 77.2 Mbps
Upload: 9.43 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 37.800
10pct: 38.500
Median: 39.500
Avg: 40.133
90pct: 42.200
Max: 51.500
But here in the full saturating case we already “pay” 4Mbps, still not bad, but it looks like all the small changes add up (and I should try to find the time to look at all the tc filtering we do)

simple.qos: IPv4, download and upload sequentially:
2015-03-29 00:51:37 Testing against netperf-eu.bufferbloat.net (ipv4) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 84.28 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 49.400
10pct: 50.100
Median: 51.100
Avg: 51.253
90pct: 52.500
Max: 54.900
......................................................................................................................................................
Upload: 33.42 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 49.400
10pct: 49.800
Median: 50.100
Avg: 50.170
90pct: 50.600
Max: 51.800

simple.qos: IPv4, download and upload simultaneously:
2015-03-29 00:56:38 Testing netperf-eu.bufferbloat.net (ipv4) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 81.08 Mbps
Upload: 6.73 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 49.300
10pct: 50.100
Median: 51.100
Avg: 51.234
90pct: 52.500
Max: 56.200
Again this hammers the real upload while leaving the download mainly intact


And for fun/completeness’ sake, the same for the standard setup with activated IFB and ingress shaping on pppoe-ge00:

simplest.qos: IPv6, download and upload sequentially:
2015-03-29 01:18:13 Testing against netperf-eu.bufferbloat.net (ipv6) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 82.76 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 37.800
10pct: 38.900
Median: 40.100
Avg: 40.590
90pct: 43.200
Max: 47.000
......................................................................................................................................................
Upload: 32.86 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 37.300
10pct: 37.700
Median: 38.100
Avg: 38.273
90pct: 38.700
Max: 43.200
So 85.35-82.76 = 2.59 Mbps cost for the IFB, comparable to the cost of our diffserv implementation.

simplest.qos: IPv6, download and upload simultaneously:
2015-03-29 01:23:14 Testing netperf-eu.bufferbloat.net (ipv6) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 78.61 Mbps
Upload: 10.53 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 37.700
10pct: 39.100
Median: 40.200
Avg: 40.509
90pct: 42.400
Max: 46.200
Weird, IFB still costs 81.42-78.61 = 2.81 Mbps, but the uplink improved by 10.53-9.33 = 1.2 Mbps, reducing the IFB cost to 1.61 Mbps...

simplest.qos: IPv4, download and upload sequentially:
2015-03-29 01:25:44 Testing against netperf-eu.bufferbloat.net (ipv4) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 84.06 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 49.700
10pct: 50.500
Median: 51.900
Avg: 51.866
90pct: 53.100
Max: 55.400
......................................................................................................................................................
Upload: 33.45 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 49.500
10pct: 49.800
Median: 50.200
Avg: 50.219
90pct: 50.600
Max: 52.100
Again IFB usage costs us 86.52-84.06 = 2.46 Mbps...

simplest.qos: IPv4, download and upload simultaneously:
2015-03-29 01:30:45 Testing netperf-eu.bufferbloat.net (ipv4) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 78.97 Mbps
Upload: 8.14 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 49.300
10pct: 50.300
Median: 51.500
Avg: 51.906
90pct: 53.700
Max: 71.700
And again download IFB cost 84.21-78.97 = 5.24 Mbps, with upload recovery 8.14-6.45 = 1.69 Mbps, so total IFB cost here is 5.24-1.69 = 3.55 Mbps or (100*3.55 /(78.97+8.14)) = 4.1%; not great, but certainly not enough, if regained, to qualify this hardware for higher bandwidth tiers. Latency Max got noticeably worse, but up to 90pct the increase is just a few milliseconds...


And for simple.qos with IFB-based ingress shaping:

simple.qos: IPv6, download and upload sequentially:
2015-03-29 01:49:36 Testing against netperf-eu.bufferbloat.net (ipv6) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 80.24 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 37.700
10pct: 38.500
Median: 39.700
Avg: 40.285
90pct: 42.700
Max: 46.500
......................................................................................................................................................
Upload: 32.66 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 39.700
10pct: 40.300
Median: 40.600
Avg: 40.694
90pct: 41.200
Max: 43.200
IFB+diffserv cost: 85.35-80.24 = 5.11Mbps download only, upload is not CPU bound and hence not affected...

simple.qos: IPv6, download and upload simultaneously:
2015-03-29 01:54:37 Testing netperf-eu.bufferbloat.net (ipv6) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 73.68 Mbps
Upload: 10.32 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 40.300
10pct: 41.300
Median: 42.400
Avg: 42.904
90pct: 45.500
Max: 50.800
IFB+diffserv cost = (81.42 - 73.68) + ( 9.33 -10.32) = 6.75Mbps or 100*6.75/(73.68+10.32) = 8%, still not enough to qualify the wndr3700 to the nominally 100/40 Mbps tier I am testing, but enough to warrant looking at improving diffserv (or better yet switching to cake?)

simple.qos: IPv4, download and upload sequentially:
2015-03-29 01:57:07 Testing against netperf-eu.bufferbloat.net (ipv4) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 82.3 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 51.900
10pct: 52.800
Median: 53.700
Avg: 53.922
90pct: 55.000
Max: 60.300
......................................................................................................................................................
Upload: 33.43 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 51.800
10pct: 52.300
Median: 52.600
Avg: 52.657
90pct: 53.000
Max: 54.200
IFB+diffserv cost: 86.52-82.3 = 4.22 Mbps download only, upload is not CPU bound and hence not affected...

simple.qos: IPv4, download and upload simultaneously:
2015-03-29 03:02:08 Testing netperf-eu.bufferbloat.net (ipv4) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 76.71 Mbps
Upload: 7.94 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 51.700
10pct: 52.500
Median: 53.900
Avg: 54.145
90pct: 56.100
Max: 59.700
IFB+diffserv cost = (81.42 - 76.71) + ( 9.33 -7.94) = 6.1 Mbps or 100*6.1/(76.71+7.94) = 7.2%.
The aggregate is still significantly higher with double egress, though.
The ping numbers also tell me that there's no significant latency penalty either way. Even when CPU saturated, it's still effectively controlling the latency better than leaving the pipe open.
Yes, that is a pretty nice degradation mode. Now if only the upload did not have to bear the brunt of the lacking CPU cycles…


Best Regards
Sebastian
- Jonathan Morton
Jonathan Morton
2015-03-29 06:17:13 UTC
Post by Sebastian Moeller
I do not think my measurements show that ingress handling via IFB is so costly (< 5% bandwidth) that avoiding it will help much.
Also, the current diffserv implementation costs around 5% bandwidth.
That’s useful information. I may be able to calibrate that against similar tests on other hardware.

But presumably, if you remove the ingress shaping completely, it can then handle full line rate downstream? What’s the comparable overhead figure for that? You see, if we were to use a policer instead of ingress shaping, we’d not only be getting IFB and ingress Diffserv mangling out of the way, but HTB as well.

- Jonathan Morton
Sebastian Moeller
2015-03-29 11:16:26 UTC
Hi Jonathan
Post by Jonathan Morton
Post by Sebastian Moeller
I do not think my measurements show that ingress handling via IFB is so costly (< 5% bandwidth) that avoiding it will help much.
Also, the current diffserv implementation costs around 5% bandwidth.
That’s useful information. I may be able to calibrate that against similar tests on other hardware.
But presumably, if you remove the ingress shaping completely, it can then handle full line rate downstream? What’s the comparable overhead figure for that?
Without further ado:

***@happy-horse:~/CODE/CeroWrtScripts> ./betterspeedtest.sh -6 -H netperf-eu.bufferbloat.net -t 150 -p netperf-eu.bufferbloat.net -n 4 ; ./netperfrunner.sh -6 -H netperf-eu.bufferbloat.net -t 150 -p netperf-eu.bufferbloat.net -n 4 ; ./betterspeedtest.sh -4 -H netperf-eu.bufferbloat.net -t 150 -p netperf-eu.bufferbloat.net -n 4 ; ./netperfrunner.sh -4 -H netperf-eu.bufferbloat.net -t 150 -p netperf-eu.bufferbloat.net -n 4
2015-03-29 09:49:00 Testing against netperf-eu.bufferbloat.net (ipv6) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 91.68 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 39.600
10pct: 44.300
Median: 52.800
Avg: 53.230
90pct: 60.400
Max: 98.700
.......................................................................................................................................................
Upload: 34.72 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 39.400
10pct: 39.700
Median: 40.200
Avg: 44.311
90pct: 43.600
Max: 103.000
2015-03-29 09:54:01 Testing netperf-eu.bufferbloat.net (ipv6) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 91.03 Mbps
Upload: 8.79 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 40.200
10pct: 45.900
Median: 53.100
Avg: 53.019
90pct: 59.900
Max: 80.100
2015-03-29 09:56:32 Testing against netperf-eu.bufferbloat.net (ipv4) with 4 simultaneous sessions while pinging netperf-eu.bufferbloat.net (150 seconds in each direction)
......................................................................................................................................................
Download: 93.48 Mbps
Latency: (in msec, 150 pings, 0.00% packet loss)
Min: 51.900
10pct: 56.600
Median: 60.800
Avg: 62.473
90pct: 69.800
Max: 87.900
.......................................................................................................................................................
Upload: 35.23 Mbps
Latency: (in msec, 151 pings, 0.00% packet loss)
Min: 51.900
10pct: 52.200
Median: 52.600
Avg: 65.873
90pct: 108.000
Max: 116.000
2015-03-29 10:01:33 Testing netperf-eu.bufferbloat.net (ipv4) with 4 streams down and up while pinging netperf-eu.bufferbloat.net. Takes about 150 seconds.
Download: 93.2 Mbps
Upload: 5.69 Mbps
Latency: (in msec, 152 pings, 0.00% packet loss)
Min: 51.900
10pct: 56.100
Median: 60.400
Avg: 61.568
90pct: 67.200
Max: 93.100


Note that I shaped the connection at upstream 95%: 35844 of 37730 Kbps, and downstream at 90%: 98407 of 109341 Kbps; the line has 16 bytes of per-packet overhead and uses pppoe, so the MTU is 1492 and the on-wire packet size itself is 1500 + 14 + 16 = 1530 (I am also not sure whether the 4 byte ethernet frame check sequence is transmitted and needs accounting for, so I just left it out), with TCP over IPv4 adding 40 bytes of header overhead, and TCP over IPv6 adding 60 bytes.

So without SQM I expect:
IPv6:
Upstream: (((1500 - 8 - 40 -20) * 8) * (37730 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 35313.3 Kbps; measured: 34.72 Mbps
Downstream: (((1500 - 8 - 40 -20) * 8) * (109341 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 102337.5 Kbps; measured: 91.68 Mbps (but this is known to be too optimistic, as DTAG currently subtracts ~7% for G.INP error correction at the BRAS level)

IPv4:
Upstream: (((1500 - 8 - 20 -20) * 8) * (37730 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 35806.5 Kbps; measured: 35.23 Mbps
Downstream: (((1500 - 8 - 20 -20) * 8) * (109341 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 103766.8 Kbps; measured: 93.48 Mbps (but this is known to be too optimistic, as DTAG currently subtracts ~7% for G.INP error correction at the BRAS level)

So the upstream throughput comes pretty close, but downstream is off; due to the unknown G.INP “reservation”/BRAS throttle I do not have a good expectation of what this value should be.


And with SQM I expect:
IPv6 (simplest.qos):
Upstream: (((1500 - 8 - 40 -20) * 8) * (35844 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 33548.1 Kbps; measured: 32.73 Mbps (dual egress); 32.86 Mbps (IFB ingress)
Downstream: (((1500 - 8 - 40 -20) * 8) * (98407 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 92103.8 Kbps; measured: 85.35 Mbps (dual egress); 82.76 Mbps (IFB ingress)

IPv4 (simplest.qos):
Upstream: (((1500 - 8 - 20 -20) * 8) * (35844 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 34016.7 Kbps; measured: 33.45 Mbps (dual egress); 33.45 Mbps (IFB ingress)
Downstream: (((1500 - 8 - 20 -20) * 8) * (98407 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 93390.2 Kbps; measured: 86.52 Mbps (dual egress); 84.06 Mbps (IFB ingress)

So with our shaper, we stay a bit short of the theoretical values, but the link was not totally quiet, so I expect some losses compared to the theoretical values.
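
For convenience, the arithmetic above can be factored into a tiny shell helper (same assumptions as above: 1530 bytes on the wire per 1500 byte PPPoE frame, with the PPPoE+IP+TCP header bytes passed in per protocol):

goodput() {   # goodput <shaped rate in kbit/s> <pppoe+ip+tcp header bytes>
    echo "scale=1; (1500 - $2) * $1 / (1500 + 14 + 16)" | bc
}
goodput 98407 68   # IPv6: 8 + 40 + 20 -> 92103.8 kbit/s
goodput 98407 48   # IPv4: 8 + 20 + 20 -> ~93390 kbit/s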
Post by Jonathan Morton
You see, if we were to use a policer instead of ingress shaping, we’d not only be getting IFB and ingress Diffserv mangling out of the way, but HTB as well.
But we still would run HTB for egress I assume, and the current results with policers Dave hinted at do not seem like good candidates for replacing shaping…

Best Regards
Sebastian
Post by Jonathan Morton
- Jonathan Morton
Jonathan Morton
2015-03-29 12:48:10 UTC
Okay, so it looks like you get another 5% without any shaping running. So in summary:

- With no shaping at all, the router is still about 10% down compared to downstream line rate.
- Upstream is fine *if* unidirectional. The load of servicing downstream traffic hurts upstream badly.
- Turning on HTB + fq_codel loses you 5%.
- Using ingress filtering via IFB loses you another 5%.
- Mangling the Diffserv field loses you yet another 5%.

Those 5% penalties add up. People might grudgingly accept a 10% loss of bandwidth to be sure of lower latency, and faster hardware would do better than that, but losing 25% is a bit much.

I should be able to run similar tests through my Pentium-MMX within a couple of days, so we can see whether I get similar overhead numbers out of that; I can even try plugging in your shaping settings, since they’re (just) within the line rate of the 100baseTX cards installed in it. I could also compare cake’s throughput to that of HTB + fq_codel; I’ve already seen an improvement with older versions of cake, but I want to see what the newest version gets too.

Come to think of it, I should probably try swapping the same cards into a faster machine as well, to see how much they influence the result.
Post by Sebastian Moeller
Post by Jonathan Morton
You see, if we were to use a policer instead of ingress shaping, we’d not only be getting IFB and ingress Diffserv mangling out of the way, but HTB as well.
But we still would run HTB for egress I assume, and the current results with policers Dave hinted at do not seem like good candidates for replacing shaping…
The point of this exercise was to find out whether a theoretical, ideal policer on ingress might - in theory, mind - give a noticeable improvement of efficiency and thus throughput.

The existing policers available are indeed pretty unsuitable, as Dave’s tests proved, but there may be a way to do better by adapting AQM techniques to the role. In particular, Codel’s approach of gradually increasing a sparse drop rate seems like it would work better than the “brick wall” imposed by a plain token bucket.

Your results suggest that investigating this possibility might still be worthwhile. Whether anything will come of it, I don’t know.

- Jonathan Morton
Sebastian Moeller
2015-03-29 14:16:48 UTC
Hi Jonathan,
Post by Jonathan Morton
- With no shaping at all, the router is still about 10% down compared to downstream line rate.
Yes, roughly, but as I tried to explain, this is a known quirk currently with VDSL2 lines of DTAG using vectoring: somehow the worst case overhead of the G.INP error correction is subtracted from the shaped rate at the BRAS, so this ~10% is non-recoverable from the user side (it is confusing though).
Post by Jonathan Morton
- Upstream is fine *if* unidirectional. The load of servicing downstream traffic hurts upstream badly.
Yes, that is a theme though all the different variants of the tests I did.
Post by Jonathan Morton
- Turning on HTB + fq_codel loses you 5%.
I assume that this partly is caused by the need to shape below the physical link bandwidth, it might be possible to get closer to the limit (if the true bottleneck bandwidth is known, but see above).
Post by Jonathan Morton
- Using ingress filtering via IFB loses you another 5%.
- Mangling the Diffserv field loses you yet another 5%.
Those 5% penalties add up. People might grudgingly accept a 10% loss of bandwidth to be sure of lower latency, and faster hardware would do better than that, but losing 25% is a bit much.
But IPv4 simple.qos IFB ingress shaping: ingress 82.3 Mbps versus 93.48 Mbps (no SQM) => 100 * 82.3 / 93.48 = 88.04%, so we only lose 12% (for the sum of diffserv classification, IFB ingress shaping and HTB) which seems more reasonable (that or my math is wrong).
But anyway I do not argue that we should not aim at decreasing overheads, but just that even without these overheads we are still a (binary) order of magnitude short of the goal, a shaper that can do up to symmetric 150Mbps shaping let alone Dave’s goal of symmetric 300 Mbps shaping.
Post by Jonathan Morton
I should be able to run similar tests through my Pentium-MMX within a couple of days, so we can see whether I get similar overhead numbers out of that;
That will be quite interesting, my gut feeling is that the percentages will differ considerably between machines/architectures.
Post by Jonathan Morton
I can even try plugging in your shaping settings, since they’re (just) within the line rate of the 100baseTX cards installed in it. I could also compare cake’s throughput to that of HTB + fq_codel; I’ve already seen an improvement with older versions of cake, but I want to see what the newest version gets too.
I really hope to try cake soon ;)
Post by Jonathan Morton
Come to think of it, I should probably try swapping the same cards into a faster machine as well, to see how much they influence the result.
Post by Sebastian Moeller
Post by Jonathan Morton
You see, if we were to use a policer instead of ingress shaping, we’d not only be getting IFB and ingress Diffserv mangling out of the way, but HTB as well.
But we still would run HTB for egress I assume, and the current results with policers Dave hinted at do not seem like good candidates for replacing shaping…
The point of this exercise was to find out whether a theoretical, ideal policer on ingress might - in theory, mind - give a noticeable improvement of efficiency and thus throughput.
I think we only have 12% left on the table and there is a need to keep the shaped/policed ingress rate below the real bottleneck rate with a margin, to keep instances of buffering “bleeding” back into the real bottleneck rare…,
Post by Jonathan Morton
The existing policers available are indeed pretty unsuitable, as Dave’s tests proved, but there may be a way to do better by adapting AQM techniques to the role. In particular, Codel’s approach of gradually increasing a sparse drop rate seems like it would work better than the “brick wall” imposed by a plain token bucket.
Your results suggest that investigating this possibility might still be worthwhile. Whether anything will come of it, I don’t know.
Good point.

Best Regards
Sebastian
Post by Jonathan Morton
- Jonathan Morton
Jonathan Morton
2015-03-29 15:13:22 UTC
Post by Sebastian Moeller
Post by Jonathan Morton
- Turning on HTB + fq_codel loses you 5%.
I assume that this partly is caused by the need to shape below the physical link bandwidth, it might be possible to get closer to the limit (if the true bottleneck bandwidth is known, but see above).
Downstream: (((1500 - 8 - 40 -20) * 8) * (98407 * 1000) / ((1500 + 14 + 16) * 8)) / 1000 = 92103.8 Kbps; measured: 85.35 Mbps (dual egress); 82.76 Mbps (IFB ingress)
I interpret that as meaning: you have set HTB at 98407 Kbps, and after subtracting overheads you expect to get 92103 Kbps goodput. You got pretty close to that on the raw line, and the upstream number gets pretty close to your calculated figure, so I can’t account for the missing 6700 Kbps (7%) due to link capacity simply not being there. HTB, being a token-bucket-type shaper, should compensate for short lulls, so subtle timing effects probably don’t explain it either.
Post by Sebastian Moeller
Post by Jonathan Morton
Those 5% penalties add up. People might grudgingly accept a 10% loss of bandwidth to be sure of lower latency, and faster hardware would do better than that, but losing 25% is a bit much.
But IPv4 simple.qos IFB ingress shaping: ingress 82.3 Mbps versus 93.48 Mbps (no SQM) => 100 * 82.3 / 93.48 = 88.04%, so we only lose 12% (for the sum of diffserv classification, IFB ingress shaping and HTB) which seems more reasonable (that or my math is wrong).
Getting 95% three times leaves you with about 86%, so it’s a useful rule-of-thumb figure. The more precise one (100% - 88.04%^(1/3)) would be about 4.16% per stage.
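
For reference, that per-stage figure is easy to check with bc -l:

echo '1 - e(l(0.8804)/3)' | bc -l   # prints .0415..., i.e. about 4.16% per stage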

However, if the no-SQM throughput is really limited by the ISP rather than the router, then simply adding HTB + fq_codel might have a bigger impact on throughput for someone with a faster service; they would be limited to the same speed with SQM, but might have higher throughput without it. So your measurements really give 5% as a lower bound for that case.
Post by Sebastian Moeller
But anyway I do not argue that we should not aim at decreasing overheads, but just that even without these overheads we are still a (binary) order of magnitude short of the goal, a shaper that can do up to symmetric 150Mbps shaping let alone Dave’s goal of symmetric 300 Mbps shaping.
Certainly, better hardware will perform better. I personally use a decade-old PowerBook for my shaping needs; a 1.5GHz PowerPC 7447 (triple issue, out of order, 512KB+ on-die cache) is massively more powerful than a 680MHz MIPS 24K (single issue, in order, a few KB cache), and it shows when I conduct LAN throughput tests. But I don’t get the chance to push that much data over the Internet.

The MIPS 74K in the Archer C7 v2 is dual issue, out of order; that certainly helps. Multi-core (or at least multi-thread) would probably also help by reducing context switch overhead, and allowing more than one device’s interrupts to get serviced in parallel. I happen to have one router with a MIPS 34K, which is multi-thread, but the basic pipeline is that of the 24K and the clock speed is much lower.

Still, it’s also good to help people get the most out of what they’ve already got. Cake is part of that, but efficiency (by using a simpler shaper than HTB and eliminating one qdisc-to-qdisc interface) is only one of its goals. Ease of configuration, and providing state-of-the-art behaviour, are equally important to me.
Post by Sebastian Moeller
Post by Jonathan Morton
The point of this exercise was to find out whether a theoretical, ideal policer on ingress might - in theory, mind - give a noticeable improvement of efficiency and thus throughput.
I think we only have 12% left on the table and there is a need to keep the shaped/policed ingress rate below the real bottleneck rate with a margin, to keep instances of buffering “bleeding” back into the real bottleneck rare…,
That’s 12% as a lower bound - and that’s already enough to be noticeable in practice. Obviously we can’t be sure of getting all of it back, but we might get enough to bring *you* up to line rate.

- Jonathan Morton

David Lang
2015-03-23 17:08:33 UTC
Permalink
Post by Jonathan Morton
are we running into performance issues with fq_codel? I thought all the problems were with HTB or ingress shaping.
Cake is, in part, a response to the HTB problem; it is a few percent more efficient so far than an equivalent HTB+fq_codel combination. It will have a few other novel features, too.
Bobbie is a response to the ingress-shaping problem. A policer (with no queue) can be run without involving an IFB device, which we believe has a large overhead.
Thanks for the clarification, I hadn't put the pieces together to understand
this.
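
(To make that contrast concrete, the IFB route that sqm-style ingress shaping takes looks roughly like the following. Interface names and the 85mbit rate are placeholders for illustration, not anyone's exact configuration, and this is a sketch rather than the literal sqm-scripts rules:)

  # classic ingress shaping: redirect everything arriving on eth0 through an
  # IFB device, then attach HTB + fq_codel to the IFB as if it were egress
  modprobe ifb numifbs=1
  ip link set dev ifb0 up
  tc qdisc add dev eth0 handle ffff: ingress
  tc filter add dev eth0 parent ffff: protocol all u32 match u32 0 0 \
     action mirred egress redirect dev ifb0
  tc qdisc add dev ifb0 root handle 1: htb default 10
  tc class add dev ifb0 parent 1: classid 1:10 htb rate 85mbit
  tc qdisc add dev ifb0 parent 1:10 fq_codel

A policer, by contrast, hangs a single filter directly off the ingress qdisc and never queues or redirects anything:

  tc filter add dev eth0 parent ffff: protocol all prio 50 u32 match u32 0 0 \
     police rate 85mbit burst 100k drop flowid :1

Every packet the policer admits is forwarded immediately and everything over rate is simply dropped, which is exactly why it needs no IFB, and also why it has none of the AQM behaviour a real queue can provide.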

David Lang
Sebastian Moeller
2015-03-23 16:17:32 UTC
Permalink
Hi Dave,

I take it policing is still not cutting it then, and the “hunt” for a wndr3[7|8]00 replacement is still on? It looks like the archer c7v2 does roughly twice as well as the old cerowrt reference model, a decent improvement, but not yet present-safe let alone future-safe...

Best Regards
Sebastian
Dave Taht
2015-03-23 16:27:04 UTC
Permalink
Post by Sebastian Moeller
Hi Dave,
I take it policing is still not cutting it then
I didn't think it would, but I was looking for an illustrative example
to use as a cluebat on people who think policing works. I have a
string of articles to write about so many different technologies...

... and I'd felt that maybe if I merely added ecn to an existing
policer I'd get a good result, just haven't - like so many things -
got round to it. I do have reasonable hopes for "bobbie", also...
Post by Sebastian Moeller
, and the “hunt” for a wndr3[7|8]00 replacement is still on?
Yep. I figure we're gonna find an x86 box to do the higher end stuff
in the near term, unless one of the new dual a9 boxen works out.
Post by Sebastian Moeller
It looks like the archer c7v2 does roughly twice as well as the old cerowrt reference model, a decent improvement, but not yet present-safe let alone future-safe...
Well, the big part of the upgrade was from linux 3.10 to linux 3.18. I
got nearly 600mbit forwarding rates out of that (up from 340 or so) on
the wndr3800. I have not rebuilt those with the latest code, my goal
is to find *some* platform still being made to use, and the tplink has
the benefit of also doing ac...

IF you have a spare wndr3800 to reflash with what I built Friday, go for it...

I think part of the bonus performance we are also getting out of cake
is in getting rid of a bunch of firewall and tc classification rules.

(New feature request for cake might be to do dscp squashing and get
rid of that rule, the kind of mangle rule sketched below...! I'd like
cake to basically be a drop-in replacement for the sqm scripts.
I wouldn't mind if it ended up being called sqm, rather than cake, in
the long run, with what little branding we have being used. Google for
"cake shaper"
if you want to get a grip on how hard marketing "cake" would be...)
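
(For context, the squashing rule in question is the sort of mangle-table entry the sqm scripts install to wipe inbound DSCP marks; something roughly along these lines, with eth0 standing in for the WAN interface - a sketch, not the exact sqm-scripts rules:)

  # clear DSCP on everything arriving from the WAN so stale upstream
  # markings cannot steer traffic into the wrong priority class
  iptables  -t mangle -A PREROUTING -i eth0 -j DSCP --set-dscp 0
  ip6tables -t mangle -A PREROUTING -i eth0 -j DSCP --set-dscp 0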

--
Dave Täht
Let's make wifi fast, less jittery and reliable again!

https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
David Lang
2015-03-23 17:07:48 UTC
Permalink
Post by Dave Taht
IF you have a spare wndr3800 to reflash with what I built Friday, go for it...
I have a few spare 3800s if some of you developers need one.

unfortunately I don't have a fast connection to test on.

David Lang
Post by Dave Taht
I think part of the bonus performance we are also getting out of cake
is in getting rid of a bunch of firewall and tc classification rules.
(New feature request for cake might be to do dscp squashing and get
rid of that rule...! I'd like cake to basically be a drop in
replacement for the sqm scripts.
I wouldn't mind if it ended up being called sqm, rather than cake, in
the long run, with what little branding we have being used. Google for
"cake shaper"
if you want to get a grip on how hard marketing "cake" would be...)
.
Post by Sebastian Moeller
Best Regards
Sebastian
Post by Dave Taht
so I had discarded policing for inbound traffic management a long
while back due to it not
handling varying RTTs very well, the burst parameter being hard, maybe
even impossible to tune, etc.
And I'd been encouraging other people to try it for a while, with no
luck. So anyway...
1) A piece of good news is - using the current versions of cake and
cake2, that I can get on linux 3.18/chaos calmer, on the archer c7v2
shaping 115mbit download with 12mbit upload... on a cable modem...
with 5% cpu to spare. I haven't tried a wndr3800 yet...
htb + fq_codel ran out of cpu at 94mbit...
2) On the same test rig I went back to try policing. With a 10k burst
parameter, it cut download rates in half...
However, with a 100k burst parameter, on the rrul and tcp_download
tests, at a very short RTT (ethernet) I did get full throughput and
lower latency.
run sqm with whatever settings you want. Then plunk in the right rate
below for your downlink.
tc qdisc del dev eth0 handle ffff: ingress
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip prio 50 u32 match ip
src 0.0.0.0/0 police rate 115000kbit burst 100k drop flowid :1
I don't know how to have it match all traffic, including ipv6
traffic(anyone??), but that was encouraging.
However, the core problem with policing is that it doesn't handle
different RTTs very well, and the exact same settings on a 16ms
path.... cut download throughput by a factor of 10. - set to
115000kbit I got 16mbits on rrul. :(
Still...
I have long maintained it was possible to build a better fq_codel-like
policer without doing htb rate shaping, ("bobbie"), and I am tempted
to give it a go in the coming months. However I tend to think
backporting the FIB patches and making cake run faster might be more
fruitful. (or finding faster hardware)
3) There may be some low hanging fruit in how hostapd operates. Right
now I see it chewing up cpu, and when running, costing 50mbit of
throughput at higher rates, doing something odd, over and over again.
clock_gettime(CLOCK_MONOTONIC, {1240, 843487389}) = 0
recvmsg(12, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000000},
msg_iov(1)=[{"\0\0\1\20\0\25\0\0\0\0\0\0\0\0\0\0;\1\0\0\0\10\0\1\0\0\0\1\0\10\0&"...,
16384}], msg_controllen=0, msg_flags=0}, 0) = 272
clock_gettime(CLOCK_MONOTONIC, {1240, 845060156}) = 0
clock_gettime(CLOCK_MONOTONIC, {1240, 845757477}) = 0
_newselect(19, [3 5 8 12 15 16 17 18], [], [], {3, 928211}) = 1 (in
[12], left {3, 920973})
I swear I'd poked into this and fixed it in cerowrt 3.10, but I guess
I'll have to go poking through the patch set. Something involving
random number obtaining, as best as I recall.
4) I got a huge improvement in p2p wifi tcp throughput between linux
3.18 and linux 3.18 + the minstrel-blues and andrew's minimum variance
patches - a jump of over 30% on the ubnt nanostation m5.
5) Aside from that, so far the archer hasn't crashed on me, but I
haven't tested the wireless much yet on that platform. My weekend's
http://snapon.lab.bufferbloat.net/~cero3/ubnt/ar71xx/
--
Dave TÀht
Let's make wifi fast, less jittery and reliable again!
https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
_______________________________________________
Cerowrt-devel mailing list
https://lists.bufferbloat.net/listinfo/cerowrt-devel
--
Dave TÀht
Let's make wifi fast, less jittery and reliable again!
https://plus.google.com/u/0/107942175615993706558/posts/TVX3o84jjmb
_______________________________________________
Cerowrt-devel mailing list
https://lists.bufferbloat.net/listinfo/cerowrt-devel
Jonathan Morton
2015-03-23 18:16:02 UTC
Permalink
Post by David Lang
I have a few spare 3800s if some of you developers need one.
unfortunantly I don't have a fast connection to test on.
It might be an idea if I had one, since then I could at least reproduce everyone else’s results. Can you reasonably ship to Europe?

I don’t have a fast Internet connection either, but I do have enough computing hardware lying around to set up lab tests at >100Mbps quite well (though I could stand to get hold of a few extra GigE NICs). Verification on a real connection is of course good, but netem should make a reasonable substitute if configured sanely.
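
(For what it's worth, emulating, say, a 16ms path for such a lab run can be as simple as splitting the delay across the two forwarding interfaces of a separate netem middlebox; interface names and numbers here are illustrative only:)

  # 8ms of added delay in each direction on the netem box gives a 16ms RTT
  tc qdisc add dev eth1 root netem delay 8ms limit 1000
  tc qdisc add dev eth2 root netem delay 8ms limit 1000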

- Jonathan Morton