Discussion:
Cero-state this week and last
Dave Taht
2012-04-06 02:27:10 UTC
I attended the IETF meeting in Paris (virtually), particularly the
ccrg and homenet sessions.

I do encourage folks to pay attention to homenet if possible, as
laying out what home networks will look like over the next 10 years
is proving to be a hairball.
ccrg was productive.

Some news:

I have been spending time fixing some infrastructural problems.

1) After being blindsided by more continuous-integration problems in
the last month than in the previous five, I found out that one of the
root causes was that the openwrt build cluster had declined from 8
boxes to 1 (!!), and the time between successful automated builds was
in some cases over a month.

The risk of going from 1 build slave to 0 seemed untenable. So I
sprang into action and scrounged two boxes, which travis has tossed
into the cluster. Someone else volunteered a box.

I am a huge proponent of continuous integration on complex projects.
http://en.wikipedia.org/wiki/Continuous_integration

Building all the components of an OS like openwrt correctly, all the
time, with dozens of developers involved and a minimum delta between
commit, breakage, and fix, is really key to simplifying the
relatively simple task we face at bufferbloat.net: merely layering on
components and fixes that improve the state of the art in networking.

The tgrid is still looking quite bad at the moment.

http://buildbot.openwrt.org:8010/tgrid

There's still a huge backlog of breakage.

But I hope it gets better. Certainly building a full cluster of build
boxes or vms (***@HOME!!) would help a lot more.

If anyone would like to help hardware-wise, or learn more about how
to manage a build cluster using buildbot, please contact travis
<thepeople AT openwrt.org>

2) Bloatlab #1 has been completely rewired and rebuilt, and most of
the routers in there have been reflashed to Cerowrt-3.3.1-2 or later.
They survived some serious network abuse over the last couple of days
(ironically, the only router that crashed was the last rc6 box I had
in the mix - and not due to a network fault! I ran it out of flash
with a logging tool).

To deal with the complexity in there (there's also a sub-lab for some
sdnat and PCP testing), I ended up with a new ipv6 /48 and some better
ways to route that I'll write up soon.

3) I finally got back to fully working builds for the ar71xx
(cerowrt) architecture a few days ago. I also have a working 3.3.1
kernel for the x86_64 build I use to test the server side.
(bufferbloat is NOT just a router problem. Fixing all sides of a
connection helps a lot). That, plus a new iproute2 and the debloat
script, and YOU TOO can experience orders of magnitude less latency....

http://europa.lab.bufferbloat.net/debloat/ has that 3.3.1 kernel for x86_64

Most of the past week has been backwards rather than forwards, but it
was negative in a good way, mostly.

I'm sorry it's been three weeks without a viable build for others to test.

4) today's build: http://huchra.bufferbloat.net/~cero1/3.3/3.3.1-4/

+ Linux 3.3.1 (this is missing the sfq patch I liked, but it's good enough)
+ Working wifi is back
+ No more fiddling with ethtool tx rings (up to 64 from 2. BQL does
this job better)
+ TCP CUBIC is now the default (no longer westwood)
after 15+ years of misplaced faith in delay-based tcp for wireless,
I've collected enough data to convince me that cubic wins. All the
time.
+ alttcp enabled (making it easy to switch)
+ latest netperf from svn (yea! remotely changeable diffserv settings
for a test tool!)

- still horrible dependencies on time. You pretty much have to get on
it and disable rndc validation multiple times, restart ntp multiple
times, and killall named multiple times to get anywhere if you want
working dns inside of 10 minutes.

At this point sometimes I just turn off named in /etc/xinetd.d/named
and turn on port 53 for dnsmasq... but usually, after flashing it the
first time, I wait 10 minutes (to let it clean flash), reboot, and
wait another 10; then it works. Drives me crazy... Once it's up, has
valid time, and is working, dnssec works great, but....
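In dnsmasq terms, the "turn on port 53" half of that workaround is a
one-line config change. (A sketch: port=0 disabling dnsmasq's DNS
service entirely is standard dnsmasq behavior; that this build ships
it that way, and the file path, are my assumptions.)

```
# /etc/dnsmasq.conf (fragment, hypothetical for this build)
# dnsmasq answers no DNS queries while port=0; set 53 to take over from named
port=53
```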

+ way cool new stuff in dnsmasq for ra and AAAA records
- huge dependency on keeping bind in there
- aqm-scripts. I have not succeeded in making hfsc work right. Period.
+ HTB (vs hfsc) is proving far more tractable. SFQRED is scaling
better than I'd dreamed. Maybe eric dreamed this big; I didn't.
- http://www.bufferbloat.net/issues/352
+ Added some essential randomness back into the entropy pool
- hostapd really acts up at high rates with the hack in there for
more entropy (
d***@reed.com
2012-04-06 02:33:47 UTC
A small suggestion.

Create a regression test suite, and require contributors to *pass* the test with each submitted patch set.

Be a damned Nazi about checkins that don't meet this criterion - eliminate the right to check in code for anyone who contributes something that breaks functionality.

Every project leader discovers this. Programmers are *lazy* and refuse to check their inputs unless you shame them into compliance.

-----Original Message-----
From: "Dave Taht" <***@gmail.com>
Sent: Thursday, April 5, 2012 10:27pm
To: cerowrt-***@lists.bufferbloat.net
Subject: [Cerowrt-devel] Cero-state this week and last



Dave Taht
2012-04-06 02:50:50 UTC
Post by d***@reed.com
A small suggestion.
Create a regression test suite, and require contributors to *pass* the test
with each submitted patch set.
A linear complete build of openwrt takes 17 hours on good hardware.
It's hard to build in parallel.

A parallel full build is about 3 hours but requires a bit of monitoring

Incremental package builds are measured in minutes, however...
Post by d***@reed.com
Be damned politically incorrect about checkins that don't meet this criterion - eliminate
the right to check in code for anyone who contributes something that breaks
functionality.
The number of core committers is quite low, too low, at present.
However, the key problem here is that the matrix of potential
breakage is far larger than any one contributor can deal with.

There are:

20 + fairly different cpu architectures *
150+ platforms *
3 different libcs *
3 different (generation) toolchains *
5-6 different kernels

That matrix alone is hardly conceivable to deal with. In there are
arches that are genuinely weird (avr, anyone?), arches that have
arbitrary endianness, and arches that are 32-bit and 64-bit...
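The counts above multiply out quickly. Taking the low end of each
range (the counts come from the list above; multiplying them is my
arithmetic, and it ignores which combinations are actually valid or
tested):

```shell
# Lower-bound size of the build matrix described above.
arches=20; platforms=150; libcs=3; toolchains=3; kernels=5
combos=$((arches * platforms * libcs * toolchains * kernels))
echo "at least $combos possible build configurations"
# prints "at least 135000 possible build configurations"
```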

Add in well over a thousand software packages (everything from Apache
to zile), and you have an idea of how much code has dependencies on
other code...

For example, the breakage yesterday (or was it the day before?) was
in a minor update to libtool, as best I recall. It broke 3 packages
that cerowrt has available as options.

I'm looking forward, very much, to seeing the buildbot produce a
known-good build that I can layer my mere 67 patches and two dozen
packages on top of without having to think too much.
Post by d***@reed.com
Every project leader discovers this.
Cerowrt is an incredibly tiny superset of the openwrt project. I help
out where I can.
Post by d***@reed.com
Programmers are *lazy* and refuse to
check their inputs unless you shame them into compliance.
Volunteer programmers are not lazy.

They do, however, have limited resources, and prefer to make
progress rather than make things perfect. Hard-to-pass check-in tests
impede progress.

The fact that you or I can build an entire OS in a matter of hours
today, and have it work, most often befuddles me. This is tens of
millions of lines of code, all perfect, most of the time.

It used to take 500+ people to engineer an OS in 1992, and 4 days to
build it. I consider this progress.

There are all sorts of processes in place, and some can certainly be
improved. For example, last week we discussed methods for dealing
with and approving the backlog of patches submitted by other
volunteers.

It mostly just needs more eyeballs. And testing. There's a lot of good
stuff piled up.

http://patchwork.openwrt.org/project/openwrt/list/
d***@lang.hm
2012-04-06 03:07:10 UTC
Post by Dave Taht
A linear complete build of openwrt takes 17 hours on good hardware.
It's hard to build in parallel.
distcc doesn't work for this?
Post by Dave Taht
A parallel full build is about 3 hours but requires a bit of monitoring
can this monitoring be automated?

David Lang
Dave Taht
2012-04-08 15:53:10 UTC
Post by d***@lang.hm
Post by Dave Taht
A linear complete build of openwrt takes 17 hours on good hardware.
It's hard to build in parallel.
distcc doesn't work for this?
There are some things that distcc works well on: kernel builds, a
major C++ application, stuff like that.

Other things are subject to Amdahl's law and hopelessly serial,
notably creating toolchains, link steps, etc.
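Amdahl's law, as invoked above, puts a hard ceiling on what distcc
can buy: if a fraction p of the build is inherently serial, n build
hosts cap the speedup at 1/(p + (1-p)/n). A quick sketch (the 20%
serial fraction is an illustrative assumption, not a measurement of
the openwrt build):

```shell
# Speedup ceiling from Amdahl's law, assuming a 20% serial fraction.
p=0.2
for n in 2 8 48; do
    awk -v p="$p" -v n="$n" \
        'BEGIN { printf "n=%2d hosts: at most %.1fx faster\n", n, 1/(p + (1-p)/n) }'
done
```

Even at 48 hosts the ceiling here is under 5x, which is why the
serial toolchain and link steps dominate no matter how many boxes you
add.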

Much of the early stage of a build has been parallelized as much as
humanly possible.

The bulk of the build currently is in packages.

http://buildbot.openwrt.org:8010/builders/ar71xx/builds/127

To use distcc effectively in embedded work, you hit a limiting
factor: you need to continuously redistribute and install the whole
toolchain and all dependent packages to every building host, OR use a
shared filesystem, which rapidly becomes a limiting factor of its
own, not just on I/O but as a single point of failure.

Merely parallelizing package building across a multi-cpu box gets
hard as the number of cpus goes up. The number of potential
interactions between dependencies goes up, as does your I/O. Getting
all the dependencies right is a big job.

Currently the scaling factor for make -j (with no distcc) is roughly
1/2 the number of cpus for the two data points I have, and it appears
rather bound on I/O. I get roughly the same results with an old
4-core box with great I/O (4 disks, hardware raid) as with a more
modern 8-core box with merely mirrored drives.
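Plugging that ~1/2-of-cpus scaling into the 17-hour linear figure
from earlier in the thread gives a feel for the curve (the cpu counts
are illustrative; as noted, real boxes are I/O-bound, so this model
is rough at best):

```shell
# Rough build-time estimates: ~17h linear, effective parallelism ~cpus/2.
linear_min=$((17 * 60))                  # 1020 minutes
for cpus in 4 8 48; do
    eff=$((cpus / 2))                    # observed make -j scaling factor
    t=$((linear_min / eff))
    echo "${cpus} cpus: ~$((t / 60))h$((t % 60))m"
done
```

The 8-cpu estimate comes out somewhat above the ~3-hour parallel
figure quoted earlier, consistent with the bottleneck being I/O
rather than cpu count.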

If I were to throw, say, a 48-cpu box at this problem and do builds
entirely out of RAM (doable), I don't know how much further a
parallel build would scale. Certainly all the dependencies would have
to get worked out.

I sure wouldn't mind having a couple of these:

http://www.penguincomputing.com/hardware/linux_servers/configurator/intel/relion2800

or these:

http://www.penguincomputing.com/hardware/linux_servers/configurator/amd/altus1804

to play with.

Stuff like that, vs the cost of electricity and rack space for the
hardware we've had donated to this project, would probably pay for
itself inside of a year or two, and that doesn't count the very real
productivity improvement for everyone who could get a full build
turned around in under 19 hours.

Regrettably, up-front capital like that is hard to come by, as
bufferbloat.net is not an 'exciting new-age startup' with billions of
dollars per year of potential market cap. We're merely trying to save
billions of people a lot of headache, frustration, and time, and to
get the technology into everything, without any direct form of
recompense. It's kind of a harder sell. For some reason.

It's cheaper short term, if not long term, to bleed out electricity
and rack space monthly, try to have mental processes that cope with
overnight builds, and scavenge more free hardware wherever we can.

Incidentally, I did cost out using Amazon EC2, etc., last year, and
it was highway robbery, given the number of cpu cycles this task can
consume.
Post by d***@lang.hm
Post by Dave Taht
A parallel full build is about 3 hours but requires a bit of monitoring
can this monitoring be automated?
make -j 8
watch some seemingly random package fail to build
build its dependencies
build it
make -j 8
watch another somewhat random package fail to build

repeat until done

The next problem is that heavy parallelization messes up your
logging messages, so it's very hard to find where the error occurred.

These are solvable problems, with someone focused on the task, but
if you change '8' to '9', something else tends to break, and it's
architecture-dependent as well.
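The manual loop above can be sketched as a retry wrapper.
(Hypothetical: the function name, attempt cap, and per-attempt log
files are mine; and as noted, real failures often need a human to
rebuild dependencies first, which no wrapper captures.)

```shell
#!/bin/sh
# Re-run a build command until it succeeds or we give up, one log per pass.
# This automates only the "repeat until done" part of the loop above.
retry_build() {
    cmd=$1 max=$2 n=0
    while ! $cmd >"build-$((n + 1)).log" 2>&1; do
        n=$((n + 1))
        if [ "$n" -ge "$max" ]; then
            echo "gave up after $n attempts; see build-$n.log"
            return 1
        fi
        echo "attempt $n failed; retrying"
    done
    echo "succeeded on attempt $((n + 1))"
}

# e.g.: retry_build "make -j 8" 5
```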
Post by d***@lang.hm
David Lang
Post by Dave Taht
Incremental package builds are measured in minutes, however...
Post by d***@reed.com
Be damned politically incorrect about checkins that don't meet this
criterion - eliminate
the right to check in code for anyone who contributes something that breaks
functionality.
The number of core committers is quite low, too low, at present.
However the key problem here is that
the matrix of potential breakage is far larger than any one contribute
can deal with.
20 + fairly different cpu architectures *
150+ platforms *
3 different libcs *
3 different (generation) toolchains *
5-6 different kernels
That matrix alone is hardly concievable to deal with. In there are
arches that are genuinely weird (avr anyone), arches that have
arbitrary endian, arches that are 32 bit and 64 bit...
Add in well over a thousand software packages (everything from Apache
to zile), and you have an idea of how much code has dependencies on
other code...
For example, the breakage yesterday (or was it the day before) was in
a minor update to libtool, as best as I recall. It broke 3 packages
that cerowrt has available as options.
I'm looking forward, very much, to seeing the buildbot produce a
known, good build, that I can layer my mere 67 patches and two dozen
packages on top of without having to think too much.
Post by d***@reed.com
Every project leader discovers this.
Cerowrt is an incredibly tiny superset of the openwrt project. I help
out where I can.
Post by d***@reed.com
Programmers are *lazy* and refuse to
check their inputs unless you shame them into compliance.
Volunteer programmers are not lazy.
They do, however, have limited resources, and prefer to make progress
rather than make things perfect. Difficult to pass check-in tests
impeed progress.
The fact that you or I can build an entire OS, in a matter of hours,
today, and have it work, most often buffuddles me. This is 10s of
millions of lines of code, all perfect, most of the time.
It used to take 500+ people to engineer an os in 1992, and 4 days to
build. I consider this progress.
There are all sorts of processes in place, some can certainly be
improved. For example, discussed last week was methods for dealing
with and approving the backlog of submitted patches by other
volunteers.
It mostly just needs more eyeballs. And testing. There's a lot of good
stuff piled up.
http://patchwork.openwrt.org/project/openwrt/list/
Post by d***@reed.com
-----Original Message-----
Sent: Thursday, April 5, 2012 10:27pm
Subject: [Cerowrt-devel] Cero-state this week and last
I attended the ietf conference in Paris (virtually), particularly ccrg
and homenet.
I do encourage folk to pay attention to homenet if possible, as laying
out what home networks will look like in the next 10 years is proving
to be a hairball.
ccrg was productive.
I have been spending time fixing some infrastructural problems.
1) After be-ing blindsided by more continuous integration problems in
the last month than in the last 5, I found out that one of the root
causes was that the openwrt build cluster had declined in size from 8
boxes to 1(!!), and time between successful automated builds was in
some cases over a month.
The risk of going 1 to 0 build slaves seemed untenable. So I sprang
into action, scammed two boxes and travis has tossed them into the
cluster. Someone else volunteered a box.
I am a huge proponent of continuous integration on complex projects.
http://en.wikipedia.org/wiki/Continuous_integration
Building all the components of an OS like openwrt correctly, all the
time, with the dozens of developers involved, with a minimum delta
between commit, breakage, and fix, is really key to simplifying the
relatively simple task we face in bufferbloat.net of merely layering
on components and fixes improving the state of the art in networking.
The tgrid is still looking quite bad at the moment.
http://buildbot.openwrt.org:8010/tgrid
There's still a huge backlog of breakage.
But I hope it gets better. Certainly building a full cluster of build
If anyone would like to help hardware wise, or learn more about how to
manage a build cluster using buildbot, please contact travis
<thepeople AT openwrt.org>
2) Bloatlab #1 has been completely rewired and rebuilt and most of
the routers in there reflashed to Cerowrt-3.3.1-2 or later. They
survived some serious network abuse over the last couple days
(ironically the only router that crashed was the last rc6 box I had in
the mix - and not due to a network fault! I ran it out of flash with a
logging tool).
To deal with the complexity in there (there's also a sub-lab for some
sdnat and PCP testing), I ended up with a new ipv6 /48 and some better
ways to route that I'll write up soon.
3) I finally got back to fully working builds for the ar71xx
(cerowrt) architecture a few days ago. I also have a working 3.3.1
kernel for the x86_64 build I use to test the server side.
(bufferbloat is NOT just a router problem. Fixing all sides of a
connection helps a lot). That + a new iproute2 + the debloat script
and YOU TOO can experience orders of magnitude less latency....
http://europa.lab.bufferbloat.net/debloat/ has that 3.3.1 kernel for x86_64
Most of the past week has been backwards rather than forwards, but it
was negative in a good way, mostly.
I'm sorry it's been three weeks without a viable build for others to test.
4) today's build: http://huchra.bufferbloat.net/~cero1/3.3/3.3.1-4/
+ Linux 3.3.1 (this is missing the sfq patch I liked, but it's good enough)
+ Working wifi is back
+ No more fiddling with ethtool tx rings (up to 64 from 2. BQL does
this job better)
+ TCP CUBIC is now the default (no longer westwood)
after 15+ years of misplaced faith in delay-based TCP for wireless,
I've collected enough data to convince me that CUBIC wins. All the
time.
+ alttcp enabled (making it easy to switch)
+ latest netperf from svn (yay! remotely changeable diffserv settings
for a test tool!)
- still horrible dependencies on time. You pretty much have to get on
it and do a rndc validation disable multiple times, restart ntp
multiple times, killall named multiple times to get anywhere if you
want to get dns inside of 10 minutes.
At this point sometimes I just turn off named in /etc/xinetd.d/named
and turn on port 53 for dnsmasq... but
usually after flashing it the first time, wait 10 minutes (let it
clean flash), reboot, wait another 10, then it works. Drives me
crazy... Once it's up and has valid time and is working, dnssec works
great but....
+ way cool new stuff in dnsmasq for ra and AAAA records
- huge dependency on keeping bind in there
- aqm-scripts. I have not succeeded in making hfsc work right. Period.
+ HTB (vs hfsc) is proving far more tractable. SFQRED is scaling
better than I'd dreamed. Maybe eric dreamed this big, I didn't.
- http://www.bufferbloat.net/issues/352
+ Added some essential randomness back into the entropy pool
- hostapd really acts up at high rates with the hack in there for more
entropy (
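The HTB + SFQRED combination mentioned in the list above can be sketched roughly as follows. This is a hypothetical illustration, not the actual aqm-scripts: the interface name (ge00), the 4mbit rate, and all the SFQRED numeric parameters are placeholders of mine, assuming a 3.3-era kernel and an iproute2 with Eric Dumazet's SFQRED support.

```shell
# Hypothetical sketch of an HTB + SFQRED egress shaper.
# Assumes: kernel/iproute2 3.3 with SFQRED support; "ge00" and the
# 4mbit uplink rate are placeholders for your WAN interface and speed.

IF=ge00
RATE=4mbit

# Root HTB qdisc: shape to the uplink rate so the queue stays in the
# router, where we can manage it, instead of in the modem.
tc qdisc add dev $IF root handle 1: htb default 11
tc class add dev $IF parent 1: classid 1:1 htb rate $RATE ceil $RATE
tc class add dev $IF parent 1:1 classid 1:11 htb rate $RATE ceil $RATE

# SFQ with per-flow RED-style marking/dropping (SFQRED), head drop,
# and ECN marking before hard drops. All values are illustrative.
tc qdisc add dev $IF parent 1:11 handle 110: sfq \
    limit 300 headdrop flows 512 divisor 16384 \
    redflowlimit 100000 min 8000 max 60000 probability 0.2 ecn
```

Shaping to (or just below) line rate plus HTB-over-hfsc is the approach the text describes; every number here is a guess to be tuned per link.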
d***@lang.hm
2012-04-09 02:57:24 UTC
Permalink
Post by Dave Taht
If I were to throw, say, a 48 cpu box at this problem, and do builds
entirely out of ram (doable) I don't know how much further up a
parallel build would scale. Certainly all the dependencies would have
to get worked out.
http://www.penguincomputing.com/hardware/linux_servers/configurator/intel/relion2800
http://www.penguincomputing.com/hardware/linux_servers/configurator/amd/altus1804
to play with.
The ROI of stuff like that... vs the cost of electricity and rack
space of the hardware we've had donated to this project - would
probably pay off inside of a year or two, and that doesn't count the
very real productivity improvement for everyone that could get a full
build turned around in under 19 hours.
Regrettably up-front capital like that is hard to come by, as
bufferbloat.net is not an 'exciting new age startup' with billions of
dollars per year of potential market cap. We're merely trying to save
billions of people a lot of headache, frustration, and time, and get
the technology into everything without any direct form of
recompense. It's kind of a harder sell. For some reason.
It's cheaper short term, if not long term, to bleed out electricity
and rack space monthly, and try to have mental processes that cope
with overnight builds, and scavenge more free hardware wherever we
can.
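As a back-of-the-envelope answer to the 48-cpu scaling question above: treating the thread's 17-hour linear build and ~3-hour parallel build as two data points, a crude Amdahl's-law fit suggests where a bigger box would land. The assumption that the 3-hour figure came from roughly 8 cores is mine, purely for illustration:

```shell
# Crude Amdahl's-law fit: T(n) = T1*(s + (1-s)/n), using the thread's
# 17h linear and ~3h parallel build times. n=8 for the parallel build
# is an assumed (hypothetical) core count, not a measured one.
awk 'BEGIN {
  T1 = 17; Tn = 3; n = 8
  s = (Tn/T1 - 1/n) / (1 - 1/n)      # implied serial fraction
  T48 = T1 * (s + (1 - s)/48)        # projected time on 48 cpus
  printf "serial fraction ~%.3f, projected 48-cpu build ~%.2f hours\n", s, T48
}'
```

With these numbers the serial fraction comes out to about 1/17, so even 48 cpus only gets the build down to roughly an hour and twenty minutes; untangling the dependencies is what would shrink the serial fraction itself.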
This sounds like the sort of thing that Kickstarter (or similar) is
designed for.

you may also want to think about asking some of these companies for
donations directly.
Post by Dave Taht
Incidentally I did cost out using amazon EC2, etc, last year, and that
was highway robbery, given the amount of cpu cycles this task can
consume.
yeah, I did some pricing on this sort of thing and found that if you
needed a system as much as 2 hours a day, you were better off just getting
a dedicated box from someplace like serverbeach.
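The 2-hours-a-day rule of thumb above is easy to sanity-check. The rates below are purely hypothetical placeholders (neither EC2's nor ServerBeach's actual 2012 pricing); the point is the shape of the breakeven calculation:

```shell
# Breakeven duty cycle between renting compute by the hour and a flat
# monthly dedicated box. Both rates are hypothetical placeholders.
awk 'BEGIN {
  hourly    = 1.60      # $/hour, on-demand (hypothetical)
  dedicated = 99.00     # $/month, dedicated box (hypothetical)
  breakeven = dedicated / (hourly * 30)  # hours/day where costs match
  printf "breakeven: ~%.1f hours/day\n", breakeven
}'
```

Anything past roughly two hours a day of sustained use and the dedicated box wins, which matches the experience described above.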

David Lang

d***@reed.com
2012-04-06 04:09:32 UTC
Permalink
I understand this. At the end of the day, however, *regression tests* matter, as well as tests to verify that new functionality actually works.

I've managed projects with 200 daily "committers". Unless those committers get immediate feedback on what they break (accidentally) and design the tests for their new functionality so that others don't break what they carefully craft, projects go south and never recover.

You don't have that rate of committers here, but it's not really an excuse to say - "we have to jam in code without testing it because we don't have a discipline of testing and it's a waste of time".

50% of what a developer should be doing (if not more) is making sure that they don't break more than they improve.

I realize this is tough, not fun, and sometimes very frustrating. But cool "new stuff" is far less important than keeping stuff stable.

I'm not trying to be negative - this is stuff I learned at huge personal cost in very high stress environments where people were literally screaming at me every hour of every day.

The cerowrt/bufferbloat stuff is worth doing, and it's worth doing right - I'm a fan.

-----Original Message-----
From: "Dave Taht" <***@gmail.com>
Sent: Thursday, April 5, 2012 10:50pm
To: ***@reed.com
Cc: cerowrt-***@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] Cero-state this week and last
Dave Taht
2012-04-08 17:31:01 UTC
Permalink
Post by d***@reed.com
I understand this. At the end of the day, however, *regression tests*
matter, as well as tests to verify that new functionality actually works.
The problems with developing global test suites, particularly when
dealing with embedded hardware, are manifold. I certainly would like
to have a full test suite that I could run on any router, rather than
the ad-hoc collection of tests I run now.

One of the problems we have is that we are testing for new problems,
and by definition you don't know what those are; and after you fix
one, you then need to develop a viable test for it.

I can think of hundreds of things fixed in the past year that I'd
like to test for, not just on this hardware, or this software, but
over the internet, E2E.

Example: in June of last year, there was a 10-year-old bug in how
ECN-enabled packets were prioritized by the *default* pfifo_fast
qdisc. I'll argue that this has skewed every study published about it
in the last decade as well, and all that data and those academic
papers need to be reanalyzed - or preferably, thrown out, and we need
to start over.
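From my reading of the pre-fix code, the problem was that the route code derived skb->priority from the TOS byte with a mask (0x1e) that still included one of the two ECN bits, and the odd table slots that ECN-marked packets then landed in were TC_PRIO_FILLER, which pfifo_fast's default priomap sends to the lowest-priority band. The tables below are my reconstruction from memory and should be treated as illustrative rather than authoritative:

```shell
# Illustrative reconstruction of the pre-fix lookup path:
#   prio = ip_tos2prio[(tos & 0x1e) >> 1];  band = priomap[prio]
# Both tables are reconstructed from memory of the 2.6-era source;
# treat them as illustrative, not authoritative.
awk 'BEGIN {
  # ip_tos2prio (pre-fix): the odd slots are TC_PRIO_FILLER (1)
  split("0 1 0 1 2 1 2 1 6 1 6 1 4 1 4 1", tos2prio, " ")
  # pfifo_fast default priomap: band 0 is serviced first, band 2 last
  split("1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1", priomap, " ")
  # TOS values: 0 (best effort), 2 (best effort + ECT(0)),
  #             16 (minimize-delay, e.g. ssh), 18 (the same + ECT(0))
  n = split("0 2 16 18", toslist, " ")
  for (i = 1; i <= n; i++) {
    tos  = toslist[i]
    idx  = int((tos % 32) / 2)       # (tos & 0x1e) >> 1, POSIX-awk style
    prio = tos2prio[idx + 1]
    band = priomap[prio + 1]
    printf "TOS 0x%02x -> prio %d -> band %d\n", tos, prio, band
  }
}'
```

With these tables, setting ECT(0) demotes both a best-effort flow and an interactive (minimize-delay) flow to band 2, serviced last - an ECN-enabled ssh session quietly lands below ordinary best-effort traffic, which is exactly the kind of skew that would contaminate a decade of ECN measurements.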

The harder problems, beyond writing the tests themselves, are:

1) Which problems are important? What tests are valid? What is repeatable?
2) Who will write the test?
3) How can the test be deployed?
4) How often does it need to run?
5) How can the data be analyzed?
6) How does that work get paid for?

I am happy that I have ONE ECN-related study to rely on, from Steve
Bauer at MIT. I'd love to have more, and one of my other projects
(thumbgps) is going to give us a baseline for investigating a bunch of
similar issues. I'm happy to have a solution emerging to problems 3
and 4 above. 1, 2, 5, and 6 remain unsolved.
Post by d***@reed.com
I've managed projects with 200 daily "committers".  Unless those committers
get immediate feedback on what they break (accidentally) and design the
tests for their new functionality so that others don't break what they
carefully craft, projects go south and never recover.
If I need to establish cred here, I've been a part of projects with
far more committers, and have managed up to about 70 staff myself.

When working with the open source community, which is mostly
volunteers, there is no way to be dictatorial. Consensus needs to be
sought. Needed work that nobody wants to do has to get paid for.

The linux kernel - which is at best, 1/10th of the overall code base
in openwrt - goes through about 10,000 changesets every quarter. The
3.2 development cycle had nearly 1300 committers.

http://lwn.net/Articles/472852/

They all have their own processes for quality control, and somehow
manage to produce a usable system on a reliable basis.

And that's just the kernel portion of the problem!

A better word for the kind of work that goes on in developing an OS
like a redhat or openwrt is 'packaging'. You don't actually have
conventional 'developers'; packagers are a rather different category
of developer, skilled in make and cross-compilation techniques, and
many have a good familiarity with architecture-level issues (like
endianness, word size, and the innards of a given architecture like
arm or mips). They are capable of basic coding in dozens of languages,
all on the same day, but are highly skilled in only one or two, at best.

In this way they tend to be something of a hybrid between sysadmin
and developer. And they do care, a lot, about the quality of the
engineering; they try to push patches back to the developer, and to
make sysadmins' (and users') lives easier.

Packagers do have standards for quality control, and do test on their
own platforms; but as the potential test matrix has several thousand
permutations, they rely on a tool like buildbot, which tries to give
at least some wider coverage to the most common combinations, and
leverages technology to give them adequate feedback to iteratively
get it right.

Like anything else, it can always be done better. In an ideal world, a
packager could do a test commit, get it built against all those
permutations, have it run for 24 hours, exhaustively checking all the
in-built functionality, have its memory and cpu use analyzed, and get
the result back a few seconds later. Aside from needing Dr Who to help
out on parts of that, it's hard, expensive, and something of a tragedy
of the commons - somebody has to pay for all that infrastructure,
electricity, and testing for everyone to benefit.

The gradually improving recent buildbot disaster is a case in point.
It was working. A bunch of machines died over time. Nobody was paying
attention. It became a disaster. We fixed it with bailing wire, scotch
tape and stolen resources.

I'd certainly love it if we had the budget to make the logical
thing - build everything, all the time - practical. The same goes for
developing regression tests, and for having racks of hardware that
can be reburnt and re-tested every day.

This sort of pattern repeats in nearly every low budget project
(volunteer or corporate sponsored), but unfortunately Elon doesn't
hang out with us, and isn't going to fly in with the liquid oxygen.

As for regression testing, regressions against what? (the answer is
too large to fit in the margins of this email)

Certainly multiple companies make wireless test suites (one has been
actively helping out, actually), there are dozens of benchmark suites,
there are zillions of subcomponent tests...

and in this market, razor thin margins on the vendors side, as well as
the ISPs. Now, I like to think that our governments and society are
waking up to the chaos that can ensue if the internet goes down, corps
are realizing that ipv4 can't last forever and ipv6 has to be made
deployable e2e, and maybe there's a shift in thinking that making the
Internet just work is a civil engineering job that *has* to be done
right ( http://esr.ibiblio.org/?p=4213 http://esr.ibiblio.org/?p=4196
)...

but at the end of the day we just have to do the best engineering we
can with the resources available.
Post by d***@reed.com
You don't have that rate of committers here, but it's not really an excuse
to say -
Well, in some ways we do. Adding in a new kernel requires depending
on a multitude of other people having got it right. The same goes for
the other thousand packages.

It has taken a year and a ton of effort (from multiple volunteers) to
get from where the cerowrt kernel lagged the mainline kernel by 3
versions, down to where it is only going to lag by 1. That effort was
necessary if we wanted to be able to do work on both x86 and a
router simultaneously while investigating bufferbloat, security, and
ipv6, and be able to move forward (and back and forth) with a minimum
of backporting. That portion of the effort has eaten more of my time
this year than I care to think about.

At the time we started hacking on cerowrt, most commercial embedded
products were based on 5 year old kernels, or older, due to how
difficult it is to track the mainline, and a perceived lack of demand
from consumers for new stuff, despite the ISPs increasing frustration
with what's being shipped today not meeting their needs or
expectations.

We are trying to change that - in part by listening to the screams of
ISPs like comcast - but also by trying out new technologies such as
fixes for bufferbloat, ipv6, and radical concepts like ccnx and
openhip - to attract geeks and early adopters - to get more of the
needed work done.

Still, an effort well beyond the original scope of the "wide" project
seems needed to get ipv6 rolled out. The theoretical breakthroughs
required to fix bufferbloat seem almost trivial in comparison.
Post by d***@reed.com
"we have to jam in code without testing it because we don't have a
discipline of testing and it's a waste of time".
It's a matter of having enough distributed testing.
Post by d***@reed.com
50% of what a developer should be doing (if not more) is making sure that
they don't break more than they improve.
so try 'packaging' rather than developing, and wrap your head around
the test matrix problem.
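To put a number on that matrix, multiply out the factor counts from earlier in this thread (20+ architectures, 150+ platforms, 3 libcs, 3 toolchain generations, 5-6 kernels):

```shell
# Lower bound on the build/test matrix, using the factor counts from
# earlier in this thread (20 arches, 150 platforms, 3 libcs,
# 3 toolchains, 5 kernels).
echo $(( 20 * 150 * 3 * 3 * 5 ))    # combinations, before you even
                                    # multiply by ~1000 packages
```

That's 135,000 combinations; even covering one in a thousand of them per release would mean over a hundred full builds, hence the buildbot triage of the most common combinations rather than exhaustive coverage.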
Post by d***@reed.com
I realize this is tough, not fun, and sometimes very frustrating.  But cool
"new stuff" is far less important than keeping stuff stable.
This is a classic tension. I note that we're trying to fix the
internet here, before *it* goes unstable.

So a great deal of change and r&d is needed, and yes, it needs to be
managed well, but stability only qualifies as a goal in limited ways.
Post by d***@reed.com
I'm not trying to be negative - this is stuff I learned at huge personal
cost in very high stress environments where people were literally screaming
at me every hour of every day.
I have been in those too. I would say that the amount of stress I've
put myself under, trying to ship something by the end of this month,
compares closely. Personally I would like to offload about 95% of
what I currently do, so I could focus on what's truly important.
I'm glad we have more and more volunteers, self identifying problems,
leaping forward and going out on their own, to go fix them.

Still, the seat I wish I was sitting in now, with the resources I
wish I could command, is Bob Taylor's, circa 1968 or so.

http://en.wikipedia.org/wiki/Robert_Taylor_%28computer_scientist%29

He's always been a real inspiration to me.
Post by d***@reed.com
The cerowrt/bufferbloat stuff is worth doing, and it's worth doing right - I'm a fan.
THX!
d***@reed.com
2012-04-09 01:57:21 UTC
Permalink
Thanks for the incredibly thoughtful response. I get the "packager" issue. It really compounds the problem if the upstream folks don't bother to focus on quality at the level one needs, coupled with the upstream folks' goals being different from the packager's.

Regarding Bob Taylor's resources... I worked closely with Bob and his various team members in a number of dimensions, including consulting for him. You are right that CeroWRT does not have that kind of resource. (I have been trying hard in other venues relating to radio issues that are really important to me to find a way to assemble a coordinated set of resources at such a scale, and I've failed so far. Still trying.)

However, I think that the technical issue one could work on in this respect is a way to create high-level system tests of routing functionality and performance that would be independent of hardware configuration, and also capable of creating a network environment that would avoid regression.

Jay Lepreau created a very nice platform framework at Utah that the network "innovation" community might be able to "copy" (where virtualized "networks" could be configured and tested).

I'd be happy personally, for example, to provide some resources on my various home networks (the one in my home, and the ones in my "cloud" instances) to run "system tests" on new releases of CeroWRT and other systems - *if* it was run in a way that did not disrupt my other work, using a bounded percentage of capacity and devices.

I participated in the PlanetLab project that HP and Intel supported, coordinated by Princeton and others.

This was a model for a kind of "co-op" that incorporated networked resources.

We need to create a generic networking innovation framework that is *independent* of ISOC, IETF, Verizon, Cisco, ATT, Alcatel-Lucent, etc. Those guys may *help* but they should not be able to block experimentation or innovation (which was the point of PlanetLab).

-----Original Message-----
From: "Dave Taht" <***@gmail.com>
Sent: Sunday, April 8, 2012 1:31pm
To: ***@reed.com
Cc: cerowrt-***@lists.bufferbloat.net
Subject: Re: [Cerowrt-devel] Cero-state this week and last
Post by d***@reed.com
I understand this. In the end of the day, however, *regression tests*
matter, as well as tests to verify that new functionality actually works.
The problems with developing global test suites, particularly when dealing with
embedded hardware, are manyfold. I certainly would like to have a full
test suite
that I could run on any router rather than the ad-hoc collection of
tests I run now.

One of the problems we have is that we are testing for new problems,
and by definition, we don't know what those are; after you fix one,
you need to develop a viable test for it.

I can think of hundreds of things fixed in the past year that I'd like
to test for, not just on this hardware, or this software, but over the
internet, E2E.

Example: in June of last year, a 10-year-old bug was found in how
ECN-enabled packets were prioritized by the *default* pfifo_fast
qdisc. I'll argue that this has skewed every study published about ECN
in the last decade as well, and all that data and those academic
papers need to be reanalyzed - or preferably, thrown out, so we can
start over.
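A sketch of why ECN bits could perturb pfifo_fast's band selection (the exact masking is from memory and should be checked against the kernel source of that era): pfifo_fast picks one of its three bands by indexing a priomap with a value derived from the TOS byte, and the ECN field occupies the low two bits of that same byte, one of which survived the mask.

```shell
# The priomap index was derived roughly as (tos & 0x1e) >> 1.
# The 0x1e mask keeps bit 1, which is also one of the two ECN bits,
# so setting ECT(0) on an otherwise identical packet changed the
# index - and hence which band (priority) the packet landed in.
tos_plain=0x10                    # "minimize delay", ECN off
tos_ect0=$(( tos_plain | 0x02 ))  # same DSCP, ECT(0) set
echo $(( (tos_plain & 0x1e) >> 1 ))  # prints 8
echo $(( (tos_ect0  & 0x1e) >> 1 ))  # prints 9 - a different band
```

The fix was simply to mask the ECN bits out of the lookup, but any measurement that compared ECN-on against ECN-off traffic through a default qdisc before then was comparing differently-prioritized traffic.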

The harder problem than writing the tests are:

1) Which problems are important? What tests are valid? What is repeatable?
2) Who will write the test?
3) How can the test be deployed?
4) How often does it need to run?
5) How can the data be analyzed?
6) How does that work get paid for?

I am happy that I have ONE ECN-related study to rely on, from Steve
Bauer at MIT. I'd love to have more, and one of my other projects
(thumbgps) is going to give us a baseline for investigating a bunch of
similar issues. I'm happy to have a solution emerging to problems 3
and 4 above. 1,2,5 and 6 remain unsolved.
Post by d***@reed.com
I've managed projects with 200 daily "committers". Unless those committers
get immediate feedback on what they break (accidentally) and design the
tests for their new functionality so that others don't break what they
carefully craft, projects go south and never recover.
If I need to establish cred here, I've been a part of projects with
far more committers and managed up to about 70 staff myself.

When working with the open source community, which is mostly
volunteers, there is no way to be dictatorial. Consensus needs to be
sought. Needed stuff that nobody wants to do has to get paid for.

The linux kernel - which is, at best, 1/10th of the overall code base
in openwrt - goes through about 10,000 changesets every quarter. The
3.2 development cycle had nearly 1300 committers.

http://lwn.net/Articles/472852/

They all have their own processes for quality control, and somehow
manage to produce a usable system on a reliable basis.

And that's just the kernel portion of the problem!

A better word for the kind of work that goes on in developing an OS
like a Red Hat or openwrt is 'packaging'. You don't actually have
conventional 'developers'; packagers are a rather different category of
developer, skilled in make and cross-compilation techniques, and many
have a good familiarity with architecture-level issues (like
endianness and bit size and the innards of a given architecture like arm
or mips). They are capable of basic coding in dozens of languages, all
on the same day, but are only highly skilled in one or two, at best.

In this way they tend to be something of a hybrid between sysadmin and
developer. And they do care, a lot, about the quality of the
engineering, try to push patches back to the developer, and make
sysadmins' (and users') lives easier.

Packagers do have standards for quality control, and do test on their
own platforms, but as the potential test matrix has
several thousand permutations, they rely on a tool like a buildbot that
tries to give at least some wider coverage to the most common
combinations, and leverages automation to give them adequate feedback
to iteratively get it right.

Like anything else, it can always be done better. In an ideal world, a
packager could do a test commit, get it built against all those
permutations, have it run for 24 hours, exhaustively checking all the
in-built functionality, have its memory and cpu use analyzed, and get
the result back a few seconds later. Aside from needing Dr Who to help
out on parts of that, it's hard, expensive, and something of a tragedy
of the commons - somebody has to pay for all that infrastructure,
electricity, and testing for everyone to benefit.

The gradually improving recent buildbot-disaster is a case-in-point.
It was working. A bunch of machines died over time. Nobody was paying
attention. It became a disaster. We fixed it with bailing wire, scotch
tape and stolen resources.

I'd certainly love it if we had the budget to where doing the logical
thing - build everything, all the time - was practical. The same goes
for developing regression tests, and for having racks of hardware that
can be reburnt and re-tested every day.

This sort of pattern repeats in nearly every low budget project
(volunteer or corporate sponsored), but unfortunately Elon doesn't
hang out with us, and isn't going to fly in with the liquid oxygen.

As for regression testing, regressions against what? (the answer is
too large to fit in the margins of this email)

Certainly multiple companies make wireless test suites (one has been
actively helping out, actually), there are dozens of benchmark suites,
there are zillions of subcomponent tests...

and in this market there are razor-thin margins on the vendors' side, as
well as the ISPs'. Now, I like to think that our governments and society are
waking up to the chaos that can ensue if the internet goes down, corps
are realizing that ipv4 can't last forever and ipv6 has to be made
deployable e2e, and maybe there's a shift in thinking that making the
Internet just work is a civil engineering job that *has* to be done
right ( http://esr.ibiblio.org/?p=4213 http://esr.ibiblio.org/?p=4196
)...

but at the end of the day we just have to do the best engineering we
can with the resources available.
Post by d***@reed.com
You don't have that rate of committers here, but it's not really an excuse
to say -
Well, in some ways we do. Adding in a new kernel requires depending on
a multitude of other people having got it right.
Same goes for the other thousand packages.

It has taken a year and a ton of effort (from multiple volunteers) to
get from where the cerowrt kernel lagged the mainline kernel by 3
versions, down to where it is only going to lag by 1. That effort was
necessary if we wanted to be able to do work on both x86 and a
router simultaneously while investigating bufferbloat, security, and
ipv6, and be able to move forward (And back and forth) with a minimum
of backporting. That portion of the effort has eaten more of my time
this year than I care to think about.

At the time we started hacking on cerowrt, most commercial embedded
products were based on 5 year old kernels, or older, due to how
difficult it is to track the mainline, and a perceived lack of demand
from consumers for new stuff, despite the ISPs increasing frustration
with what's being shipped today not meeting their needs or
expectations.

We are trying to change that - in part by listening to the screams of
ISPs like comcast - but in also trying out new technologies such as
fixes for bufferbloat, ipv6, radical concepts like ccnx and openhip -
to be geek and early adopter attractors - to get more of the needed
work done.

Still, an effort well beyond the original scope of the "wide" project
seems needed to get ipv6 rolled out. The theoretical breakthroughs
required to fix bufferbloat seem almost trivial in comparison.
Post by d***@reed.com
"we have to jam in code without testing it because we don't have a
discipline of testing and it's a waste of time".
It's a matter of having enough distributed testing.
Post by d***@reed.com
50% of what a developer should be doing (if not more) is making sure that
they don't break more than they improve.
so try 'packaging' rather than developing, and wrap your head around
the test matrix problem.
Post by d***@reed.com
I realize this is tough, not fun, and sometimes very frustrating. But cool
"new stuff" is far less important than keeping stuff stable.
This is a classic tension. I note that we're trying to fix the
internet here, before *it* goes unstable.

So a great deal of change and r&d is needed, and yes, it needs to be
managed well, but stability only qualifies as a goal in limited ways.
Post by d***@reed.com
I'm not trying to be negative - this is stuff I learned at huge personal
cost in very high stress environments where people were literally screaming
at me every hour of every day.
I have been in those too. I would say that the amount of stress I've
put myself under, trying to ship something by the end of this month,
compares closely. Personally I would like to offload about 95%
of what I currently do, so I could focus on what's truly important.
I'm glad we have more and more volunteers, self identifying problems,
leaping forward and going out on their own, to go fix them.

Still, the seat I wish I was sitting in now, with the resources I wish I
could command, is Bob Taylor's, circa 1968 or so.

http://en.wikipedia.org/wiki/Robert_Taylor_%28computer_scientist%29

He's always been a real inspiration to me.
Post by d***@reed.com
The cerowrt/bufferbloat stuff is worth doing, and it's worth doing right - I'm a fan.
THX!
Post by d***@reed.com
-----Original Message-----
Sent: Thursday, April 5, 2012 10:50pm
Subject: Re: [Cerowrt-devel] Cero-state this week and last
Post by d***@reed.com
A small suggestion.
Create a regression test suite, and require contributors to *pass* the test
with each submitted patch set.
A linear complete build of openwrt takes 17 hours on good hardware.
It's hard to build in parallel. A parallel full build takes about 3
hours, but requires a bit of monitoring.
Incremental package builds are measured in minutes, however...
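For context, the usual workflow in an openwrt build tree looks something like this (dnsmasq is just an example package name; V=s turns on verbose logging):

```shell
# full build of everything, parallelized across 4 jobs: hours
make -j4 V=s

# rebuild a single package incrementally: minutes
make package/dnsmasq/{clean,compile} V=s
```

That gap between minutes and hours is exactly why nobody runs the full matrix before every check-in.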
Post by d***@reed.com
Be damned politically incorrect about checkins that don't meet this criterion - eliminate
the right to check in code for anyone who contributes something that breaks
functionality.
The number of core committers is quite low - too low - at present.
However, the key problem here is that
the matrix of potential breakage is far larger than any one contributor
can deal with.
20+ fairly different cpu architectures *
150+ platforms *
3 different libcs *
3 different (generation) toolchains *
5-6 different kernels
That matrix alone is hardly conceivable to deal with. In there are
arches that are genuinely weird (AVR, anyone?), arches that have
arbitrary endianness, arches that are 32-bit and 64-bit...
Add in well over a thousand software packages (everything from Apache
to zile), and you have an idea of how much code has dependencies on
other code...
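Back-of-the-envelope, taking the low end of the counts above gives a rough count of distinct build configurations (most are never actually exercised, but this is why exhaustive pre-commit testing is infeasible):

```shell
# arches * platforms * libcs * toolchains * kernels
echo $(( 20 * 150 * 3 * 3 * 5 ))   # prints 135000
```

And that is before multiplying by a thousand-odd optional packages and their interdependencies.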
For example, the breakage yesterday (or was it the day before) was in
a minor update to libtool, as best as I recall. It broke 3 packages
that cerowrt has available as options.
I'm looking forward, very much, to seeing the buildbot produce a
known, good build, that I can layer my mere 67 patches and two dozen
packages on top of without having to think too much.
Post by d***@reed.com
Every project leader discovers this.
Cerowrt is an incredibly tiny superset of the openwrt project. I help
out where I can.
Post by d***@reed.com
Programmers are *lazy* and refuse to
check their inputs unless you shame them into compliance.
Volunteer programmers are not lazy.
They do, however, have limited resources, and prefer to make progress
rather than make things perfect. Difficult-to-pass check-in tests
impede progress.
The fact that you or I can build an entire OS, in a matter of hours,
today, and have it work, most often befuddles me. This is 10s of
millions of lines of code, all perfect, most of the time.
It used to take 500+ people to engineer an OS in 1992, and 4 days to
build. I consider this progress.
There are all sorts of processes in place, and some can certainly be
improved. For example, discussed last week were methods for dealing
with and approving the backlog of patches submitted by other
volunteers.
It mostly just needs more eyeballs. And testing. There's a lot of good
stuff piled up.
http://patchwork.openwrt.org/project/openwrt/list/
Post by d***@reed.com
-----Original Message-----
Sent: Thursday, April 5, 2012 10:27pm
Subject: [Cerowrt-devel] Cero-state this week and last
I attended the ietf conference in Paris (virtually), particularly ccrg
and homenet.
I do encourage folk to pay attention to homenet if possible, as laying
out what home networks will look like in the next 10 years is proving
to be a hairball.
ccrg was productive.
I have been spending time fixing some infrastructural problems.
1) After being blindsided by more continuous integration problems in
the last month than in the last 5, I found out that one of the root
causes was that the openwrt build cluster had declined in size from 8
boxes to 1(!!), and time between successful automated builds was in
some cases over a month.
The risk of going 1 to 0 build slaves seemed untenable. So I sprang
into action, scammed two boxes and travis has tossed them into the
cluster. Someone else volunteered a box.
I am a huge proponent of continuous integration on complex projects.
http://en.wikipedia.org/wiki/Continuous_integration
Building all the components of an OS like openwrt correctly, all the
time, with the dozens of developers involved, with a minimum delta
between commit, breakage, and fix, is really key to simplifying the
relatively simple task we face in bufferbloat.net of merely layering
on components and fixes improving the state of the art in networking.
The tgrid is still looking quite bad at the moment.
http://buildbot.openwrt.org:8010/tgrid
There's still a huge backlog of breakage.
But I hope it gets better. Certainly building a full cluster of build
boxes or VMs would help a lot more.
If anyone would like to help hardware wise, or learn more about how to
manage a build cluster using buildbot, please contact travis
<thepeople AT openwrt.org>
2) Bloatlab #1 has been completely rewired and rebuilt and most of
the routers in there reflashed to Cerowrt-3.3.1-2 or later. They
survived some serious network abuse over the last couple days
(ironically the only router that crashed was the last rc6 box I had in
the mix - and not due to a network fault! I ran it out of flash with a
logging tool).
To deal with the complexity in there (there's also a sub-lab for some
sdnat and PCP testing), I ended up with a new ipv6 /48 and some better
ways to route that I'll write up soon.
3) I did finally get back to fully working builds for the ar71xx
(cerowrt) architecture a few days ago. I also have a working 3.3.1
kernel for the x86_64 build I use to test the server side.
(bufferbloat is NOT just a router problem. Fixing all sides of a
connection helps a lot). That + a new iproute2 + the debloat script
and YOU TOO can experience orders of magnitude less latency....
http://europa.lab.bufferbloat.net/debloat/ has that 3.3.1 kernel for x86_64
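A minimal taste of the server-side debloating (the real debloat script at that URL does considerably more; eth0 is an assumed interface name, and the byte limit is illustrative):

```shell
# replace the default pfifo_fast root qdisc with SFQ for per-flow fairness
tc qdisc replace dev eth0 root sfq

# with BQL (new in 3.3), cap the bytes the driver may keep in flight
echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
```

Both commands need root, and the BQL sysfs knob only exists on drivers that implement BQL.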
Most of the past week has been backwards rather than forwards, but it
was negative in a good way, mostly.
I'm sorry it's been three weeks without a viable build for others to test.
4) today's build: http://huchra.bufferbloat.net/~cero1/3.3/3.3.1-4/
+ Linux 3.3.1 (this is missing the sfq patch I liked, but it's good enough)
+ Working wifi is back
+ No more fiddling with ethtool tx rings (up to 64 from 2. BQL does
this job better)
+ TCP CUBIC is now the default (no longer westwood)
After 15+ years of misplaced faith in delay-based TCP for wireless,
I've collected enough data to convince me that CUBIC wins. All the
time.
+ alttcp enabled (making it easy to switch)
+ latest netperf from svn (yea! remotely changeable diffserv settings
for a test tool!)
- still horrible dependencies on time. You pretty much have to get on
it and do an rndc validation disable multiple times, restart ntp
multiple times, and killall named multiple times to get anywhere if you
want working dns inside of 10 minutes.
At this point sometimes I just turn off named in /etc/xinetd.d/named
and turn on port 53 for dnsmasq... but
usually after flashing it the first time, wait 10 minutes (let it
clean flash), reboot, wait another 10, then it works. Drives me
crazy... Once it's up and has valid time and is working, dnssec works
great but....
+ way cool new stuff in dnsmasq for ra and AAAA records
- huge dependency on keeping bind in there
- aqm-scripts. I have not succeeded in making hfsc work right. Period.
+ HTB (vs hfsc) is proving far more tractable. SFQRED is scaling
better than I'd dreamed. Maybe Eric dreamed this big; I didn't.
- http://www.bufferbloat.net/issues/352
+ Added some essential randomness back into the entropy pool
- hostapd really acts up at high rates
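A rough sketch of the HTB + SFQRED combination referred to above (the ge00 device name and the 20mbit rate are assumptions; the actual aqm-scripts differ in detail):

```shell
# shape to just under link rate so the queue builds here, not in the modem
tc qdisc add dev ge00 root handle 1: htb default 11
tc class add dev ge00 parent 1: classid 1:11 htb rate 20mbit ceil 20mbit

# SFQ with RED-style per-flow marking/dropping ("SFQRED", Linux 3.3+)
tc qdisc add dev ge00 parent 1:11 handle 110: sfq limit 300 headdrop \
    redflowlimit 100000 min 4500 max 9000 probability 0.02 ecn
```

HTB gives you the tractable shaping; the redflowlimit/min/max/probability/ecn knobs on sfq bound each flow's backlog and start ECN-marking before it drops.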