Working Group Minutes/EWG 2013-07-29

From OpenStreetMap Foundation

Attendees

IRC nick Real name
apmon Kai Krueger
pnorman Paul Norman
zere Matt Amos

Summary

  • ACTION: zere to get the published minutes up to date.
  • Clients which expect a changeset upload to be reflected instantly in API GET responses
    • The issue is that some clients expect that they can perform an upload, then immediately make another request (e.g. map), and be guaranteed to get the latest version of the data.
    • This breaks when there is latency between the upload target and the source of the map response, e.g. a cache or a federated service.
    • There was a discussion about whether this is better handled server-side or client-side, with no clear consensus.
    • It was noted that clients already have access to information which can resolve this problem, i.e. the <diffResult/> response returned by the upload.
  • ci.osm.org appears to be down
    • The continuous integration server is down, but the admins are aware and working to fix it.
    • There were suggestions to increase the size of the build & test farm available to CI, so that it can cover more projects and platforms.
    • The way forward is to reach out to the CI maintainer and offer help / resources to expand.
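
As a rough illustration of the <diffResult/> point in the summary, the following Python sketch shows how a client might update its local state from the upload response instead of re-fetching, and how it might detect a stale map response by comparing version numbers. It is only a sketch under stated assumptions: the function names (parse_diff_result, apply_diff_result, is_stale) and the shape of the local cache (a dict keyed by element type and id) are hypothetical and not taken from any existing editor. The old_id, new_id and new_version attributes follow the API 0.6 diffResult format.

```python
import xml.etree.ElementTree as ET

ELEMENT_TYPES = ("node", "way", "relation")


def parse_diff_result(xml_text):
    """Parse a <diffResult/> document.

    Returns {(type, old_id): (new_id, new_version)}.
    Deleted elements carry no new_id/new_version and map to None.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for el in root:
        key = (el.tag, int(el.get("old_id")))
        if el.get("new_id") is None:
            result[key] = None  # element was deleted by the upload
        else:
            result[key] = (int(el.get("new_id")), int(el.get("new_version")))
    return result


def apply_diff_result(local, diff):
    """Update a hypothetical local cache {(type, id): version} in place."""
    for (typ, old_id), new in diff.items():
        local.pop((typ, old_id), None)  # drop placeholder or old entry
        if new is not None:
            new_id, new_version = new
            local[(typ, new_id)] = new_version


def is_stale(local, map_xml):
    """Return True if a map response holds an older version of any
    element than the client already knows about."""
    root = ET.fromstring(map_xml)
    for el in root:
        if el.tag not in ELEMENT_TYPES:
            continue  # skip <bounds/> and similar
        known = local.get((el.tag, int(el.get("id"))))
        if known is not None and int(el.get("version")) < known:
            return True
    return False
```

Note that is_stale only compares elements which are present in the response. As discussed in the log below, an element which is absent is ambiguous (it may have been deleted by someone else, or the server may not have caught up yet), so this sketch deliberately does not treat absence as staleness.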

IRC Log

17:15:58 <zere> uh, so i wasn't here the last two times, and unfortunately i haven't looked at or written up the minutes...
17:16:04 <zere> so first action is on me:
17:16:15 <zere> #action zere to write up all the minutes before the next meeting
17:16:31 <zere> was there any unfinished business from the last meeting?
17:20:43 <pnorman> well, as is obvious from the minutes... :)
17:22:06 <pnorman> got a link to them? I'd need to skim to say what was outstanding
17:23:29 <zere> raw output from meetbot is in http://matt.dev.openstreetmap.org/osm-ewg/2013/
17:23:49 <pnorman> I've got a new item about clients who expect an instant changeset upload to API get response
17:24:53 <zere> ok, let's start with that
17:25:04 <zere> #topic clients who expect an instant changeset upload to API get response
17:25:26 <pnorman> background: see links in https://github.com/systemed/iD/issues/1646 although this isn't exclusively iD
17:25:58 <zere> oh, this is about clients which ignore <diffResult/> and instead do a new map? call after an upload?
17:26:04 <pnorman> yes
17:26:51 <pnorman> clients which ignore the result and assume that a map? (or other api call) has up to the second data *will* break if they get a backend that is getting data from a read-only slave
17:27:42 <zere> yes, or anything else which is more than a few seconds out of date
17:29:43 <zere> so it would seem that there's a couple of solutions to this. either do it at the server side by forcing synchronous writes to read-only slaves, or by making the client aware of these issues so that it can retry / merge new data when it's not fully up to date.
17:29:58 <pnorman> yes, replication diffs will also cause the problem, and I wouldn't want to say that those two are the only two sources of this problem
17:30:31 <zere> forcing synchronous writes to slaves is definitely going to hurt write performance.
17:30:40 <pnorman> synchronous writes to read-only slaves is out - that reduces our write io capacity, and that's the current bottleneck for the DB
17:31:13 <zere> and i wouldn't want to rule out doing some other form of caching on the server in the future (i.e: tiled map call and friends).
17:31:33 <zere> which pretty much leaves handing the complexity on the client.
17:31:39 <pnorman> It's a client problem, and largely an awareness problem. devs need to be aware that there may be a time lag between an upload and when it makes it to the API they're reading from
17:31:52 <zere> unless there's another solution... did we miss something?
17:34:14 <pnorman> it is really hard to test these kinds of things
17:37:21 <apmon> Is adding support to the protocol, to specify a data element that must be synced in the state for the server to return a valid respons?
17:37:57 <pnorman> apmon: I just woke up, my brain is throwing a parser error at that
17:38:10 <apmon> Well, the thing we discussed last time
17:38:33 <apmon> i.e. the client sends a osm-type, id, version number to the server together with the map request
17:39:03 <apmon> if the server doesn't have that data in its state yet, then it e.g. returns a "too old" response, or forwards it to the master, or does something else sensible with it
17:40:58 <pnorman> I don't like it, I can easily imagine us moving all the map? calls off of ramoth onto smaug
17:42:59 <apmon> how does moving the map calls impact this, other than it becomes more important to find a solution?
17:43:07 <pnorman> well, what do you forward to?
17:43:53 <apmon> I presume the master will always have the ability to reply to map calls, as it might be necessary
17:44:24 <pnorman> i can easily imagine none of the backends being pointed at the postgres master
17:44:25 <zere> for any single type it's already solved, right? i mean for node/#{id} or anything which only contains a single element, the version number already solves this. but it will potentially be a problem for any other call we have
17:44:38 <zere> it's not limited to map? although that might be the most obvious
17:46:02 <pnorman> I suppose if the client has a list of ids and version numbers they can already identify if the map? response is too old using the exact same logic as would be done on the server
17:46:07 <apmon> pnorman: If there isn't a possibility to get the current data, then you simply have to return an error "Data is too old"
17:46:59 <apmon> as someone pointed out last time, it is sufficient to ask for a single element in a diff import, as if that is present and commits are atomic, everything else in that diff import is in the server's state as well
17:47:26 <zere> yeah, but as pnorman says, that logic applies on the client too.
17:48:25 <apmon> yes, although it would simplify the client logic to do it on the server side
17:49:31 <zere> i guess by analogy, it's like a raster tile client passing up some sort of version tuple and demanding a re-render
17:49:52 <zere> we *can* do it, but it really seems like we're setting a bear-trap for later on when we have more clients
17:50:27 <apmon> The more clients there are, the more it makes sense to put it in the server
17:50:30 <pnorman> the only time this seems to come up is if a client ignores the <diffResult/>
17:50:36 <apmon> to not have to duplicate the logic every time
17:51:23 <zere> sorry, no. by that logic, we should have an RPC-type API with calls like "moveNodeTo(lon,lat)"
17:52:55 <zere> actually, since clients have in the past got lat/lon logic the wrong way around, perhaps we should be sending an editor screen as an image with an image map for interaction with the server?
17:53:44 <apmon> Do you want to push things like operational details, e.g. which server has the most up-to-date info out to every client as well?
17:54:10 <pnorman> what if no server has the most up to date info?
17:54:22 <zere> doing this in the server will mean each API request potentially has to do at least one extra DB query, most of which we expect to pass. so it's just adding load.
17:54:33 <apmon> that is an operational detail as well, and doesn't belong in the client
17:55:30 <apmon> if the client needs to fetch a node/id each time, that isn't going to make things better
17:56:09 <zere> in practice, i expect any future caches to be in some sense transparent, like the tile caches. but i think it's inevitable that someone will end up teaching a client about which servers are up-to-date. already happened with t@h, didn't it?
17:56:10 <apmon> and relying on that the info is in the map? call is not sufficient, as someone might have deleted that node by the time the map? call comes
17:56:29 <zere> that's also information.
17:56:45 <apmon> no, t@h did it completely server side
17:57:08 <apmon> it used ha-proxy with health monitoring on database age
17:57:24 <pnorman> I know for what I'm working on the map? call contains all the information available
17:57:35 <apmon> so the servers transparently dropped out of the cluster once they became out of date
17:58:01 <zere> well, *too* out of date - it sounds like there was tolerance for some out of dateness
17:58:07 <apmon> pnorman: even if some one deleted that object in the mean time?
17:58:25 <apmon> yes, and it was about 15 minutes for t@h if I remember correctly
17:58:34 <zere> and i remember people complaining about how sometimes all or almost all of the servers were out of the cluster.
17:58:50 <apmon> that is because it was run on crappy hardware
17:58:58 <apmon> it did remarkably well for what it was run on!
17:59:09 <zere> right, so imagine we're talking about 2-3 seconds instead of 15 minutes. suddenly it becomes a lot harder - especially as many transactions last longer than that
17:59:35 <pnorman> apmon: if it's deleted and the server knows about it, it's not returned in the map call. if it's deleted and the server doesn't know about it, it's not deleted. if it's not deleted, it's not deleted
18:00:18 <apmon> if it isn't deleted and the server doesn't know about it, it appears deleted (non existent) to the client
18:00:32 <zere> pnorman: what apmon means is if you get a map? response without the node. e.g: you upload rev 2, make a map call, someone beats you to it and deletes in rev 3, you get no node (rev 2 or rev 3) in your map? response.
18:01:05 <pnorman> zere: where is the server's replication at?
18:01:27 <zere> pnorman: sorry, i don't understand
18:01:47 <pnorman> zere: to what revision has the server consumed diffs, or consumed postgres replication
18:02:14 <apmon> you will need version 1 nodes. Then you can't tell the difference
18:02:27 <apmon> but perhaps that is an edge case we don't have to worry about too much
18:02:41 <zere> assume for the sake of argument that no replication has yet occurred - all the steps in the above happened fast enough that replication hasn't caught up yet.
18:03:14 <pnorman> ah, I see
18:03:42 <pnorman> In that case, I don't think it's an issue because when it comes time to upload you'll have to resolve it as if someone deleted it while you were editing
18:03:53 <zere> what i was saying before is that the lack of the node is also information - you know it's not the rev 2 you just committed, or the rev 1 which was there before. so you know it's new.
18:04:20 <zere> unless rev 1 was deleted, and you undeleted it. in which case pick a different node ;-)
18:05:00 <apmon> so you need to pick rev 2 or above elements
18:07:15 <apmon> I disagree with that something like this belongs into the client, but if the decision is that it does, then there is probably nothing to discuss for ewg on this?
18:07:36 <pnorman> the solution is to use the <diffResult/>
18:07:50 <pnorman> anyways, apmon had a CI issue from last week
18:09:32 <zere> i think the real solution is to use get-and-watch semantics on tiles, but that's nowhere close to being implementable soon.
18:10:42 <apmon> apart from that ci.osm.org appears down at the moment
18:10:53 <zere> #topic CI issues
18:11:03 <zere> #info ci.osm.org appears to be down
18:11:04 <apmon> my question is can we expand it more such that it becomes more useful to OSM developers
18:11:24 <zere> i shall tap shaunmcdonald on the shoulder about that, i think he runs ci.osm.org
18:11:25 <apmon> i.e. can we help shaun expand the service?
18:12:01 <zere> it is (was) a jenkins server, right?
18:12:06 <apmon> yes
18:12:39 <zere> i'm not really familiar with the setup for that, but i believe it can technically handle any build/test? it's scriptable, right?
18:12:48 <apmon> yes
18:12:52 <apmon> and it does that already
18:13:16 <zere> cool. so it's a matter of getting more projects using it?
18:13:23 <apmon> one thing that I'd like to see (and have discussed with shaun), is to extend it to a build farm
18:13:44 <apmon> i.e. to help verify that e.g. osm2pgsql works on linux, osx and possibly windows
18:13:48 <zere> ah, for the various flavours of distro / mac os, etc...
18:13:57 <apmon> yes
18:14:36 <zere> that shouldn't be a problem... but presumably it gets complicated by things like updates
18:14:55 <apmon> This would be useful for various projects that run native code. E.g. osrm, cgimap, mapnik
18:15:16 <pnorman> would it be suitable for making sure cgimap works with boost back to 1.43?
18:15:45 <apmon> yes, if you have appropriate linux distros in your build farm
18:16:12 <zere> that's what i mean about updates... presumably the only sane way to do it is to have an update cycle which keeps it up-to-date with the distro. if the distro updates boost one day, then it might break the build - and you'd no longer be testing the old version.
18:16:26 <apmon> e.g. ubuntu 12.04 and 10.04 VMs (not sure which versions of boost they actually have)
18:16:39 <pnorman> multiple!
18:16:50 <zere> but then to test the variants of boost, fcgi, pqxx, etc... gets exponentially impossible ;-)
18:17:20 <apmon> I don't think released versions of e.g. ubuntu update packages like boost other than patch bugfixes
18:18:05 <apmon> if you stick to a few common distros, I don't think it is too bad
18:18:21 <zere> sure, but then you're testing 1.48.1-1, 1.48.1-2, 1.48.1-3, so it's the same problem, just in miniature.
18:18:27 <zere> right
18:18:29 <apmon> e.g. RHEL, Ubuntu LTS, the latest fedora and ubuntu, osx and windows
18:19:18 <apmon> if things break between minor distro patch fixes, then the distros screwed up
18:19:34 <zere> so the next question is cost. RHEL (CentOS), Ubuntu, Fedora, are all free. OSX and windows aren't.
18:19:41 <zere> does OSX even run on VMs?
18:20:08 <zere> (because i'm pretty sure it doesn't run on vanilla PC bare metal)
18:20:19 <apmon> not sure, but there might be people who have osx and windows machines that are willing to provide remote login to them to support this
18:20:34 <apmon> this is e.g. how the samba build farm works
18:20:58 <apmon> people just provide ssh access to their machines, and the build farm then schedules builds on them remotely
18:21:38 <apmon> so all of the linux distros can run in VMs on the main server, and the rest are done via remote login to "donated" machines
18:21:38 <zere> seems like it is possible https://www.virtualbox.org/ticket/9388
18:22:27 <zere> ok. is that sort of thing easy to set up? and robust to those machines going away?
18:22:57 <zere> and presumably you'd need root/admin to install packages?
18:24:18 <apmon> I think it isn't too difficult to set up.
18:24:49 <apmon> You only initially need to have root access to install packages. Once you have a build environment I don't think it is necessary
18:25:04 <zere> ok. sounds simple. do you know if shaun needs anything else to make it happen?
18:25:16 <apmon> No I don't
18:25:36 <apmon> I think the last time I spoke to him about this a while ago, he mainly needed time to do it... ;-)
18:25:58 <zere> then i'm sure he'll be thrilled that you're volunteering ;-)
18:25:59 <apmon> but if other people can chime in, it might be easier to achieve
18:26:27 <apmon> I am...
18:26:49 <apmon> As I would highly appreciate that infrastructure for osm2pgsql and mod_tile/renderd
18:27:18 <apmon> I think DennisL also mentioned he might be able to help
18:27:33 <zere> well, perhaps the first step is to ping him and see what he needs and whether there are any tasks which can be done which would help him
18:28:16 <zere> just a pity he logged off 50 minutes ago...
18:29:07 <apmon> Yes, I'll see if I can catch him about this again
18:29:15 <zere> great, thanks!
18:29:22 <zere> #topic AoB
18:29:30 <zere> was there anything else anyone wanted to discuss?
18:29:37 <apmon> It would be nice to be able to eventually send a mail to dev to offer this to a wider osm developer community
18:30:11 <pnorman> I think cgimap /node/# and /nodes?nodes=... are well enough tested to be used on api.osm.org
18:33:47 <pnorman> Not sure where to go from here on those calls. TomH, zere?
18:39:57 <zere> i'm not sure it's worth doing just those two... might get some benefit out of /nodes?nodes=... i suppose.
18:40:52 <apmon> benefits of incremental testing / deploying?
18:41:24 <zere> actually, what am i saying - i want to see cgimap take over (insert evil laugh), so i should be pushing for those to be deployed too!
18:41:34 <apmon> but given that the rest of the test are also nicely progressing? It might not be long to have a more substantial set to deploy?
18:41:36 <zere> i guess first step is to take them out of "experimental"
18:41:58 <zere> shouldn't be too hard. then tag a new version.
18:42:01 <pnorman> at this point I'm finding more oddities in the rails port than cgimap
18:42:33 <pnorman> (e.g. https://github.com/openstreetmap/openstreetmap-website/issues/384)
18:42:36 <apmon> that would be a good first step
18:43:36 <pnorman> Okay, as I complete tests and become confident that calls are okay, I'll do pull requests for moving them out of --enable-experimental?
18:44:30 <zere> yup. that sounds great. thanks!
18:45:10 <pnorman> want to make a branch for 0.2.0 so that any fixes can be backported to it?
18:46:13 <zere> i guess that rather depends. if we need to start backporting, sure. if we don't, i'd rather just get the latest version on osm.org.
18:47:36 <pnorman> I was thinking we could be novel and have some kind of release plan :)
18:49:12 <zere> i'm not sure what would go on it...
18:49:19 <zere> 0.2.0 current release
18:49:23 <zere> 0.3.0 ...???
18:49:28 <zere> 1.0.0 PROFIT!
18:50:00 <pnorman> meeting is probably over? cgimap probably isn't ewg, although it needs to be discussed
18:50:07 <zere> hahaha
18:50:11 <zere> yeah, you're right.
18:50:20 <zere> hope to see everyone next week!