
IRC log for #rest, 2015-03-12

https://trygvis.io/rest-wiki/


All times shown according to UTC.

Time S Nick Message
00:26 mezod_ joined #rest
00:50 vanHoesel joined #rest
01:07 shrink0r joined #rest
01:13 warehouse13 joined #rest
01:16 begriffs joined #rest
01:52 mezod joined #rest
02:51 lemur joined #rest
03:09 ewalti joined #rest
03:39 fumanchu_ joined #rest
04:07 mgomezch joined #rest
04:23 ewalti joined #rest
04:45 _ollie joined #rest
05:03 fumanchu joined #rest
05:05 mgomezch_ joined #rest
05:09 jaawerth_ joined #rest
05:09 zama_ joined #rest
05:50 shrink0r joined #rest
07:28 ewalti joined #rest
07:57 dEPy joined #rest
08:28 vanHoesel joined #rest
08:41 vanHoesel joined #rest
08:48 azer_ joined #rest
09:29 ewalti joined #rest
09:52 shrink0r joined #rest
09:56 vanHoesel joined #rest
10:09 Left_Turn joined #rest
11:29 azr_ joined #rest
11:37 Left_Turn joined #rest
11:43 pdurbin whartung: I miss having you in a channel I log where I can pick your brain about EJB.
11:44 trygvis EJB? people still use that? :)
11:44 pdurbin sigh. what do you use trygvis?
11:45 trygvis often spring and spring-mvc
11:46 pdurbin so you would never use something like javax.ejb.TransactionAttributeType.REQUIRES_NEW
11:47 trygvis we do, but it's all handled by spring
11:48 trygvis but we rarely do that, it usually leads to clusterfucks
11:48 pdurbin hmm. well, starting to use that reduced the time a method took from 3.5 *hours* to 15 *minutes*
11:48 pdurbin here's the commit: https://github.com/IQSS/dataverse/commit/a1d9da4
11:49 pdurbin I also had to put the method I was calling (indexAll) into a new separate bean so there would be an EJB boundary and the @TransactionAttribute(REQUIRES_NEW) annotation would actually have an effect
11:49 pdurbin this is all very strange and mysterious and magical and spooky to me
11:49 pdurbin and whartung usually has good insight on this stuff :)
11:51 trygvis to me the main difference between spring and spec ejb is that you get to move faster, and you don't try to stick to a spec that'll only give you a need for workarounds
11:51 trygvis on the spring side you can end up debugging more stuff, but for us it is worth it
11:52 trygvis we use jpa annotations, but we know we're using hibernate so we sneak in some hibernateisms once in a while
11:52 mezod joined #rest
11:53 pdurbin fair enough, but you haven't heard of this problem? the idea is that the method is building up a huge single transaction and gets slower and slower as it runs. and the fix is to introduce an EJB boundary and annotate the methods called by the main method with "force a new transaction"
11:54 pdurbin indexAll calls indexDataset over and over, for example. so you annotate indexDataset with the "force new transaction" magic
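The fix pdurbin describes above — putting the per-dataset method behind an EJB boundary so @TransactionAttribute(REQUIRES_NEW) takes effect — looks roughly like this. This is a sketch only: the class names, the injected field, and the method bodies are placeholders, not the actual Dataverse code.

```java
import javax.ejb.EJB;
import javax.ejb.Stateless;
import javax.ejb.TransactionAttribute;
import static javax.ejb.TransactionAttributeType.REQUIRES_NEW;

// Sketch: names are hypothetical. The key point is the container
// boundary — REQUIRES_NEW is honored only on calls routed through an
// injected EJB proxy, never on plain this.indexDataset() calls made
// within the same bean.
@Stateless
public class IndexAllServiceBean {

    @EJB
    private IndexServiceBean indexService; // container proxy = EJB boundary

    public void indexAll(Iterable<Long> datasetIds) {
        for (Long id : datasetIds) {
            // each call commits in its own short transaction instead of
            // accumulating ~1600 writes inside one giant transaction
            indexService.indexDataset(id);
        }
    }
}

@Stateless
class IndexServiceBean {
    @TransactionAttribute(REQUIRES_NEW)
    public void indexDataset(Long id) {
        // build the Solr document, write the indexTime timestamp row, etc.
    }
}
```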
11:55 trygvis usually you want to build big transactions, up to a certain size to prevent excessive disk flushing
11:56 trygvis indexAll might sound like something that hits the "up to a certain size" limit
11:56 pdurbin huh. well in this case the big single transaction seemed to be killing performance
11:57 trygvis perhaps you want to configure batch size in your jpa provider
11:57 trygvis are you running out of memory?
11:58 pdurbin not sure. knowing what I know now, that would be a good thing to look at
11:58 trygvis try this: if(dataverseIndexCount % 1000 == ) entityManager.flush();
11:58 trygvis if(dataverseIndexCount % 1000 == 0) entityManager.flush();
11:59 pdurbin flush every so often, you're saying
11:59 trygvis yep
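trygvis's modulo-flush suggestion can be sketched with a stand-in for the real javax.persistence.EntityManager (a hypothetical Flushable interface here), so the batching logic itself is visible and checkable without a JPA provider:

```java
// Sketch of "flush every N writes" — FakeEntityManager-style stub
// instead of a real EntityManager, so this runs standalone.
public class BatchFlushSketch {
    interface Flushable { void flush(); }

    // Returns how many flushes a run of totalDatasets writes triggers.
    static int indexAll(int totalDatasets, int batchSize, Flushable em) {
        int flushes = 0;
        for (int i = 1; i <= totalDatasets; i++) {
            // indexDataset(i) would write the timestamp row here
            if (i % batchSize == 0) {
                em.flush(); // push pending SQL; keeps the tx footprint small
                flushes++;
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        // pdurbin's ~1600 writes with a batch size of 1000 -> 1 mid-run flush
        System.out.println(indexAll(1600, 1000, () -> {}));
    }
}
```

With a real EntityManager you would typically pair each flush() with clear() so detached entities do not keep accumulating in the persistence context.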
11:59 pdurbin is all this stuff covered in the Java EE tutorial? or elsewhere?
11:59 trygvis but, where are you actually using time or cpu? it seems you're reading from a database just to stuff it in a solr index
12:00 trygvis http://stackoverflow.com/questions/9994699/solr-reindex-recommended-batch-size
12:00 trygvis usually you don't want to commit often, only send to the server
12:01 trygvis but you have two methods there that I have no idea what they do; indexDataset() and indexDataverse()
12:01 pdurbin at a high level, yes, the work is to read from postgres and put some data into solr
12:01 trygvis postgresql <3
12:01 trygvis ok, so if you're writing to solr you don't want to flush the entityManager (it doesn't have any changes to flush) but flush solr
12:01 pdurbin :)
12:02 pdurbin oh oh oh, sorry
12:02 pdurbin we do write to postgres too
12:02 trygvis you could possibly also just run without a transaction as you're rebuilding everything
12:02 trygvis s,as,if,
12:03 trygvis it is usually like running with scissors, but if your database is empty it's usually ok :)
12:03 pdurbin we write to postgres the timestamp at which we indexed into solr without error. we store this timestamp so later we can compare it to another timestamp to see if solr has the latest data. to see if solr is in sync with the data in postgres
12:04 trygvis ok, then you want to flush the entity manager for every thousand writes
12:04 pdurbin before the fix the writing of the timestamps to the database was being done in a single transaction and the process/job was getting slower and slower as it ran
12:05 pdurbin that makes sense but this should have only been about 1600 writes in total
12:05 pdurbin so it's shocking that it was taking 3.5 hours
12:06 trygvis but it is strange that adding that tx boundary made it all better, that means that the reindexing process isn't what is causing the badness
12:06 pdurbin oh I suspect there's more badness I haven't found yet
12:06 pdurbin it's all very spooky
12:07 trygvis it could be as easy as trying a flush before you call indexAll()
12:11 pdurbin yeah, could try adding flush
12:15 vanHoesel joined #rest
12:21 interop_madness joined #rest
12:57 vanHoesel joined #rest
13:02 composed joined #rest
13:02 composed Hmm Douglas Crockford says the statelessness of HTTP was its biggest mistake.
13:02 composed And talks about WebSockets with Node.JS
13:04 composed WebSockets are interesting. One sends an HTTP request negotiating to drop to a *lower* level of protocol
13:04 composed Our stack has many warts folks
13:05 asdf` i'm not sure i'd call websockets 'lower level'
13:07 composed asdf`: they are much closer to TCP than HTTP
13:07 composed asdf`: but they're layered on top of HTTP because HTTP is everywhere
13:11 trygvis composed: statelessness of HTTP is one of its biggest features, and required when trying to build "the web"
13:12 trygvis comparing streaming messages from one node to another to http is apples to oranges
13:12 composed trygvis: well maybe it was required in the 80s
13:12 composed trygvis: right now we have movements like "encrypt everything"
13:13 composed trygvis: so little to nothing of "the web" as originally planned gets used, and there are tons of security measures to avoid mixing domains for the rest of it because of security implications
13:13 trygvis if HTTP becomes a problem, some other technology will prevail
13:13 bigbluehat composed: where'd you see the crockford quote?
13:13 bigbluehat sounds like he believes the network is always available
13:14 composed bigbluehat: why would he believe that
13:14 bigbluehat yeah...good question
13:14 trygvis available, reliable, fast, etc
13:14 bigbluehat non-latent
13:14 trygvis shiby!
13:14 composed bigbluehat: so you pose the question and ponder the answer. You strawmanned Crockford :P
13:14 trygvis err, shiny!
13:15 bigbluehat hehe
13:15 composed Using a stateful connection doesn't mean it can't tolerate disruption.
13:15 bigbluehat true.
13:15 composed It just means there's a well identifiable session, and if the session ends, it can start over later
13:16 bigbluehat and you have shared state
13:16 bigbluehat so your state machine has 2 heads
13:16 bigbluehat composed: curious about your "little to nothing of "the web" as originally planned..." comment
13:16 * bigbluehat being a fan of RFC 2068 ;)
13:17 composed bigbluehat: shared state typically refers to some state that multiple entities can mutate independently.
13:17 _ollie what Douglas Crockford effectively says: "HTTPs biggest mistake is that it's not a solution to a problem I have" which is BS :)
13:17 composed Because if shared state simply means "two things know about one thing" then the web is also one giant shared state
13:18 trygvis _ollie: +1!
13:18 vanHoesel joined #rest
13:18 bigbluehat anyone have a URL for these statements by crockford?
13:18 composed _ollie: how'd you implement website login without session cookies, which represent the client emitting state to the server to manage
13:19 composed bigbluehat: one sec
13:19 bigbluehat "URL or it didn't happen" ;)
13:19 trygvis bigbluehat: welcome to #rest :)
13:19 bigbluehat composed: I don't think I'd say that state is shared anywhere on the web
13:19 composed bigbluehat:  minute 41-42 and onward https://www.youtube.com/watch?v=QgwSUtYSUqA
13:19 bigbluehat it's transferred...but not shared
13:19 bigbluehat tnx trygvis :D
13:19 bigbluehat ...been too long
13:20 trygvis you've been missing out on some real trolls
13:20 bigbluehat aw man... ;)
13:20 composed bigbluehat: it comes down to what he says: it's easy to decide "everything is stateless" because it makes it easy to write a standard that's stateless
13:20 composed But it's impossible to write a simple website login without SOME shared state
13:21 composed Like an emitted session id
13:21 _ollie composed: it seems like you misunderstood the statelessness requirement in REST…
13:21 bigbluehat is it?
13:21 composed The trick with designing big networked systems is not to eliminate state but isolate it.
13:22 composed There will be some component that is aware of state. The rest maybe won't be
13:22 composed HTTP says the client can be that component
13:22 bigbluehat hilarious statements by Mr. C.
13:22 composed But truth is the client can't have ALL the state, or basically every service that has a user account can't exist.
13:22 bigbluehat "HTTP was designed completely wrong"
13:22 _ollie REST requires the request to contain all necessary information and the server not magically identifying it by some means, that's all…
13:22 _ollie thus, a cookie is perfectly fine
13:23 composed bigbluehat: question is why. "Hilarity" is not an objective metric for correctness
13:23 bigbluehat no...it's terribly objective ;)
13:23 bigbluehat HTTP doesn't work for the way he wants to write apps
13:23 bigbluehat ...it doesn't mean it was designed "completely wrong"
13:24 bigbluehat it was just designed for a different thing than he's using it for
13:24 composed bigbluehat: no I'm serious, I prefer we don't fall to this immature level of analysis. Let's have some argument from an engineering point of view.
13:24 bigbluehat so *he* wants something else
13:24 bigbluehat aka...it wasn't designed for *him*
13:24 bigbluehat ^^ just did ;)
13:24 composed _ollie: fielding says cookies directly oppose the REST style.
13:25 composed bigbluehat: "I think he's funny" is not an engineering viewpoint I'm afraid. He has very specific arguments about constantly passing back and forth context that doesn't change with stateless designs, which is a real bottleneck
13:26 trygvis composed: if you think you need a session id you're starting off wrong
13:26 composed Ideally you want to be able to start over a session, but there's no need to eliminate sessions entirely.
13:26 _ollie composed: where?
13:26 composed trygvis: ok I'm asking you, how do you do it right
13:26 composed trygvis: how do you log into a site without a session id that you pass back every time
13:27 trygvis again, you're starting off wrong. I don't log into a site
13:27 trygvis I supply credentials on every request (like the spec caters for)
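Supplying credentials on every request, as trygvis describes, is what HTTP Basic auth (RFC 7617) does: the client sends an Authorization header each time instead of holding a server-side session. A minimal sketch of building that header (the user/password values are the RFC's own example, not anything from this discussion):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BasicAuthSketch {
    // Build an RFC 7617 Basic Authorization header value: the client
    // attaches this to every request, so no server-side session is needed.
    static String basicAuthHeader(String user, String password) {
        String token = Base64.getEncoder().encodeToString(
                (user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return "Basic " + token;
    }

    public static void main(String[] args) {
        // RFC 7617's worked example credentials
        System.out.println(basicAuthHeader("Aladdin", "open sesame"));
        // -> Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
    }
}
```

composed's caching objection in the lines that follow applies here: since the header varies per user, shared caches cannot reuse responses unless the server marks public data cacheable separately.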
13:27 composed trygvis: fine, I want to have favorite channels on YouTube without having a YouTube-specific app. How do we do this.
13:28 composed trygvis: if you supply credentials on *every* request, it means you can't cache *any* response at a proxy
13:28 composed Because credentials will be unique
13:28 composed trygvis: furthermore a session id is in fact "credentials"
13:28 composed So it's the same thing by another name
13:29 trygvis if the server wants me to read some shared, public, cacheable data it can point me to another host (realms are per host)
13:29 trygvis you can implement it like that, but it is not the same thing
13:30 composed trygvis: fine, so after all we need to split things by use case, and one of those use cases absolutely needs statefulness
13:31 composed trygvis: and the same use case that requires statefulness (session, credentials, security, domain isolation) is curiously the same exact use case to have an API for. Not many APIs do much useful without credentials.
13:31 composed so what is the bottom line here. Crockford is not so hilarious after all
13:31 trygvis jeez, it's you again
13:32 composed what?
13:37 azr joined #rest
13:53 shrink0r_ joined #rest
14:07 ewalti joined #rest
14:43 nkoza joined #rest
15:07 ewalti joined #rest
15:07 ewalti joined #rest
15:09 ewalti joined #rest
15:25 ewalti joined #rest
15:46 JudasBricot joined #rest
16:05 azr joined #rest
16:20 whartung hey pdurbin
16:28 ewalti joined #rest
16:32 azr joined #rest
16:32 pdurbin whartung: hey. trygvis talked me off the ledge. fun with EJB
16:32 whartung yea I glanced at it but then got hit with TL;DR
16:34 pdurbin in short, I was surprised by the fix
16:34 pdurbin which made things 14 times faster
16:34 whartung care to summarize?
16:34 ewalti joined #rest
16:36 pdurbin basically indexAll was taking 3.5 hours on very little data. only creating ~1600 Solr documents based on data in postgres. as the Solr documents are created we update the database with a timestamp per row
16:37 pdurbin the fix was to put indexAll in a new bean (to create an EJB boundary) and add @TransactionAttribute(REQUIRES_NEW) to the methods that indexAll is calling over and over, such as indexDataset
16:37 pdurbin https://github.com/IQSS/dataverse/commit/a1d9da4
16:37 pdurbin this reduced the time from 3.5 hours to 15 minutes
16:38 pdurbin I guess the thing that disturbs me is that this fix is completely unintuitive to me.
16:38 whartung how did this speed it up? Is Solr part of the transaction?
16:38 pdurbin It's like magic.
16:38 pdurbin I don't know if Solr is part of the transaction.
16:38 pdurbin I think I need to study EJB transactions.
16:38 trygvis pdurbin: I doubt that your solution is a 'correct' fix
16:39 pdurbin ok, let's call it a solution then :)
16:39 trygvis I doubt it is unless you have done some magic to configure solr as a part of your XA
16:39 trygvis it'll float your boat for "a while" :)
16:39 trygvis but I'm out
16:40 whartung yea that doesn't make any sense whatsoever
16:40 whartung where did you get the idea to even try it?
16:40 whartung and I'm skeptical that Solr is transactional
16:41 pdurbin whartung: people here with way more experience with EJB than I have said something along the lines of, "We think indexAll is being treated as a single transaction. Let's add @TransactionAttribute(REQUIRES_NEW) and see if it helps." And it did.
16:42 pdurbin before the solution, you could sort of tell that indexAll was getting slower and slower as it ran
16:42 pdurbin I never witnessed the 3.5 hours it took. Too impatient.
16:46 pdurbin I had figured out that writing those timestamps was slowing things down, but that's about it.
16:46 pdurbin anyway, EJB moves in mysterious ways
16:46 pdurbin and I should probably read a book about it
16:47 whartung well, it doesn't, really. EJB is pretty bone stupid.
16:47 whartung you were only updating 1600 rows?
16:47 whartung Were you relying on the EntityManager to flush the updates?
16:49 pdurbin about 1600 solr documents get created based on fewer rows than that in the database. let's say half, 800 rows in the database
16:50 whartung but you update those rows, right?
16:50 pdurbin right
16:52 whartung and you're just using the entitymanager, right? fetch the entity, change it, and let the EM flush it when it's good and ready?
16:52 pdurbin two timestamps actually. because each row in the database (more or less) becomes two solr documents. so we record a timestamp for each of the two solr documents per row
16:52 whartung do you fetch all of your data upfront?
16:55 pdurbin whartung: yes. I fetch a list of all datasets up front.
16:55 pdurbin then iterate over them
16:55 whartung are they eager? are there any lazy relationships?
16:56 pdurbin I don't know.
16:56 whartung well, do your root dataset rows relate to other rows, to other collections?
16:56 pdurbin to update the timestamp I do use entitymanager. I do an em.merge
17:30 fragamus joined #rest
17:37 azr joined #rest
17:42 shrink0r joined #rest
18:31 pdurbin whartung: I think at some point you said I should read the EJB 3 JSR PDF.
18:32 saml nooooooooooo
18:32 saml EJB
18:32 whartung heh. The JSR is interesting for sure, but it's a bit thick
18:32 saml java
18:32 ewalti joined #rest
18:32 * fumanchu wonders what saml codes in
18:32 pdurbin ok. maybe I'll read the 1000 page Java EE tutorial instead. :)
18:33 saml node.js hehehhehehehehehehehe
18:33 saml don't use node.js
18:33 saml it hurts feelings
18:33 saml do you use undertow?
18:35 whartung Here's my thinking pdurbin
18:35 whartung First, it has nothing to do with Solr
18:35 pdurbin ok, nothing to do with solr. makes sense
18:35 whartung That solr interface is not XA at all, again you can see that by the fact that you manually call commit.
18:35 pdurbin yeah
18:36 pdurbin I mean. it's a web service. It's like calling into the twitter api.
18:36 whartung Not saying XA is impossible over HTTP, but…unlikely for a garden-variety HTTP interface.
18:36 whartung so with joa
18:36 whartung jpa
18:37 whartung when you're doing a bunch of changes, jpa caches its updates in ram.
18:38 whartung nominally, it will flush all of the work on transaction commit.
18:38 pdurbin ok. trygvis was asking if a lot of memory was being used
18:38 whartung but you just said it was only 1600 rows
18:38 whartung "that's nothing(™)"
18:39 whartung sec...
18:42 whartung so
18:42 whartung simple case
18:43 whartung you create 1000 entities, then the transaction commits, and 1000 insert statements flood out to the db server.
18:43 whartung so, over time, a transaction can build up in ram, leaving a footprint.
18:43 whartung but, 1600 rows isn't a lot, typically
18:43 whartung now
18:44 whartung the other time jpa pushes sql to the db is whenever it queries the server.
18:44 whartung so you can load in a list of entities, change the entity, and then access the next one
18:44 whartung but when you access the entity, it has lazy associations
18:44 whartung which causes a new query in the background to hit the db server.
18:45 whartung so when that happens, the pending updates will be flushed first.
18:45 whartung so instead of 1000 inserts at the end of the xtn, you get your sql all mixed up of inserts and selects.
18:46 whartung but even if that happens, while the transaction is open, the internal footprint will grow.
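whartung's point about lazy loads interleaving SQL can be modeled with a toy stand-in for the EntityManager (nothing here is real JPA API; it just mimics the auto-flush-before-query behavior he describes):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model: JPA flushes pending writes before running a query so the
// query sees consistent state, which mixes UPDATEs in among the SELECTs
// that lazy associations trigger — instead of one burst at commit.
public class FlushBeforeQuerySketch {
    final List<String> sqlLog = new ArrayList<>();
    final List<String> pending = new ArrayList<>();

    void update(String entity) { pending.add("UPDATE " + entity); }

    void query(String entity) {
        sqlLog.addAll(pending);         // auto-flush pending DML first
        pending.clear();
        sqlLog.add("SELECT " + entity); // then the lazy-load SELECT runs
    }

    public static void main(String[] args) {
        FlushBeforeQuerySketch em = new FlushBeforeQuerySketch();
        em.update("dataset_1");
        em.query("dataset_2_files"); // lazy association forces a flush
        em.update("dataset_2");
        em.query("dataset_3_files");
        System.out.println(em.sqlLog);
        // UPDATEs and SELECTs arrive at the db interleaved, not batched
    }
}
```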
18:46 whartung are you using postgres?
18:46 pdurbin yes. postgres
18:47 whartung are you updating a single row over and over and over?
18:47 pdurbin not on purpose if I am
18:47 whartung ok
18:47 vanHoesel joined #rest
18:47 pdurbin I mean, the row does get updated twice.
18:47 pdurbin because we store two timestamps
18:48 pdurbin each for a solr document that gets indexed
18:48 whartung its been a while since I tested, but in the past, updating a single row, over and over and over in pg in a single xtn can be slow, because each update creates a new "ghost" row in the DB, and each new update has to crawl that list. So, if you updated a single row 1000 times, you end up having to scan 1000 rows for the next update.
18:48 pdurbin contentIndexTime vs. permissionIndexTime timestamps. two of them. same row
18:48 whartung but doesn't sound like that's happening here.
18:49 * pdurbin is scared of ghost rows
18:49 whartung nah, they go away on commit.
18:49 whartung feature, not a bug.
18:51 pdurbin phew
18:51 whartung So, there's that. That suggests that in the large xtn scenario in your case, the overhead at the JPA/DB level of managing all that change is expensive. This would manifest by a CPU being pegged, when in this case, it shouldn't be -- should be mostly I/O
18:52 whartung all your work appears to be in a single thread, so contention doesn't seem like the issue.
18:52 whartung as a simple test, you can try performing JUST the db operations (skip the solr calls) and see how long it takes.
18:52 whartung 3.5hrs is still a crazy number
18:53 whartung in any case
18:53 pdurbin yeah
18:53 whartung updating 800 rows…big deal
18:53 pdurbin right
18:53 whartung "Oh no, all that data might almost fill a cache line in the CPU!"
18:54 * whartung used to have 88k floppy disks
18:54 whartung so that would be an interesting test
18:56 whartung because all breaking up the xtn is doing is lowering the memory impact of the overall xtn
18:57 pdurbin right. I mean, there are probably many ways to relieve the memory getting eaten up. Not that I've confirmed if it was memory or cpu.
18:57 whartung how big are these documents? is the data in the DB the actual data, or just references to files?
19:00 pdurbin the resulting Solr documents? not all that big I don't think
19:00 whartung the rows in the db
19:00 pdurbin oh, well, for every row we root around all over the db to gather the data required to construct the solr documents
19:01 whartung ok, but you have "800" of them. How much data is one of those "800"
19:02 pdurbin it's hard to answer but let's say not very much
19:04 whartung ok
19:04 whartung so you're not sucking in 1600 1M documents
19:04 whartung sending you GC for a tizzy
19:04 whartung that could be the other thing, need more memory, stuck in GC hell
19:04 whartung be interesting to see the memory usage
19:05 pdurbin the first solr doc I'm looking at is 71 lines of JSON
19:06 whartung ooh
19:06 pdurbin yeah, as I continue to dig into the performance problem I'll look at memory and cpu and whatnot. this was just crazy. the 3.5 hours thing. now down to 15 minutes by adding "require new transactions"
19:06 whartung so basically by breaking up the xtn, each of those documents can come and go one by one rather than being all cached up waiting for the big single commit.
19:07 whartung I'd still do the 'db only' test
19:07 whartung for laffs
19:07 pdurbin take solr out for a bit. i hear ya
19:07 whartung 15m for 800 documents is still kind of crazy, imho
19:07 * pdurbin tells Solr it's not his fault
19:07 * whartung … yet
19:08 pdurbin oh sure. gotta make it faster still
19:09 whartung I would think that solr would index those faster than that
19:10 pdurbin solr is quite nice. I'm sure I'm just doing things wrong
19:12 pdurbin I can't decide if EJB is quite nice. :)
19:13 whartung my one complaint for EJB is that each EJB gets its own, private JNDI tree.
19:13 whartung this may not be a problem with WAR deployments.
19:13 whartung but when you start integrating EJBs from other jars, it's kind of a pain.
19:14 whartung now, they DO have a new, "canonical" JNDI name
19:14 whartung they're just awful names
19:14 whartung seems to me EJBs in a war may have less of an issue with this.
19:15 whartung and if you are using CDI for all your bean injections, it might be less of a problem -- I've not use it to that extent.
19:15 whartung doing that would, ostensibly, solve many ills.
19:16 whartung the JNDI tree is a legacy requirement since EJBs are individual deployable elements.
19:16 whartung bbl afk lunch
19:16 pdurbin bon appetit
19:31 JudasBricot joined #rest
19:34 shrink0r joined #rest
20:42 vanHoesel joined #rest
20:56 graste joined #rest
21:12 vanHoesel joined #rest
22:00 composed joined #rest
22:10 vanHoesel joined #rest
22:25 talios joined #rest
22:28 vanHoesel joined #rest
22:33 trygvis whartung: postgres has gotten a nice optimization when non-indexed fields are updated called HOT
22:33 trygvis Heap Only Tuples
22:33 whartung yeaok
22:33 trygvis dunno how it will satisfy MVCC at the same time, but anyway
22:34 trygvis it also seems quite old: http://www.postgresql.org/message-id/27473.1189896544@sss.pgh.pa.us
22:34 whartung MVCC is mostly about locking
22:34 trygvis yes, but when a tx is updating a row I can't remember how postgresql does it. if it locks the row or not
22:36 trygvis anyway, I'm off for tonight. later
22:36 composed If it's an atomic operation it's by definition locked for a moment
22:36 composed While updated
22:37 whartung tt trygvis
22:39 trygvis composed: no, it's not. but any other tx that wrote to it can't complete unless the earlier tx fails
22:41 trygvis whartung: enjoy the troll
22:41 * trygvis is out for real now!
22:42 pdurbin trolling and running. I see how it is :)
22:49 warehouse13 joined #rest
23:17 _ollie joined #rest
23:18 vanHoesel joined #rest
23:37 rhyselsmore joined #rest
23:37 rhyselsmore joined #rest
23:42 vanHoesel joined #rest

