IRC log for #rest, 2015-03-12

https://trygvis.io/rest-wiki/

All times shown according to UTC.

Time	Nick	Message
00:26		mezod_ joined #rest
00:50		vanHoesel joined #rest
01:07		shrink0r joined #rest
01:13		warehouse13 joined #rest
01:16		begriffs joined #rest
01:52		mezod joined #rest
02:51		lemur joined #rest
03:09		ewalti joined #rest
03:39		fumanchu_ joined #rest
04:07		mgomezch joined #rest
04:23		ewalti joined #rest
04:45		_ollie joined #rest
05:03		fumanchu joined #rest
05:05		mgomezch_ joined #rest
05:09		jaawerth_ joined #rest
05:09		zama_ joined #rest
05:50		shrink0r joined #rest
07:28		ewalti joined #rest
07:57		dEPy joined #rest
08:28		vanHoesel joined #rest
08:41		vanHoesel joined #rest
08:48		azer_ joined #rest
09:29		ewalti joined #rest
09:52		shrink0r joined #rest
09:56		vanHoesel joined #rest
10:09		Left_Turn joined #rest
11:29		azr_ joined #rest
11:37		Left_Turn joined #rest
11:43	pdurbin	whartung: I miss having you in a channel I log where I can pick your brain about EJB.
11:44	trygvis	EJB? people still use that? :)
11:44	pdurbin	sigh. what do you use trygvis?
11:45	trygvis	often spring and spring-mvc
11:46	pdurbin	so you would never use something like javax.ejb.TransactionAttributeType.REQUIRES_NEW
11:47	trygvis	we do, but it's all handled by spring
11:48	trygvis	but we rarely do that, it usually leads to clusterfucks
11:48	pdurbin	hmm. well, starting to use that reduced the time a method took from 3.5 hours to 15 minutes
11:48	pdurbin	here's the commit: https://github.com/IQSS/dataverse/commit/a1d9da4
11:49	pdurbin	I also had to put the method I was calling (indexAll) into a new separate bean so there would be an EJB boundary so the @TransactionAttribute(REQUIRES_NEW) annotation actually has an effect
11:49	pdurbin	this is all very strange and mysterious and magical and spooky to me
11:49	pdurbin	and whartung usually has good insight on this stuff :)
11:51	trygvis	to me the main difference between spring and spec ejb is that you get to move faster, and you don't try to stick to a spec that'll only give you a need for workarounds
11:51	trygvis	on the spring side you can end up debugging more stuff, but for us it is worth it
11:52	trygvis	we use jpa annotations, but we know we're using hibernate so we sneak in some hibernateisms once in a while
11:52		mezod joined #rest
11:53	pdurbin	fair enough, but you haven't heard of this problem? the idea is that the method is building up a huge single transaction and gets slower and slower as it runs. and the fix is to introduce an EJB boundary and annotate the methods called by the main method with "force a new transaction"
11:54	pdurbin	indexAll calls indexDataset over and over, for example. so you annotate indexDataset with the "force new transaction" magic
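
A minimal plain-Java sketch of why the "EJB boundary" matters, assuming the usual explanation that a container can only apply transaction attributes to calls that pass through its proxy. This is an analogy built on java.lang.reflect.Proxy, not EJB code and not the code from the commit; BoundaryDemo, Indexer, PlainIndexer and interceptedCalls are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

class BoundaryDemo {
    // Hypothetical stand-in for a bean's business interface.
    interface Indexer {
        void indexAll(Indexer worker, int n);
        void indexOne(int id);
    }

    static class PlainIndexer implements Indexer {
        @Override
        public void indexAll(Indexer worker, int n) {
            for (int i = 0; i < n; i++) {
                worker.indexOne(i); // uses whatever reference the caller handed in
            }
        }

        @Override
        public void indexOne(int id) {
            // pretend to index one dataset
        }
    }

    // Counts the indexOne calls that the stand-in "container" got to see.
    static final AtomicInteger intercepted = new AtomicInteger();

    // Wraps a bean in a dynamic proxy, roughly as a container would.
    static Indexer containerProxy(Indexer target) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.getName().equals("indexOne")) {
                // A real container would begin and commit a transaction around this call.
                intercepted.incrementAndGet();
            }
            return method.invoke(target, args);
        };
        return (Indexer) Proxy.newProxyInstance(
                Indexer.class.getClassLoader(), new Class<?>[] {Indexer.class}, handler);
    }

    // throughProxy == false models a same-bean call (like this.indexOne(i));
    // throughProxy == true models a call through an injected bean reference.
    static int interceptedCalls(boolean throughProxy, int n) {
        intercepted.set(0);
        PlainIndexer bean = new PlainIndexer();
        Indexer proxy = containerProxy(bean);
        bean.indexAll(throughProxy ? proxy : bean, n);
        return intercepted.get();
    }
}
```

With the raw object as the worker, none of the calls are intercepted; with the proxy as the worker, every call is. That is the sense in which an annotation on a method called from within the same bean has nothing to hook into.
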
11:55	trygvis	usually you want to build big transactions, up to a certain size to prevent excessive disk flushing
11:56	trygvis	indexAll might sound like something that hits the "up to a certain size" limit
11:56	pdurbin	huh. well in this case the big single transaction seemed to be killing performance
11:57	trygvis	perhaps you want to configure batch size in your jpa provider
11:57	trygvis	are you running out of memory?
11:58	pdurbin	not sure. knowing what I know now, that would be a good thing to look at
11:58	trygvis	try this: if(dataverseIndexCount % 1000 == ) entityManager.flush();
11:58	trygvis	if(dataverseIndexCount % 1000 == 0) entityManager.flush();
11:59	pdurbin	flush every so often, you're saying
11:59	trygvis	yep
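
The "flush every so often" idea can be sketched roughly as below. This is a self-contained sketch under assumptions, not the project's code: Flusher is a hypothetical stand-in for the flush() method of a JPA EntityManager, and the counter starts at 1 so the first flush happens after a full batch rather than on the very first item.

```java
import java.util.ArrayList;
import java.util.List;

class PeriodicFlush {
    // Hypothetical stand-in for javax.persistence.EntityManager; only flush() is modeled.
    interface Flusher {
        void flush();
    }

    // Walks through `total` items and flushes after every `batchSize`-th one.
    // Returns the 1-based item counts at which a flush happened.
    static List<Integer> run(int total, int batchSize, Flusher em) {
        List<Integer> flushedAt = new ArrayList<>();
        for (int count = 1; count <= total; count++) {
            // ... update one entity here ...
            if (count % batchSize == 0) {
                em.flush(); // push pending changes to the database; this is not a commit
                flushedAt.add(count);
            }
        }
        return flushedAt;
    }
}
```

Note that a flush sends the pending SQL but leaves the transaction open, so with roughly 1600 writes and a batch size of 1000 this would flush once mid-run and leave the rest for the commit.
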
11:59	pdurbin	is all this stuff covered in the Java EE tutorial? or elsewhere?
11:59	trygvis	but, where are you actually using time or cpu? it seems you're reading from a database just to stuff it in a solr index
12:00	trygvis	http://stackoverflow.com/questions/9994699/solr-reindex-recommended-batch-size
12:00	trygvis	usually you don't want to commit often, only send to the server
12:01	trygvis	but you have two methods there that I have no idea what they do; indexDataset() and indexDataverse()
12:01	pdurbin	at a high level, yes, the work is to read from postgres and put some data into solr
12:01	trygvis	postgresql <3
12:01	trygvis	ok, so if you're writing to solr you don't want to flush the entityManager (it doesn't have any changes to flush) but flush solr
12:01	pdurbin	:)
12:02	pdurbin	oh oh oh, sorry
12:02	pdurbin	we do write to postgres too
12:02	trygvis	you could possibly also just run without a transaction as you're rebuilding everything
12:02	trygvis	s,as,if,
12:03	trygvis	it is usually like running with scissors, but if your database is empty it's usually ok :)
12:03	pdurbin	we write to postgres the timestamp at which we indexed into solr without error. we store this timestamp so later we can compare it to another timestamp to see if solr has the latest data. to see if solr is in sync with the data in postgres
12:04	trygvis	ok, then you want to flush the entity manager every thousand writes
12:04	pdurbin	before the fix the writing of the timestamps to the database was being done in a single transaction and the process/job was getting slower and slower as it ran
12:05	pdurbin	that makes sense but this should have only been about 1600 writes in total
12:05	pdurbin	so it's shocking that it was taking 3.5 hours
12:06	trygvis	but it is strange that adding that tx boundary made it all better, that means that the reindexing process isn't what is causing the badness
12:06	pdurbin	oh I suspect there's more badness I haven't found yet
12:06	pdurbin	it's all very spooky
12:07	trygvis	it could be as easy as doing a flush before you call indexAll()
12:11	pdurbin	yeah, could try adding flush
12:15		vanHoesel joined #rest
12:21		interop_madness joined #rest
12:57		vanHoesel joined #rest
13:02		composed joined #rest
13:02	composed	Hmm Douglas Crockford says the statelessness of HTTP was its biggest mistake.
13:02	composed	And talks about WebSockets with Node.JS
13:04	composed	WebSockets are interesting. One sends an HTTP request negotiating to drop to a lower level of protocol
13:04	composed	Our stack has many warts folks
13:05	asdf`	i'm not sure i'd call websockets 'lower level'
13:07	composed	asdf`: they are much closer to TCP than HTTP
13:07	composed	asdf`: but they're layered on top of HTTP because HTTP is everywhere
13:11	trygvis	composed: statelessness of HTTP is one of its biggest features, and required when trying to build "the web"
13:12	trygvis	comparing streaming messages from one node to another to http is apples to oranges
13:12	composed	trygvis: well maybe it was required in the 80s
13:12	composed	trygvis: right now we have movements like "encrypt everything"
13:13	composed	trygvis: so little to nothing of "the web" as originally planned gets used, and there are tons of security measures to avoid mixing domains for the rest of it because of security implications
13:13	trygvis	if HTTP becomes a problem, some other technology will prevail
13:13	bigbluehat	composed: where'd you see the crockford quote?
13:13	bigbluehat	sounds like he believes the network is always available
13:14	composed	bigbluehat: why would he believe that
13:14	bigbluehat	yeah...good question
13:14	trygvis	available, reliable, fast, etc
13:14	bigbluehat	non-latent
13:14	trygvis	shiby!
13:14	composed	bigbluehat: so you pose the question and ponder the answer. You strawmanned Crockford :P
13:14	trygvis	err, shiny!
13:15	bigbluehat	hehe
13:15	composed	Using a stateful connection doesn't mean it can't tolerate disruption.
13:15	bigbluehat	true.
13:15	composed	It just means there's a well identifiable session, and if the session ends, it can start over later
13:16	bigbluehat	and you have shared state
13:16	bigbluehat	so your state machine has 2 heads
13:16	bigbluehat	composed: curious about your "little to nothing of "the web" as originally planned..." comment
13:16	* bigbluehat	being a fan of RFC 2068 ;)
13:17	composed	bigbluehat: shared state typically refers to some state that multiple entities can mutate independently.
13:17	_ollie	what Douglas Crockford effectively says: "HTTP's biggest mistake is that it's not a solution to a problem I have" which is BS :)
13:17	composed	Because if shared state simply means "two things know about one thing" then the web is also one giant shared state
13:18	trygvis	_ollie: +1!
13:18		vanHoesel joined #rest
13:18	bigbluehat	anyone have a URL for these statements by crockford?
13:18	composed	_ollie: how'd you implement website login without session cookies, which represent the client emitting state to the server to manage
13:19	composed	bigbluehat: one sec
13:19	bigbluehat	"URL or it didn't happen" ;)
13:19	trygvis	bigbluehat: welcome to #rest :)
13:19	bigbluehat	composed: I don't think I'd say that state is shared anywhere on the web
13:19	composed	bigbluehat: minute 41-42 and onward https://www.youtube.com/watch?v=QgwSUtYSUqA
13:19	bigbluehat	it's transferred...but not shared
13:19	bigbluehat	tnx trygvis :D
13:19	bigbluehat	...been too long
13:20	trygvis	you've been missing out on some real trolls
13:20	bigbluehat	aw man... ;)
13:20	composed	bigbluehat: it comes down to what he says: it's easy to decide "everything is stateless" because it makes it easy to write a standard that's stateless
13:20	composed	But it's impossible to write a simple website login without SOME shared state
13:21	composed	Like an emitted session id
13:21	_ollie	composed: it seems like you misunderstood the statelessness requirement in REST…
13:21	bigbluehat	is it?
13:21	composed	The trick with designing big networked systems is not to eliminate state but isolate it.
13:22	composed	There will be some component that is aware of state. The rest maybe won't be
13:22	composed	HTTP says the client can be that component
13:22	bigbluehat	hilarious statements by Mr. C.
13:22	composed	But truth is the client can't have ALL the state, or basically every service that has a user account can't exist.
13:22	bigbluehat	"HTTP was designed completely wrong"
13:22	_ollie	REST requires the request to contain all necessary information and the server not magically identifying it by some means, that's all…
13:22	_ollie	thus, a cookie is perfectly fine
13:23	composed	bigbluehat: question is why. "Hilarity" is not an objective metric for correctness
13:23	bigbluehat	no...it's terribly objective ;)
13:23	bigbluehat	HTTP doesn't work for the way he wants to write apps
13:23	bigbluehat	...it doesn't mean it was designed "completely wrong"
13:24	bigbluehat	it was just designed for a different thing than he's using it for
13:24	composed	bigbluehat: no I'm serious, I prefer we don't fall to this immature level of analysis. Let's have some argument from an engineering point of view.
13:24	bigbluehat	so he wants something else
13:24	bigbluehat	aka...it wasn't designed for him
13:24	bigbluehat	^^ just did ;)
13:24	composed	_ollie: fielding says cookies directly oppose the REST style.
13:25	composed	bigbluehat: "I think he's funny" is not an engineering viewpoint I'm afraid. He has very specific arguments about constantly passing back and forth context that doesn't change with stateless designs, which is a real bottleneck
13:26	trygvis	composed: if you think you need a session id you're starting off wrong
13:26	composed	Ideally you want to be able to start over a session, but there's no need to eliminate sessions entirely.
13:26	_ollie	composed: where?
13:26	composed	trygvis: ok I'm asking you, how do you do it right
13:26	composed	trygvis: how do you log into a site without a session id that you pass back every time
13:27	trygvis	again, you're starting off wrong. I don't log into a site
13:27	trygvis	I supply credentials on every request (like the spec caters for)
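
For illustration, here is one way "credentials on every request" can look, assuming the HTTP Basic scheme; the speaker does not say which scheme he has in mind, and BasicAuthHeader is a hypothetical helper name. It only builds the header value that a client would attach to each request.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class BasicAuthHeader {
    // Builds the value of an HTTP "Authorization" header for the Basic scheme:
    // "Basic " followed by base64(user + ":" + password).
    static String value(String user, String password) {
        String pair = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }
}
```

The server needs no memory of earlier requests to check this header, which is the point being made: each request carries everything required to authenticate it.
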
13:27	composed	trygvis: fine, I want to have favorite channels on YouTube without having a YouTube-specific app. How do we do this.
13:28	composed	trygvis: if you supply credentials on every request, it means you can't cache any response at a proxy
13:28	composed	Because credentials will be unique
13:28	composed	trygvis: furthermore a session id is in fact "credentials"
13:28	composed	So it's the same thing by another name
13:29	trygvis	if the server wants me to read some shared, public, cacheable data it can point me to another host (realms are per host)
13:29	trygvis	you can implement it like that, but it is not the same thing
13:30	composed	trygvis: fine, so after all we need to split things by use case, and one of those use cases absolutely needs statefulness
13:31	composed	trygvis: and the same use case that requires statefulness (session, credentials, security, domain isolation) is curiously the same exact use case to have an API for. Not many APIs do much useful without credentials.
13:31	composed	so what is the bottom line here. Crockford is not so hilarious after all
13:31	trygvis	jez, it's you again
13:32	composed	what?
13:37		azr joined #rest
13:53		shrink0r_ joined #rest
14:07		ewalti joined #rest
14:43		nkoza joined #rest
15:07		ewalti joined #rest
15:07		ewalti joined #rest
15:09		ewalti joined #rest
15:25		ewalti joined #rest
15:46		JudasBricot joined #rest
16:05		azr joined #rest
16:20	whartung	hey pdurbin
16:28		ewalti joined #rest
16:32		azr joined #rest
16:32	pdurbin	whartung: hey. trygvis talked me off the ledge. fun with EJB
16:32	whartung	yea I glanced at it but then got hit with TL;DR
16:34	pdurbin	in short, I was surprised by the fix
16:34	pdurbin	which made things 14 times faster
16:34	whartung	care to summarize?
16:34		ewalti joined #rest
16:36	pdurbin	basically indexAll was taking 3.5 hours on very little data. only creating ~1600 Solr documents based on data in postgres. as the Solr documents are created we update the database with a timestamp per row
16:37	pdurbin	the fix was to put indexAll in a new bean (to create an EJB boundary) and add @TransactionAttribute(REQUIRES_NEW) to the methods that indexAll is calling over and over, such as indexDataset
16:37	pdurbin	https://github.com/IQSS/dataverse/commit/a1d9da4
16:37	pdurbin	this reduced the time from 3.5 hours to 15 minutes
16:38	pdurbin	I guess the thing that disturbs me is that this fix is completely unintuitive to me.
16:38	whartung	how did this speed it up? Is Solr part of the transaction?
16:38	pdurbin	It's like magic.
16:38	pdurbin	I don't know if Solr is part of the transaction.
16:38	pdurbin	I think I need to study EJB transactions.
16:38	trygvis	pdurbin: I doubt that your solution is a 'correct' fix
16:39	pdurbin	ok, let's call it a solution then :)
16:39	trygvis	I doubt it is unless you have done some magic to configure solr as a part of your XA
16:39	trygvis	it'll float your boat for "a while" :)
16:39	trygvis	but I'm out
16:40	whartung	yea that doesn't make any sense whatsoever
16:40	whartung	where did you get the idea to even try it?
16:40	whartung	and I'm skeptical that Solr is transactional
16:41	pdurbin	whartung: people here with way more experience with EJB than I have said something along the lines of, "We think indexAll is being treated as a single transaction. Let's add @TransactionAttribute(REQUIRES_NEW) and see if it helps." And it did.
16:42	pdurbin	before the solution, you could sort of tell that indexAll was getting slower and slower as it ran
16:42	pdurbin	I never witnessed the 3.5 hours it took. Too impatient.
16:46	pdurbin	I had figured out that writing those timestamps was slowing things down, but that's about it.
16:46	pdurbin	anyway, EJB moves in mysterious ways
16:46	pdurbin	and I should probably read a book about it
16:47	whartung	well, it doesn't, really. EJB is pretty bone stupid.
16:47	whartung	you were only updating 1600 rows?
16:47	whartung	Were you relying on the EntityManager to flush the updates?
16:49	pdurbin	about 1600 solr documents get created based on fewer rows than that in the database. let's say half, 800 rows in the database
16:50	whartung	but you update those rows, right?
16:50	pdurbin	right
16:52	whartung	and you're just using the entitymanager, right? fetch the entity, change it, and let the EM flush it when it's good and ready?
16:52	pdurbin	two timestamps actually. because each row in the database (more or less) becomes two solr documents. so we record a timestamp for each of the two solr documents per row
16:52	whartung	do you fetch all of your data upfront?
16:55	pdurbin	whartung: yes. I fetch a list of all datasets up front.
16:55	pdurbin	then iterate over them
16:55	whartung	are they eager? are there any lazy relationships?
16:56	pdurbin	I don't know.
16:56	whartung	well, do your root dataset rows relate to other rows, to other collections?
16:56	pdurbin	to update the timestamp I do use entitymanager. I do an em.merge
17:30		fragamus joined #rest
17:37		azr joined #rest
17:42		shrink0r joined #rest
18:31	pdurbin	whartung: I think at some point you said I should read the EJB 3 JSR PDF.
18:32	saml	nooooooooooo
18:32	saml	EJB
18:32	whartung	heh. The JSR is interesting for sure, but it's a bit thick
18:32	saml	java
18:32		ewalti joined #rest
18:32	* fumanchu	wonders what saml codes in
18:32	pdurbin	ok. maybe I'll read the 1000 page Java EE tutorial instead. :)
18:33	saml	node.js hehehhehehehehehehehe
18:33	saml	don't use node.js
18:33	saml	it hurts feelings
18:33	saml	do you use undertow?
18:35	whartung	Here's my thinking pdurbin
18:35	whartung	First, it has nothing to do with Solr
18:35	pdurbin	ok, nothing to do with solr. makes sense
18:35	whartung	That solr interface is not XA at all, again you can see that by the fact that you manually call commit.
18:35	pdurbin	yeah
18:36	pdurbin	I mean. it's a web service. It's like calling into the twitter api.
18:36	whartung	Not saying XA is impossible over HTTP, but…unlikely for a garden-variety HTTP interface.
18:36	whartung	so with joa
18:36	whartung	jpa
18:37	whartung	when you're doing a bunch of changes, jpa caches its updates in ram.
18:38	whartung	nominally, it will flush all of the work on transaction commit.
18:38	pdurbin	ok. trygvis was asking if a lot of memory was being used
18:38	whartung	but you just said it was only 1600 rows
18:38	whartung	"that's nothing(™)"
18:39	whartung	sec...
18:42	whartung	so
18:42	whartung	simple case
18:43	whartung	you create 1000 entities, then the transaction commits, and 1000 insert statements flood out to the db server.
18:43	whartung	so, over time, a transaction can build up in ram, leaving a footprint.
18:43	whartung	but, 1600 rows isn't a lot, typically
18:43	whartung	now
18:44	whartung	the other time jpa pushes sql to the db is whenever it queries the server.
18:44	whartung	so you can load in a list of entities, change the entity, and then access the next one
18:44	whartung	but when you access the entity, it has lazy associations
18:44	whartung	which causes a new query in the background to hit the db server.
18:45	whartung	so when that happens, the pending updates will be flushed first.
18:45	whartung	so instead of 1000 inserts at the end of the xtn, you get your sql all mixed up of inserts and selects.
18:46	whartung	but even if that happens, while the transaction is open, the internal footprint will grow.
18:46	whartung	are you using postgres?
18:46	pdurbin	yes. postgres
18:47	whartung	are you updating a single row over and over and over?
18:47	pdurbin	not on purpose if I am
18:47	whartung	ok
18:47		vanHoesel joined #rest
18:47	pdurbin	I mean, the row does get updated twice.
18:47	pdurbin	because we store two timestamps
18:48	pdurbin	each for a solr document that gets indexed
18:48	whartung	it's been a while since I tested, but in the past, updating a single row over and over and over in pg in a single xtn could be slow, because each update creates a new "ghost" row in the DB, and each new update has to crawl that list. So, if you updated a single row 1000 times, you end up having to scan 1000 rows for the next update.
18:48	pdurbin	contentIndexTime vs. permissionIndexTime timestamps. two of them. same row
18:48	whartung	but doesn't sound like that's happening here.
18:49	* pdurbin	is scared of ghost rows
18:49	whartung	nah, they go away on commit.
18:49	whartung	feature, not a bug.
18:51	pdurbin	phew
18:51	whartung	So, there's that. That suggests that in the large xtn scenario in your case, the overhead at the JPA/DB level of managing all that change is expensive. This would manifest as a CPU being pegged, when in this case, it shouldn't be -- should be mostly I/O
18:52	whartung	all your work appears to be in a single thread, so contention doesn't seem like the issue.
18:52	whartung	as a simple test, you can try performing JUST the db operations (skip the solr calls) and see how long it takes.
18:52	whartung	3.5hrs is still a crazy number
18:53	whartung	in any case
18:53	pdurbin	yeah
18:53	whartung	updating 800 rows…big deal
18:53	pdurbin	right
18:53	whartung	"Oh no, all that data might almost fill a cache line in the CPU!"
18:54	* whartung	used to have 88k floppy disks
18:54	whartung	so that would be an interesting test
18:56	whartung	because all breaking up the xtn is doing is lowering the memory impact of the overall xtn
18:57	pdurbin	right. I mean, there are probably many ways to relieve the memory getting eaten up. Not that I've confirmed if it was memory or cpu.
18:57	whartung	how big are these documents? is the data in the DB the actual data, or just references to files?
19:00	pdurbin	the resulting Solr documents? not all that big I don't think
19:00	whartung	the rows in the db
19:00	pdurbin	oh, well, for every row we root around all over the db to gather the data required to construct the solr documents
19:01	whartung	ok, but you have "800" of them. How much data is one of those "800"
19:02	pdurbin	it's hard to answer but let's say not very much
19:04	whartung	ok
19:04	whartung	so you're not sucking in 1600 1M documents
19:04	whartung	sending your GC into a tizzy
19:04	whartung	that could be the other thing, need more memory, stuck in GC hell
19:04	whartung	be interesting to see the memory usage
19:05	pdurbin	the first solr doc I'm looking at is 71 lines of JSON
19:06	whartung	ooh
19:06	pdurbin	yeah, as I continue to dig into the performance problem I'll look at memory and cpu and whatnot. this was just crazy. the 3.5 hours thing. now down to 15 minutes by adding "require new transactions"
19:06	whartung	so basically by breaking up the xtn, each of those documents can come and go one by one rather than being all cached up waiting for the big single commit.
19:07	whartung	I'd still do the 'db only' test
19:07	whartung	for laffs
19:07	pdurbin	take solr out for a bit. i hear ya
19:07	whartung	15m for 800 documents is still kind of crazy, imho
19:07	* pdurbin	tells Solr it's not his fault
19:07	* whartung	… yet
19:08	pdurbin	oh sure. gotta make it faster still
19:09	whartung	I would think that solr would index those faster than that
19:10	pdurbin	solr is quite nice. I'm sure I'm just doing things wrong
19:12	pdurbin	I can't decide if EJB is quite nice. :)
19:13	whartung	my one complaint for EJB is that each EJB gets its own, private JNDI tree.
19:13	whartung	this may not be a problem with WAR deployments.
19:13	whartung	but when you start integrating EJBs from other jars, it's kind of a pain.
19:14	whartung	now, they DO have a new, "canonical" JNDI name
19:14	whartung	they're just awful names
19:14	whartung	seems to me EJBs in a war may have less of an issue with this.
19:15	whartung	and if you are using CDI for all your bean injections, it might be less of a problem -- I've not used it to that extent.
19:15	whartung	doing that would, ostensibly, solve many ills.
19:16	whartung	the JNDI tree is a legacy requirement since EJBs are individual deployable elements.
19:16	whartung	bbl afk lunch
19:16	pdurbin	bon appetit
19:31		JudasBricot joined #rest
19:34		shrink0r joined #rest
20:42		vanHoesel joined #rest
20:56		graste joined #rest
21:12		vanHoesel joined #rest
22:00		composed joined #rest
22:10		vanHoesel joined #rest
22:25		talios joined #rest
22:28		vanHoesel joined #rest
22:33	trygvis	whartung: postgres has gotten a nice optimization when non-indexed fields are updated called HOT
22:33	trygvis	Heap Only Tuples
22:33	whartung	yeaok
22:33	trygvis	dunno how it will satisfy MVCC at the same time, but anyway
22:34	trygvis	it also seems quite old: http://www.postgresql.org/message-id/27473.1189896544@sss.pgh.pa.us
22:34	whartung	MVCC is mostly about locking
22:34	trygvis	yes, but when a tx is updating a row I can't remember how postgresql does it. if it locks the row or not
22:36	trygvis	anyway, I'm off for tonight. later
22:36	composed	If it's an atomic operation it's by definition locked for a moment
22:36	composed	While updated
22:37	whartung	tt trygvis
22:39	trygvis	composed: no, it's not. but any other tx that wrote to it can't complete unless the earlier tx fails
22:41	trygvis	whartung: enjoy the troll
22:41	* trygvis	is out for real now!
22:42	pdurbin	trolling and running. I see how it is :)
22:49		warehouse13 joined #rest
23:17		_ollie joined #rest
23:18		vanHoesel joined #rest
23:37		rhyselsmore joined #rest
23:37		rhyselsmore joined #rest
23:42		vanHoesel joined #rest
