(note: even if you aren't in edev, this might be an interesting read, as it has some stats on what causes some lag, and on user searching).

Looking for ways to optimize E2, I turned my attention to the user search. This is going to be kind of a hack of the way we do things, so I wanted to put this out for ideas from both edev and the higher-ups, since I might be missing things in my evaluation. User searches are one of the biggest eaters of SQL time we have here at E2, so I set about figuring out what the deal is with them.

To start my study on user searches, I took profiling data over roughly two weeks. Data was collected in a way that maintains privacy; only the search types and who was being searched (to better predict cache invalidation rates) were recorded. This is the data that was collected:

User search stats for a two week period:

47421 total searches

  • title DESC: 32 (0.07% of search space)
  • title ASC: 214 (0.45% of search space)
  • title: 36 (0.08% of search space)
  • createtime ASC: 739 (1.56% of search space)
  • reputation DESC: 2777 (5.86% of search space)
  • reputation ASC: 2004 (4.23% of search space)
  • createtime DESC: 41619 (87.76% of search space)
(This was done by dumpsearchstats.pl, a script run on the db server.)
We can obviously see that 88% of searches are "createtime DESC", which, as we all know, is the default user search when you click on a homenode. This brings up two questions.
  1. Is this the smartest type of search we could be doing?
  2. Are there ways that we could cache this data, seeing as it does not change very often?




SQL timing:
Well, let's do a user search on me and see:
    SELECT * FROM node WHERE author_user=459692 AND type_nodetype=117 ORDER BY node_id DESC\g

411 rows in set (0.02-0.04 sec)



    SELECT * FROM node WHERE author_user=459692 AND type_nodetype=117 ORDER BY createtime DESC\g

411 rows in set (0.02-0.04 sec)



It seems the searches are about even; ten runs of each yielded about the same results. Because user searches have to sort through large amounts of data, they can be troublesome in laggy conditions. We'd rather shift the weight of the computation over to the web server machine, so let's look at caching.
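If you want to reproduce the timings yourself, a quick DBI harness along these lines will do it. This is only a sketch: the DSN, username, and password are placeholders, not the real db box settings.

    #!/usr/bin/perl -w
    # Rough timing harness for comparing the two ORDER BY variants.
    # The DSN and login below are placeholders -- adjust for the real db server.
    use strict;
    use DBI;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $dbh = DBI->connect('DBI:mysql:database=everything;host=localhost',
                           'everyuser', 'secret', { RaiseError => 1 });

    my @queries = (
        'SELECT * FROM node WHERE author_user=459692 AND type_nodetype=117 ORDER BY node_id DESC',
        'SELECT * FROM node WHERE author_user=459692 AND type_nodetype=117 ORDER BY createtime DESC',
    );

    foreach my $sql (@queries) {
        my $total = 0;
        for (1 .. 10) {
            my $t0  = [gettimeofday];
            my $sth = $dbh->prepare($sql);
            $sth->execute;
            $sth->fetchall_arrayref;       # pull every row, like the mysql client does
            $total += tv_interval($t0);
        }
        printf "avg %.4f sec -- %s\n", $total / 10, $sql;
    }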



Caching:
Seeing as 88% of searches are the same basic default search, what about managing a search cache per user, at least for the top fifty nodes ordered by createtime? We could store them in an independent table and just pick the list out when we need it. There are two concerns with caching: invalidation and generation. Assuming we only store the top fifty writeups, the general algorithm will look like this (a rough Perl sketch follows the outline and note below):

On User Search for Person
  Search for Cache (by user and searchtype)
     if found use cache
     if not found generate cache, store list, use cache.

On Writeup deletion
  If in first 50 writeups (delete is slow anyways), invalidate cache.

On Writeup addition
  Invalidate cache

Note: are there any other places where we'd have to watch out for the cache going out of sync?
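To make the lookup path concrete, here's roughly what it could look like in Perl. The searchcache table, its columns, and getUserSearchCache are all names I'm making up for illustration; nothing like this exists in ecore yet.

    # Sketch of the lookup path.  The searchcache table and its columns are
    # hypothetical -- it assumes a unique key on (cached_user, searchtype)
    # so REPLACE behaves itself.
    sub getUserSearchCache {
        my ($dbh, $user_id, $searchtype) = @_;

        my ($cached) = $dbh->selectrow_array(
            'SELECT node_ids FROM searchcache WHERE cached_user=? AND searchtype=?',
            undef, $user_id, $searchtype);

        return [ split /,/, $cached ] if defined $cached;   # cache hit

        # Cache miss: generate the top fifty and store them as a comma list.
        my $ids = $dbh->selectcol_arrayref(
            'SELECT node_id FROM node WHERE author_user=? AND type_nodetype=117
             ORDER BY createtime DESC LIMIT 50',
            undef, $user_id);

        $dbh->do('REPLACE INTO searchcache (cached_user, searchtype, node_ids)
                  VALUES (?, ?, ?)',
                 undef, $user_id, $searchtype, join(',', @$ids));

        return $ids;
    }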

Logically, we'd probably store the list in a fast array or hash setting, so that all we'd have to do is eval the data to get it out fast, then iterate through it. We can't use the standard settings, because those are already taken for $VARS. That's pretty much the easy part.
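The eval trick would be something like this (Data::Dumper on the way in, eval on the way out; purely illustrative, with made-up node_ids):

    use Data::Dumper;

    # Serialize: a flat list of node_ids becomes a single string we can store.
    $Data::Dumper::Indent = 0;     # keep it on one line
    $Data::Dumper::Terse  = 1;     # no leading '$VAR1 ='
    my $stored = Dumper([ 12345, 12346, 12400 ]);   # "[12345,12346,12400]"

    # Deserialize: one eval gives the array back, no SQL sort involved.
    my $node_ids = eval $stored;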

Invalidation: We could set a quick bit for that, so that we ignore the cache when we search. We'd have to overload writeup maintenance create and writeup maintenance delete (and I could get that last-writeup cache working at the same time in those) to invalidate the caches. Re-validation would occur on the first user search of that user.

Initial Data: We'd need to compile the initial data here to put in everyone's caches. We could do this over about ten minutes one night and get them all rolling. No one would really notice this seeding process, and it's all one-time CPU usage.
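The seeding run itself could be as dumb as the following, reusing the hypothetical table and getUserSearchCache() from the sketch above, with a small pause between users so the db can breathe. The user table query is an assumption about the schema.

    # One-shot seeding run: prime every user's cache.
    my $uids = $dbh->selectcol_arrayref('SELECT user_id FROM user');

    foreach my $uid (@$uids) {
        getUserSearchCache($dbh, $uid, 'createtime DESC');   # builds the cache on a miss
        select(undef, undef, undef, 0.05);                   # ~50ms pause between users
    }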

Other applications: We could try to cache the cool archive, but it would be much harder to do, seeing as C!s get tossed system-wide more often than individual users post writeups.

Generally, I'm looking for comments. There might be a smarter way to tackle this problem, and if so, I'm open to suggestions. This is a big source of lag, as we lock the table to do the user searches now. Thanks.

Caching seems the way to go.

As far as cache invalidation goes, I don't know how this is all going to work, but is there a way to make it like a FIFO stack, such that when a user creates a writeup, we could hit the DB to pop (shift?) the oldest entry off and push the new writeup on? This could save us a few nanoseconds in regeneration; however, if we've forgotten a place where the cache could go out of sync, it could make the problem worse.

jb sez: A cache set up like a FIFO stack is really what I was planning. It's fairly trivial to just tack the id of a new writeup onto the beginning. Deletion is more of a bitch, but I can deal with that; it'll be a quick string replace on the cache. At any rate, that's probably how it's going to be done. It'll be a Perl-based solution, as I can't see any way to cache them directly in SQL.
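In terms of the cache string, that works out to roughly the following (same hypothetical searchcache table as the sketch above; trimming the list back to fifty is left out for brevity):

    # On writeup creation: tack the new id onto the front of the cached list.
    sub cacheAddWriteup {
        my ($dbh, $user_id, $new_id) = @_;
        $dbh->do('UPDATE searchcache SET node_ids=CONCAT(?, ",", node_ids)
                  WHERE cached_user=? AND searchtype=?',
                 undef, $new_id, $user_id, 'createtime DESC');
    }

    # On writeup deletion: quick string fix-up, dropping the id from the list.
    sub cacheRemoveWriteup {
        my ($dbh, $user_id, $dead_id) = @_;
        my ($ids) = $dbh->selectrow_array(
            'SELECT node_ids FROM searchcache WHERE cached_user=? AND searchtype=?',
            undef, $user_id, 'createtime DESC');
        return unless defined $ids;
        my @kept = grep { $_ != $dead_id } split /,/, $ids;
        $dbh->do('UPDATE searchcache SET node_ids=? WHERE cached_user=? AND searchtype=?',
                 undef, join(',', @kept), $user_id, 'createtime DESC');
    }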

Deletion seems like a good time to set the invalid bit.

If we are storing the stats in there too, so that when users search themselves they get C! and vote status, that cache is going to be invalidated every few minutes for large users, which seems impractical. Maybe just store C! status, so when a regular user hits it they can see the number of C!s, and we would only need to invalidate when one of that user's nodes gets C!ed.

jb sez: We wouldn't be storing the vote data and such. It wouldn't be tough to pull that info out of the db; at the very least, individual node lookups by ID (which is the information we'd have) are really trivial.



I'm not sure what extra info the gods can see by default when they do a user search, but we'd probably also have to hide the hidden and marked-for-destruction bits in there too. Then again, if deletion invalidates the cache, what's the point of caching destruction status?

Odd thing: is it faster to grab _just_ the nodes' vote status, unsorted, into an associative array or hash table, and then, when generating the page, merge the cache and the votes together using a custom and preferably fast method?

jb sez: There's no sane way to cache stuff like votes and cools, really. We also do NOT want to be doing any sort of sorting in Perl itself; that's kind of crazy and will eat our webserver alive. Let's let MySQL do its own internal stuff, such as indexing and the like, to get the sorting speed boosts that Perl with a million objects won't get.
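On the indexing point: if the node table doesn't already have an index along these lines (I haven't checked the schema), one like this should let MySQL filter on user and type and walk the createtime order straight off the index instead of doing a filesort. The index name is arbitrary.

    # Possible composite index backing the default user search.
    # Only worth doing if nothing equivalent already exists.
    $dbh->do('CREATE INDEX usersearch_idx
              ON node (author_user, type_nodetype, createtime)');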

Eraser_ replies to what jb sez: Well then, maybe just the C!s, because those are fairly static, and make C!s invalidate the cache. That would mean anyone who is not a god and not the actual user being searched could just use the cache.
