[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: 0.93 correlated subqueries, scalability, (my) conclusion
- To: mckoidb@xxxxxxxxx
- Subject: Re: 0.93 correlated subqueries, scalability, (my) conclusion
- From: Tim Azzopardi <tim@xxxxxxxxxxxxx>
- Date: Fri, 08 Mar 2002 00:29:46 -0000
- Delivered-To: mailing list mckoidb@mckoi.com
- Mailing-List: contact mckoidb-help@mckoi.com; run by ezmlm
- Organization: http://www.TigerFive.com
Toby,
My Ref table is now up to 3.25 million rows. The files in the database /data directory take up 977 Mb.
Unsurprisingly disk access is very heavy on inserts now, the disk light is only on 75% of the time.
I'm commiting after each Url spidered which is an average of 280 rows inserted per url.)
=== 1.3GHz Athlon 750Mb ram os swap file 1150Mb
=== one 5200rpm 12ms 40Gb ATA100 IDE disk
=== jdk1.4
=== java -Xms256 -Xmx512 for JDBCQueryTool
=== McKoi 0.93pr1
=== data_cache_size=99123456
=== lookup_comparison_list=disabled
=== dont_synch_filesystem=enabled
=== running in embedded mode
=== 3.25 Million rows in the Ref table, Url table has 11,880 rows
DBVisualizer (jdk1.3.1)
select value
from Ref r, Url
where wordId = 286 and Url.urlId = r.urlId
and r.urlId in (select urlId from Ref r2 where r2.wordId = 523)
and r.urlId in (select urlId from Ref r2 where r2.wordId = 2423)
and r.urlId in (select urlId from Ref r2 where r2.wordId = 5142)
and r.urlId in (select urlId from Ref r2 where r2.wordId = 6407)
1st run 111s
2nd run 0.2s
3rd run 0.23s
4th run 0.23s
5th run 0.25s
6th run 0.23s
Quit DBVisualizer and reran
1st run 1s OS Caching?
2nd run 0.2s
3rd run 0.2s
4th run 0.2s
Quit DBVisualizer and reran
select value
from Ref r, Url
where wordId = 286 and Url.urlId = r.urlId
and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 523)
and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 2423)
and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 5142)
and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 6407)
Out of memory exception under DbVisualizer
Ran under JDBCQueryTool with -Xms256 -Xms512 (also now using jdk1.4)
1st run 310s (peak memory usage 283Mb)
2nd run 335s
Quit JDBCQueryTool & Rerun JDBCQueryTool
select wordId, count(*) from Ref where wordId in ( 286 ,523 ,2423,5142,6407) group by wordId
wordId count(*)
=================
286 6249 up from 5845
523 850 up from 301
2423 4587 up from 2607
5142 2474 up from 2021
6407 1282 up from 1236
1st run 13s (AMAZING os must be caching disk here)
2nd run 0.4s
3rd run 0.25s
(Didn't quit query tool)
select value, count(*)
from Ref, Url
where wordId in (286,523,2423,5142,6407)
and Url.urlId = Ref.urlId
group by Url.value
having count(*) = 5
1st run 3s
2nd run 1s
3rd run 1s
4th run 1s
Quit JDBCQueryTool & Rerun JDBCQueryTool
1st run 3s
2nd run 1s
3rd run 1s
4th run 1s
Reboot OS (switch off) XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
1st run 129s
2nd run 1.2s
3rd run 1.1s
4th run 1.1s
select value
from Ref r, Url
where wordId = 286 and Url.urlId = r.urlId
and r.urlId in (select urlId from Ref r2 where r2.wordId = 523)
and r.urlId in (select urlId from Ref r2 where r2.wordId = 2423)
and r.urlId in (select urlId from Ref r2 where r2.wordId = 5142)
and r.urlId in (select urlId from Ref r2 where r2.wordId = 6407)
1st run 0.25s
2nd run 0.2s
3rd run 0.2s
4th run 0.2s
5th run 0.3s
6th run 0.2s
select value
from Ref r, Url
where wordId = 286 and Url.urlId = r.urlId
and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 523)
and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 2423)
and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 5142)
and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 6407)
1st run 363s (mem usage up to 280Mb)
2nd run 399s (mem usage up to 450Mb !!!!!!)
(Memory usage for all other queries stays under 100Mb)
My conclusion is that McKoi 0.93pr1 scales very well, provided that you are careful
with your queries (as with any rdbms) and have as much memory as you need to cache
the whole database which I think is what is happening for most of my resuls here.
(And if McKoi doesn't cache the data then the OS can. Get more RAM if you need it).
In 20 hours or so, I have spidered 11,800 urls including http://news.bbc.co.uk.
(My guess is that the 20 hour process was mostly (slow ide) disk bound on inserts or waiting
for slow web sites) I've got a simple tomcat JSP page that looks a bit like google running
the type B queries against McKoi.
(e.g. SELECT value from Ref r, Url WHERE Url.urlId = r.urlId and wordId = 1444 and r.u
rlId in (select urlId from Ref r2 where r2.wordId = 15403) and r.urlId in (selec
t urlId from Ref r2 where r2.wordId = 22572) and r.urlId in (select urlId from R
ef r2 where r2.wordId = 10901))
I've tried a number of unrelated queries such as "russia moscow putin chechnya war terrorists",
"lionel jospin", "india pakistan nuclear army", all are returning results in under ten seconds.
The query "france spain italy germany holland denmark ireland norway sweden greece portugal"
returned in 40 seconds, the first time, instantly thereafter. "scotland edinburgh glasgow theatre play"
took about ten seconds the first time, instant thereafter. "bush putin china blair world trade centre september"
took 30 seconds to return no results, then instantly. "world trade centre september" took 10 seconds
to come back with 30 results. The query "one two three four five six seven eight nine ten eleven
twelve thirteen fourteen fifteen" came back with one result in 80 seconds.
Summary: "Real world" queries of five or six words are returning in an acceptable time.
Bearing in mind, that there are 3.25 million Rows in the Ref table, I think thats pretty good.
Keep up the good work,
Tim
07/03/2002 09:01:57, Tim Azzopardi <tim@tigerfive.com> wrote:
>I'm running in embedded mode. Will try you suggestions after day job (sigh)
>- Tim
>
>07/03/2002 05:43:18, Tobias Downer <toby@mckoi.com> wrote:
>
>>Tim,
>>
>>One question - are you running Mckoi embedded or client-server? I ask
>>because I'm very curious about the result below (as you were);
>>
>>> select value
>>> from Ref r, Url
>>> where wordId = 286 and Url.urlId = r.urlId
>>> and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 523)
>>> and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 2423)
>>> and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 5142)
>>> and r.urlId in (select urlId from Ref r2 where r2.urlId = r.urlId and r2.wordId = 6407)
>>> ==============================
>>> 27 rows (Results list urls with references to all five words)
>>> 1st run 127s
>>> 2nd run 10s
>>> 3rd run 9s
>>> 4th run 6s
>>> 5th run 6s
>>> 6th run 6s
>>> Quit the JDBCQueryTool and reran
>>> 1st run 565s
>>> 2nd run 431s
>>> 3rd run 267s
>>> 4th run 367s Hmmm, not sure what to make of this
>>
>>Quitting and rerunning the JDBCQueryTool in embedded mode would mean the
>>data in the tables would need to be re-cached from disk. However, if
>>you are running client-server and you didn't shut down the server when
>>you shut down the query tool, I really don't know how to account for
>>this result.
>>
>>The odd 100+ second queries can probably be accounted to the secondary
>>index building process and/or the caching. Actually, the secondary
>>index building process in Mckoi is probably not a good thing for the
>>engine to attempt when it is handling million+ row tables. You might
>>want to disable this and see how it effects your result. To disable
>>secondary indexes, edit your configuration and disable the
>>'lookup_comparison_list' property (change it to 'disabled').
>>
>>If you disable the lookup_comparison_list property, you may not see such
>>fast queries in the 5th/6th runs but you also won't see such poor
>>performance in the early queries either.
>>
>>Also, if you are using the engine in embedded mode, it might be an idea
>>to switch to client-server mode instead. When in client-server mode,
>>caching and secondary indexes have the most benefit because they are a
>>one off cost that offer benefits until the server is restarted.
>>
>>(Hmm, don't know what I was thinking suggesting Query C!)
>>
>>Toby.
>>
>>
>>---------------------------------------------------------------
>>Mckoi SQL Database mailing list http://www.mckoi.com/database/
>>To unsubscribe, send a message to mckoidb-unsubscribe@mckoi.com
>>
>>
>>
>
>
>
>
>
>---------------------------------------------------------------
>Mckoi SQL Database mailing list http://www.mckoi.com/database/
>To unsubscribe, send a message to mckoidb-unsubscribe@mckoi.com
>
>
>
search.png