How to optimize a GROUP BY on a string column for a big table?


0 like 0 dislike
MySQL. There is a news table; it already has about 70 thousand records and will keep growing.


The structure is this:

CREATE TABLE IF NOT EXISTS `news` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `id_section` int(11) NOT NULL,
  `title` varchar(250) NOT NULL,
  `description` text,
  `image` varchar(250) DEFAULT NULL,
  `url` varchar(250) NOT NULL,
  `timestamp` int(10) unsigned NOT NULL,
  `active` tinyint(1) unsigned DEFAULT '1',
  PRIMARY KEY (`id`),
  KEY `id_section` (`id_section`),
  KEY `timestamp` (`timestamp`),
  KEY `title` (`title`),
  KEY `active` (`active`),
  KEY `url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=69653 ;



Here is the problem: records are added automatically, so the same entry can be inserted multiple times with different id_section values.


Therefore, when data is fetched without specifying id_section (i.e. showing news from all sections), duplicate records appear, and as a rule they come one after another. That looks bad.


While the table was small, the solution was this query:

SELECT `news`.* FROM `news` WHERE (active = 1) GROUP BY `url` ORDER BY `timestamp` desc LIMIT 10 OFFSET 20



However, this query now takes 4-5 seconds (!!!) to execute.
I need a solution that brings it down to at least 0.5 seconds.

Note: without the GROUP BY, this query runs in 0.7 seconds, even though other queries on small tables take microseconds.


Any suggestions for optimization are welcome, not only for this particular query. Perhaps there are special techniques for solving this kind of problem.

7 Answers

0 like 0 dislike
Maybe compute a hash for each added news item (from the URL, the text, or the title) and store it in a column declared UNIQUE; then the DBMS itself will weed out the duplicates.
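
A minimal sketch of that idea, assuming an MD5 of the url is stored in a new column (the `url_hash` name and the INSERT values are illustrative, not from the original answer); note that existing duplicates would have to be removed before the UNIQUE index can be added:

ALTER TABLE `news` ADD COLUMN `url_hash` CHAR(32) NOT NULL DEFAULT '';
UPDATE `news` SET `url_hash` = MD5(`url`);
ALTER TABLE `news` ADD UNIQUE KEY `uniq_url_hash` (`url_hash`);

-- new rows repeating an existing URL are then rejected by the DBMS itself
-- (or silently skipped with INSERT IGNORE):
INSERT IGNORE INTO `news` (`id_section`, `title`, `url`, `timestamp`, `url_hash`)
VALUES (5, 'Some title', 'http://example.com/item', UNIX_TIMESTAMP(), MD5('http://example.com/item'));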
0 like 0 dislike
Why not make a many-to-many relationship between the news and sections tables?
Then the GROUP BY would not be necessary.
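
A sketch of what that might look like (the `news_sections` link table and the queries below are assumptions based on this suggestion, not something from the question): each news item is stored once, and the sections it belongs to go into a separate table.

CREATE TABLE IF NOT EXISTS `news_sections` (
  `id_news` bigint(20) unsigned NOT NULL,
  `id_section` int(11) NOT NULL,
  PRIMARY KEY (`id_news`, `id_section`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

-- the "all sections" feed then needs no GROUP BY at all:
SELECT `news`.* FROM `news`
WHERE active = 1
ORDER BY `timestamp` DESC LIMIT 10 OFFSET 20;

-- and a single section becomes a join:
SELECT n.* FROM `news` n
JOIN `news_sections` ns ON ns.id_news = n.id
WHERE n.active = 1 AND ns.id_section = 5
ORDER BY n.`timestamp` DESC LIMIT 10 OFFSET 20;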
0 like 0 dislike
By the way, what does EXPLAIN show for your query? Maybe the database needs some partitioning, though 70,000 rows is small change for it.
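
For reference, checking the plan looks like this; for this query shape one would expect to see something like "Using temporary; Using filesort" in the Extra column, since grouping by `url` and sorting by `timestamp` cannot be served by a single index:

EXPLAIN SELECT `news`.* FROM `news`
WHERE active = 1
GROUP BY `url`
ORDER BY `timestamp` DESC
LIMIT 10 OFFSET 20;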
0 like 0 dislike
And where do the duplicates come from? Does the automatic addition of news mean grabbing it from other sources, hence the duplicates?
0 like 0 dislike
I thought about that; it turned out to be rather difficult.
The table holds a large amount of data, and a new entry is added every 20 minutes. So every 20 minutes, for each new entry, we would have to run through the whole dataset and work out whether such a news item already exists. If it does, take its ID and record that this news item also belongs to another section.
Did I understand the idea correctly? Keep only unique news and move the duplication out into a separate table. The problem is the heavy computation of "duplicates": comparing URLs, i.e. strings of variable length.

Then you are wrong on several points:
70,000 rows is not a large amount of data; it is a very small amount. "Large" starts 4-5 orders of magnitude higher.
Running SELECT `id` FROM `news` WHERE `url` = $url once every 20 minutes is a far cheaper operation than doing GROUP BY `url` for every visitor.
A string of variable length: if the length is bounded, you can make it a varchar column, index it, and everything will work fine.

"without GROUP BY, this query performs in 0.7 seconds"

And is there an index on the `active` field?
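
Put together, the insertion flow being discussed might look roughly like this (a sketch only: the literal values are placeholders, and `news_sections` is the hypothetical link table from the many-to-many answer above):

-- one indexed point lookup per incoming item, not a scan of the whole table:
SELECT `id` FROM `news` WHERE `url` = 'http://example.com/item' LIMIT 1;

-- if a row came back, only record the additional section for it:
INSERT IGNORE INTO `news_sections` (`id_news`, `id_section`) VALUES (69652, 7);

-- otherwise insert the news item itself:
INSERT INTO `news` (`id_section`, `title`, `description`, `url`, `timestamp`, `active`)
VALUES (7, 'Some title', 'Body text', 'http://example.com/item', UNIX_TIMESTAMP(), 1);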
0 like 0 dislike
Try grouping like this: GROUP BY MD5(url)
And get rid of the LIMIT/OFFSET, for example:
WHERE id > 20 AND id < 30
The example above is only suitable for contiguous ids (i.e. when there are no gaps).
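
Applied to the original query, the two suggestions together would read something like this (a sketch only; whether it actually helps should be checked with EXPLAIN, since MD5(url) cannot use the existing index on `url`):

SELECT `news`.* FROM `news`
WHERE active = 1
  AND id > 20 AND id < 30      -- range on the primary key instead of LIMIT ... OFFSET
GROUP BY MD5(`url`)
ORDER BY `timestamp` DESC;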
0 like 0 dislike
Query caching (http://habrahabr.ru/blogs/mysql/108418/) can be a temporary solution.
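For the MySQL query cache of that era (it was removed in MySQL 8.0), the relevant settings and counters are roughly these; the sizes are illustrative:

-- in my.cnf:
--   query_cache_type = 1
--   query_cache_size = 64M

-- check at runtime:
SHOW VARIABLES LIKE 'query_cache%';
SHOW STATUS LIKE 'Qcache%';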
