Handling a large number of identifiers in Solr

I need to implement a search over online users in Solr: a user needs to find a list of users who are currently online and match certain criteria.

How I handle this now: we store the IDs of online users in a table, and I send all of them in the Solr request, for example

&fq=-id:(id1 id2 id3 ............id5000) 

The problem with this approach is that when the number of identifiers grows large, Solr takes too long to process the query, and we have to send a very large request over the network.
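To make the size problem concrete, here is a minimal sketch that builds the filter query described above, using hypothetical user IDs:

```python
# Sketch: building the negative filter query from the question, with
# hypothetical user IDs. Shows how the raw request grows with the
# number of online users.
online_ids = [f"u{i}" for i in range(5000)]  # hypothetical IDs

# Exclude every listed user, as in the question's fq parameter
fq = "-id:(" + " ".join(online_ids) + ")"

print(len(fq))  # the raw query string alone is tens of kilobytes
```

And this string still has to be URL-encoded and parsed by Solr on every request.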

One solution might be to use a join in Solr, but the online data changes frequently (say, every 5-10 minutes) and I cannot re-index the data every time; indexing should happen at most once an hour.

Another solution I can think of is handling this request inside Solr based on a specific parameter in the URL. I don't know much about Solr's internals, so I don't know how to proceed.

4 answers

With Solr 4's soft commits, committing became cheap enough that you can store the "online" flag directly in the user record and simply add &fq=online:true to your request. This removes the overhead of sending 5000 identifiers over the wire and parsing them, and lets Solr optimize the query a bit. Whenever someone logs in or out, update their status and set commitWithin on the update request. It is well worth it.
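The flag update this answer describes can be sketched as a Solr atomic update with commitWithin. The core name "users" and field name "online" are assumptions; the payload is built here and can be POSTed with any HTTP client:

```python
import json

# Sketch, assuming a "users" core with an "online" boolean field:
# flip the flag via an atomic update and rely on commitWithin for a
# cheap (soft) commit instead of a hard commit per login/logout.
def online_status_update(user_id, online, commit_within_ms=5000):
    """Build a Solr atomic-update payload and the request params."""
    doc = {"id": user_id, "online": {"set": online}}
    params = {"commitWithin": commit_within_ms}
    return json.dumps([doc]), params

payload, params = online_status_update("u42", True)
# POST payload to http://localhost:8983/solr/users/update with these
# params; at query time, filter with &fq=online:true
```

With a soft-commit-friendly autoCommit configuration, the new flag becomes searchable within the commitWithin window.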


We dealt with this problem by sharding the data.

Basically, without going into the code details:

  • Write your own indexing code
    • use consistent hashing to decide which Solr server each identifier goes to
    • index each user's data on the corresponding shard (there may be several machines)
    • make sure you have redundancy
  • Query the Solr shards
    • issue distributed Solr queries using the shards parameter
    • run EmbeddedSolr and use it to execute a sharded request
    • Solr will query all the shards and merge the results; it also supports timeouts if you need to limit the per-shard query time.
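The routing step above can be sketched with a stable hash over the user ID. The shard URLs are hypothetical placeholders:

```python
import hashlib

# Minimal sketch of hash-based shard routing: a stable hash decides
# which Solr server owns each user ID. Shard addresses are assumptions.
SHARDS = ["solr1:8983/solr/users", "solr2:8983/solr/users"]

def shard_for(user_id: str) -> str:
    # md5 gives a stable value across processes (unlike Python's hash())
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

# The same ID always routes to the same shard, so indexing and
# querying agree on document placement.
print(shard_for("u42") == shard_for("u42"))  # True

# A distributed query then lists every shard in one parameter:
shards_param = ",".join(SHARDS)  # pass as &shards=... to any one node
```

Note that simple modulo routing reshuffles most documents when a shard is added; a full consistent-hashing ring avoids that, at the cost of more code.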

Even with everything said above, I don't think Solr is a good fit for this. Solr is not well suited to searching over indexes that change constantly, and if you are mostly looking documents up by identifier, you don't really need a search engine.

For our project we implemented most of the index building, load balancing, and query mechanism ourselves, and use Solr mainly as storage. But we started back when Solr's sharding support was flaky and immature, and I'm not sure what its state is today.

Finally, if I were building this system from scratch today, without the work we have put in over the past 4 years, I would recommend using a cache to store all the users who are currently online (say, memcached or Redis), and at query time simply iterating over all of them and filtering by the criteria. Filtering by criteria can be cached independently and updated incrementally, and iterating over 5000 records does not necessarily take long if the matching logic is simple.
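The cache-based approach can be sketched as follows; a plain Python set stands in for Redis/memcached, and the profile data and criteria are invented for illustration:

```python
# Sketch of the cache-based approach: keep the online users in an
# in-memory store (a plain set here, standing in for Redis/memcached)
# and filter them in-process at query time.
online_users = {"u1", "u2", "u3"}          # maintained on login/logout
profiles = {                                # hypothetical user data
    "u1": {"age": 25, "city": "NYC"},
    "u2": {"age": 31, "city": "LA"},
    "u3": {"age": 27, "city": "NYC"},
}

def search_online(criteria):
    """Iterate over online users only and apply the criteria directly."""
    return [uid for uid in sorted(online_users)
            if criteria(profiles[uid])]

print(search_online(lambda p: p["city"] == "NYC"))  # ['u1', 'u3']
```

Even with thousands of online users, a pass like this over in-memory data is typically far cheaper than shipping all the IDs to Solr on every request.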


Any scalable solution will involve bringing your data closer to Solr (in batches) and using it internally, NOT sending a very large request at search time, which needs low latency. You should develop your own filter; the filter caches the online-user data and refreshes it periodically (say, every minute). If the data changes very often, consider implementing a PostFilter.

You can find a good example of a filter implementation here: http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/
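The periodic-refresh idea behind such a filter can be sketched independently of Solr's Java API. The loader function and TTL here are assumptions; in a real Solr PostFilter the equivalent logic would live in Java:

```python
import time

# Sketch of the answer's caching filter: refresh the set of online IDs
# at most once per interval instead of receiving them in every request.
# fetch_online_ids is a hypothetical loader (DB or cache lookup).
class OnlineFilter:
    def __init__(self, fetch_online_ids, ttl_seconds=60):
        self._fetch = fetch_online_ids
        self._ttl = ttl_seconds
        self._cached = set()
        self._loaded_at = float("-inf")

    def ids(self):
        now = time.monotonic()
        if now - self._loaded_at >= self._ttl:
            self._cached = set(self._fetch())   # periodic refresh
            self._loaded_at = now
        return self._cached

    def accept(self, doc_id):
        """Per-document check, as a PostFilter's collect() would do."""
        return doc_id in self.ids()

f = OnlineFilter(lambda: ["u1", "u2"], ttl_seconds=60)
print(f.accept("u1"), f.accept("u9"))  # True False
```

The point is that the expensive load happens once per minute, not once per query.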


"One solution might be to use a join in Solr, but the online data changes regularly and I can't index the data every time (say, every 5-10 minutes; this should be at least an hour)."

I think you can use Solr joins very well here, with a bit of improvisation.

The solution I propose is the following:

You can have 2 indexes (Solr cores):

 1. Primary index (the one you have now)
 2. Secondary index with only two fields, "ID" and "IS_ONLINE"

Now you can refresh the secondary index frequently (within seconds) and keep it in sync with the table that stores online users.

NOTE: Even though this secondary index is updated frequently, it does not degrade performance if you make the necessary adjustments, such as using appropriate queries during delta import, etc.

Now you can perform a Solr join on the ID field across these two indexes to achieve what you want. Here is a link on how to do Solr joins between Solr indexes/cores.
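The cross-core join can be sketched as a query against the primary core with a join filter referencing the secondary core. Core names follow the answer's description; the host and the example criteria are assumptions:

```python
from urllib.parse import urlencode

# Sketch, assuming a primary "users" core and a secondary
# "online_status" core holding ID and IS_ONLINE: filter the main
# search by the join against the frequently refreshed core.
params = {
    "q": "city:NYC",  # hypothetical search criteria on the primary core
    "fq": "{!join from=ID to=id fromIndex=online_status}IS_ONLINE:true",
}
url = "http://localhost:8983/solr/users/select?" + urlencode(params)
print(url)
```

This keeps the heavyweight primary index on its hourly schedule while only the tiny two-field core is re-imported often.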


Source: https://habr.com/ru/post/944028/
