How to efficiently fetch and store tweets from several hundred Twitter profiles?

The site I'm working on needs to fetch tweets from 150-300 people, store them locally, and then list them on the front page. The profiles are organised into groups.

The pages that will be displayed:

  • the last 20 tweets (or 21-40, etc.) by date, for a group of profiles, a single profile, a search, or a "topic" (which is essentially just another kind of group, I think)
  • a live, context-aware tag cloud (based on the last 300 tweets of the current search, group of profiles, or single profile); see the sketch just after this list
  • various statistics (per group, most active profiles, etc.), depending on the type of page displayed.
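
For the tag cloud, here is a minimal sketch of how word counts could be built from the last ~300 tweets of the current view; the function name, the tokenisation, and the stop-word list are illustrative assumptions, not part of the actual plan:

```php
<?php
// Sketch only: count word frequencies over the last ~300 tweet texts that the
// current page (search, group, or single profile) has already loaded.
function buildTagCloud(array $tweetTexts, $maxTags = 40)
{
    // Tiny illustrative stop-word list; a real one would be much longer.
    $stopWords = array('the', 'and', 'for', 'you', 'this', 'that', 'with', 'http');
    $counts = array();

    foreach ($tweetTexts as $text) {
        // Crude tokenisation: lowercase runs of 3+ letters.
        preg_match_all('/[[:alpha:]]{3,}/u', mb_strtolower($text), $matches);
        foreach ($matches[0] as $word) {
            if (!in_array($word, $stopWords, true)) {
                $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
            }
        }
    }

    arsort($counts);
    return array_slice($counts, 0, $maxTags, true); // word => frequency, biggest first
}
```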

We expect a fair amount of traffic. The last, similar site peaked at almost 40 thousand visits per day and ran into problems before I started caching pages as static files and disabled some features (a few of them by accident). The problems were mainly because a page load would also fetch the latest x tweets for the 3-6 profiles that had gone the longest without being updated.

With this new site I can use cron to fetch the tweets, which should help. I will also denormalize the DB a bit so it needs fewer joins, optimizing it for faster selects rather than for size.
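
For what it's worth, here is a sketch of the kind of denormalization I mean (the table and column names are made up for illustration): copying group_id onto every tweet row means the "last 20 tweets for a group" page is a single indexed range scan with no join.

```php
<?php
// Sketch with assumed names (tweets, group_id, profile_id, created_at);
// the point is the composite indexes that match the front-page queries.
$pdo = new PDO('mysql:host=localhost;dbname=tweets_db', 'user', 'pass');

$pdo->exec("
    CREATE TABLE IF NOT EXISTS tweets (
        tweet_id    BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        profile_id  BIGINT UNSIGNED NOT NULL,
        group_id    INT UNSIGNED    NOT NULL,  -- denormalized from the profile
        body        VARCHAR(280)    NOT NULL,
        created_at  DATETIME        NOT NULL,
        KEY idx_group_date   (group_id, created_at),
        KEY idx_profile_date (profile_id, created_at)
    ) ENGINE=InnoDB
");

// 'Last 20 tweets for a group' (or 21-40 for page 1, etc.).
$groupId = 3;                 // whichever group the page is for
$page    = 0;
$offset  = (int) $page * 20;  // cast to int before interpolating into LIMIT

$stmt = $pdo->prepare(
    "SELECT tweet_id, profile_id, body, created_at
       FROM tweets
      WHERE group_id = ?
      ORDER BY created_at DESC
      LIMIT $offset, 20"
);
$stmt->execute(array($groupId));
$tweets = $stmt->fetchAll(PDO::FETCH_ASSOC);
```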

Now the main question: how do I figure out which profiles to check for new tweets in an efficient way? Some people tweet more often than others, and some tweet in bursts (this happens a lot). I want to keep the front page of the site as "fresh" as possible. If it comes down to, say, 300 profiles and I only check 5 of them every minute, some tweets will show up an hour after the fact. I can check more often (up to 20K), but I want to optimize this as much as possible so that I neither hit the rate limit nor run out of resources on the local server (the other site hit the MySQL connection limit).
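
One possible way to decide which profiles to poll next (a sketch only; the profiles table, its avg_interval_sec / last_checked_at columns, and the fetch_latest_tweets() helper are all invented for illustration) is to rank profiles by how overdue they are relative to their own average tweeting pace, so busy accounts get checked more often than quiet ones:

```php
<?php
// Sketch: pick the 5 profiles whose expected next tweet is most overdue.
// avg_interval_sec would be a rolling average of seconds between a profile's
// tweets, updated whenever new tweets are stored for it.
$pdo = new PDO('mysql:host=localhost;dbname=tweets_db', 'user', 'pass');

$batch = $pdo->query(
    "SELECT profile_id
       FROM profiles
      ORDER BY (UNIX_TIMESTAMP() - UNIX_TIMESTAMP(last_checked_at))
               / GREATEST(avg_interval_sec, 60) DESC
      LIMIT 5"
)->fetchAll(PDO::FETCH_COLUMN);

foreach ($batch as $profileId) {
    fetch_latest_tweets($profileId);   // placeholder for the actual Twitter API call
    $pdo->prepare("UPDATE profiles SET last_checked_at = NOW() WHERE profile_id = ?")
        ->execute(array($profileId));
    sleep(2);                          // spread the batch out across the minute
}
```
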
Question 2: since cron only fires once a minute, I figure I need to check several profiles every minute; as mentioned, at least 5, possibly more. To spread that out over the minute, I could sleep for a few seconds between batches or even between individual profiles. But if the whole thing takes longer than 60 seconds, the script will overlap with the next run of itself. Is that a problem? If so, how can I avoid it?
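
On the overlap question, a common fix is a non-blocking lock file: if the previous run is still holding the lock, the new run just exits instead of stacking on top of it. A minimal sketch, assuming a writable /tmp path:

```php
<?php
// If another instance already holds the lock, bail out immediately.
$lock = fopen('/tmp/tweet-fetcher.lock', 'c');
if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0); // previous cron run still in progress; skip this minute
}

// ... do the polling work here, with sleep() between batches if desired ...

flock($lock, LOCK_UN);
fclose($lock);
```
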
Question 3: any other tips? Relevant articles or URLs?

+4
2 answers

I wouldn't use cron at all; just use the Twitter Streaming API with a filter on your 150-300 Twitter users.

statuses/filter

Returns public statuses that match one or more filter predicates. At least one predicate parameter (follow, locations, or track) must be specified. Multiple parameters may be specified, which allows most clients to use a single connection to the Streaming API. Placing long parameters in the URL may cause the request to be rejected for excessive URL length. Use a POST request to avoid long URLs.

The default access level allows up to 200 track keywords, 400 follow user IDs and 10 1-degree location boxes. Increased access levels allow 80,000 follow user IDs ("shadow" role), 400,000 follow user IDs ("birddog" role), 10,000 track keywords ("restricted track" role), 200,000 track keywords ("partner track" role), and 200 10-degree location boxes ("locRestricted" role). Increased track access levels also pass a higher proportion of statuses before limiting the stream.

I believe that when you specify user IDs, you do in fact get all of their tweets from the streaming API:

Statuses from low-quality users are removed from all streams that are not selected by user ID. Results selected by user ID (currently only the follow predicate) do allow statuses from low-quality users to be delivered.

This will give you results in real time, without having to worry about rate limits. You just need to make sure you can consume the data quickly enough, but with 300 users that shouldn't be a problem.

Update: how to consume the API. Unfortunately I've never had a chance to play with the streaming API myself. However, I have written daemonized PHP scripts before (yes, I know that's not PHP's strong suit, but if everything else you do is PHP, it can be done).

I would set up a simple PHP script to consume the statuses and dump them (as raw JSON) into a message queue. Then I would point another script at the message queue to pick up the statuses and insert them into the database. That way the DB connection and processing time never get in the way of simply receiving the streaming data.

From a quick look, phirehose would fit the first part of this solution. Something like beanstalkd (with pheanstalk) would work as the message queue.
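
To make that concrete, here is a rough sketch of both pieces, loosely based on phirehose's own OAuth filter example and pheanstalk 3.x usage; all credentials, the tube name, the user IDs, and the tweets table columns are placeholders, and the exact class/method names may need adjusting for the library versions you actually install.

```php
<?php
// collector.php: consume the stream and push raw status JSON to beanstalkd.
require 'vendor/autoload.php'; // assumes composer installs of phirehose + pheanstalk

use Pheanstalk\Pheanstalk;

// phirehose's OAuth example expects the app credentials as constants.
define('TWITTER_CONSUMER_KEY', 'your-consumer-key');
define('TWITTER_CONSUMER_SECRET', 'your-consumer-secret');

class QueueingCollector extends OauthPhirehose
{
    public $queue; // Pheanstalk instance, injected below

    // Called by phirehose for every status on the stream: hand the raw JSON
    // to beanstalkd and return, so DB work can never stall the connection.
    public function enqueueStatus($status)
    {
        $this->queue->useTube('tweets')->put($status);
    }
}

$collector = new QueueingCollector('oauth-token', 'oauth-token-secret', Phirehose::METHOD_FILTER);
$collector->queue = new Pheanstalk('127.0.0.1');
$collector->setFollow(array(12345, 67890)); // the 150-300 user IDs to follow
$collector->consume(); // blocks; run it as a long-lived daemon (e.g. under supervisord)
```

And a separate worker that drains the queue into MySQL:

```php
<?php
// worker.php: pull raw statuses off the queue and insert them into the DB.
require 'vendor/autoload.php';

use Pheanstalk\Pheanstalk;

$pdo    = new PDO('mysql:host=localhost;dbname=tweets_db', 'user', 'pass');
$queue  = new Pheanstalk('127.0.0.1');
$insert = $pdo->prepare(
    "INSERT IGNORE INTO tweets (tweet_id, profile_id, body, created_at)
     VALUES (:id, :profile, :body, :created)"
);

while (true) {
    $job   = $queue->watch('tweets')->ignore('default')->reserve(); // blocks until a job arrives
    $tweet = json_decode($job->getData(), true);

    // The stream also delivers deletion notices etc.; only store real statuses.
    if (isset($tweet['text'], $tweet['user']['id_str'])) {
        $insert->execute(array(
            ':id'      => $tweet['id_str'],
            ':profile' => $tweet['user']['id_str'],
            ':body'    => $tweet['text'],
            ':created' => date('Y-m-d H:i:s', strtotime($tweet['created_at'])),
        ));
    }

    $queue->delete($job);
}
```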

+1

I would look at http://corp.topsy.com/developers/api/

I have no affiliation with them other than having played with the API. I think it will give you exactly what you want, with a much higher API rate limit.

0

Source: https://habr.com/ru/post/1308929/

