This is a rank-frequency graph for an IP network on log-log scales. Having completed this part, I am trying to plot a line of best fit on the log-log axes using Python 2.7. I have to use matplotlib's "symlog" axis scale; otherwise some of the values are not displayed correctly and some are hidden.
The X values of the data I'm plotting are URLs, and the Y values are the corresponding URL frequencies.
My data is as follows:
'http://www.bing.com/search?q=d2l&src=IE-TopResult&FORM=IETR02&conversationid= 123 0.00052210688591'
`http:`
`http:`
`http:`
The first column contains the URL, the second the corresponding frequency (the number of times that URL occurs), and the third the bytes transmitted. For this analysis I use only the first and second columns. There are 2465 x values (unique URLs) in total.
Below is my code:
import os
import matplotlib.pyplot as plt
import numpy as np
import math
from numpy import *
import scipy
from scipy.interpolate import *
from scipy.stats import linregress
from scipy.optimize import curve_fit
file = open(filename1, 'r')
lines = file.readlines()
result = {}
x=[]
y=[]
for line in lines:
    course,count,size = line.lstrip().rstrip('\n').split('\t')
    if course not in result:
        result[course] = int(count)
    else:
        result[course] += int(count)
file.close()
frequency = sorted(result.items(), key = lambda i: i[1], reverse= True)
x=[]
y=[]
i=0
for element in frequency:
    x.append(element[0])
    y.append(element[1])
z=[]
fig=plt.figure()
ax = fig.add_subplot(111)
z=np.arange(len(x))
print z
logA = [x*np.log(x) if x>=1 else 1 for x in z]
logB = np.log(y)
plt.plot(z, y, color = 'r')
plt.plot(z, np.poly1d(np.polyfit(logA, logB, 1))(z))
ax.set_yscale('symlog')
ax.set_xscale('symlog')
slope, intercept = np.polyfit(logA, logB, 1)
plt.xlabel("Pre_referer")
plt.ylabel("Popularity")
ax.set_title('Pre Referral URL Popularity distribution')
plt.show()
You will see a lot of imported libraries because I have been experimenting with them, but none of my experiments has given the expected result. The code above correctly generates the rank graph, which is the red line. However, the blue line, which should be the line of best fit, is visually incorrect, as you can see. This is the generated graph:

This is the graph that I expect. The dashed lines in the second graph are what I am somehow plotting incorrectly.

Any ideas on how I can solve this problem?
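
For comparison, here is a minimal sketch of a fit done entirely in log space. It assumes the popularity roughly follows a power law, freq ≈ C * rank^slope, which may not hold for this data. The code above fits against `x*np.log(x)` and then evaluates the polynomial at the raw ranks, whereas this sketch fits log(freq) against log(rank) and exponentiates the result before plotting. Ranks start at 1 rather than 0, so a plain "log" scale might be enough instead of "symlog". The function names `fit_power_law`, `fitted_curve` and `plot_rank_fit` are hypothetical, and the least-squares fit is written out by hand here; `np.polyfit` on the two log arrays should give the same line.

```python
import math


def fit_power_law(frequencies):
    """Least-squares line through (log rank, log frequency), ranks from 1.

    Returns (slope, intercept) of log(freq) = slope * log(rank) + intercept.
    Assumes every frequency is > 0 and there are at least two points.
    """
    log_r = [math.log(r) for r in range(1, len(frequencies) + 1)]
    log_f = [math.log(f) for f in frequencies]
    n = float(len(log_r))
    mean_r = sum(log_r) / n
    mean_f = sum(log_f) / n
    sxx = sum((a - mean_r) ** 2 for a in log_r)
    sxy = sum((a - mean_r) * (b - mean_f) for a, b in zip(log_r, log_f))
    slope = sxy / sxx
    intercept = mean_f - slope * mean_r
    return slope, intercept


def fitted_curve(slope, intercept, n_points):
    """Map the fitted line back to data space: exp(intercept) * rank**slope."""
    return [math.exp(intercept) * r ** slope for r in range(1, n_points + 1)]


def plot_rank_fit(frequencies):
    """Plot the data (red) and the fitted power law (blue dashed) on log-log axes."""
    import matplotlib.pyplot as plt  # imported here so the fit works without it

    slope, intercept = fit_power_law(frequencies)
    ranks = list(range(1, len(frequencies) + 1))
    fig, ax = plt.subplots()
    ax.plot(ranks, frequencies, color='r', label='data')
    ax.plot(ranks, fitted_curve(slope, intercept, len(ranks)), 'b--', label='fit')
    ax.set_xscale('log')
    ax.set_yscale('log')
    ax.set_xlabel('Rank')
    ax.set_ylabel('Popularity')
    ax.legend()
    plt.show()
```

With the variables from the code above, this would be called as `plot_rank_fit(y)`, since `y` already holds the frequencies sorted in descending order.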