Best-fit line on log-log scales in Python 2.7

This is a frequency-rank plot of network traffic on logarithmic scales. Having completed that part, I am trying to plot a best-fit line on the log-log scales using Python 2.7. I have to use matplotlib's "symlog" axis scale, otherwise some of the values are not displayed properly and some are hidden.

The X values of the data I am plotting are URLs, and the Y values are the corresponding URL frequencies.

My data is as follows:

`http://www.bing.com/search?q=d2l&src=IE-TopResult&FORM=IETR02&conversationid=  123 0.00052210688591`
`http://library.uc.ca/  118 4.57782298326e-05`
`http://www.bing.com/search?q=d2l+uofc&src=IE-TopResult&FORM=IETR02&conversationid= 114 4.30271029472e-06`
`http://www.nature.com/scitable/topicpage/genetics-and-statistical-analysis-34592   109 1.9483268261e-06`

The data contains the URL in the first column, the corresponding frequency (the number of times the same URL occurs) in the second, and the bytes transferred in the third. For now, I am using only the first and second columns in this analysis. There are 2465 x values (unique URLs) in total.

Below is my code:

import os
import matplotlib.pyplot as plt
import numpy as np
import math
from numpy import *
import scipy
from scipy.interpolate import *
from scipy.stats import linregress
from scipy.optimize import curve_fit

file = open(filename1, 'r')
lines = file.readlines()

result = {}
x=[]
y=[]
for line in lines:
  course,count,size = line.lstrip().rstrip('\n').split('\t')
  if course not in result:
      result[course] = int(count)
  else:
      result[course] += int(count)
file.close()

frequency = sorted(result.items(), key = lambda i: i[1], reverse= True)
x=[]
y=[]
i=0
for element in frequency:
  x.append(element[0])
  y.append(element[1])


z=[]
fig=plt.figure()
ax = fig.add_subplot(111)
z=np.arange(len(x))
print z
logA = [x*np.log(x) if x>=1 else 1 for x in z]
logB = np.log(y)
plt.plot(z, y, color = 'r')
plt.plot(z, np.poly1d(np.polyfit(logA, logB, 1))(z))
ax.set_yscale('symlog')
ax.set_xscale('symlog')
slope, intercept = np.polyfit(logA, logB, 1)
plt.xlabel("Pre_referer")
plt.ylabel("Popularity")
ax.set_title('Pre Referral URL Popularity distribution')
plt.show()

You will see many imported libraries because I have been experimenting with them a lot, but none of my experiments gives the expected result. The code above generates the rank plot correctly (the red line), but the blue line, which is supposed to be the best-fit line, is visibly wrong, as you can see. This is the generated graph:

Correct Rank plot but incorrect curve fit

This is the graph I expect. The dotted lines in the second graph are what I am somehow plotting incorrectly.

Expected graph

Any ideas on how I can solve this problem?

+4
2

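Data that fall on a straight line on log-log axes follow a power law of the form y = c*x^(m). Taking the logarithm of both sides gives the linear equation you are fitting (in the fitted equation, the intercept c stands for the logarithm of the power-law constant):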

log(y) = m*log(x) + c

Calling np.polyfit(log(x), log(y), 1) returns the values of m and c. You can then use them to compute the fitted values log_y_fit:

log_y_fit = m*log(x) + c

Then, to plot against your original data, transform the fitted values back to the original scale:

y_fit = exp(log_y_fit) = exp(m*log(x) + c)
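The steps above can be checked on synthetic data. The following is a minimal sketch, assuming made-up values for the power-law constant (named k_true here to keep it distinct from the fitted intercept c) and for the exponent m_true; none of these numbers come from the question's data.

```python
import numpy as np

# Synthetic power-law data (illustrative values, not from the question):
# y = k * x**m with k = 500 and m = -0.8.
k_true, m_true = 500.0, -0.8
x = np.arange(1, 201, dtype=float)  # ranks 1..200
y = k_true * x ** m_true

# Fit a straight line in log-log space: log(y) = m*log(x) + c.
log_x = np.log(x)
log_y = np.log(y)
m, c = np.polyfit(log_x, log_y, 1)

# Transform the fitted line back to the original scale.
y_fit = np.exp(m * log_x + c)

print("m = %.3f" % m)
print("exp(c) = %.3f" % np.exp(c))
```

Because the data here are an exact power law, the fitted m and exp(c) should match the chosen exponent and constant up to floating-point error, and y_fit should coincide with y.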

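So the two problems you have are:

  • you are calculating the fitted values from the original x coordinates, not from the log(x) coordinates

  • you are plotting the logarithm of the fitted y values against your original data, without transforming them back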

I have addressed both of these in the code below, replacing plt.plot(z, np.poly1d(np.polyfit(logA, logB, 1))(z)) with:

m, c = np.polyfit(logA, logB, 1) # fit log(y) = m*log(x) + c
y_fit = np.exp(m*logA + c) # calculate the fitted values of y 
plt.plot(z, y_fit, ':')

This could be written on one line as: plt.plot(z, np.exp(np.poly1d(np.polyfit(logA, logB, 1))(logA))), but I find that harder to read and debug.

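A few other things are different in the code below:

  • You use a list comprehension when calculating logA from z to filter out values < 1, but z is a linear range and only its first value is < 1. It is simpler to create z starting at 1, which is what I have done.

  • I am not sure why you have the term x*log(x) in the list comprehension for logA. It looked like an error to me, so I did not include it.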
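Both points can be checked in isolation. This is a small sketch, assuming a hypothetical count n of unique URLs and assuming zero-based ranks were passed straight to np.log; it is not taken from the question's data.

```python
import numpy as np

n = 5  # hypothetical number of unique URLs

# Zero-based ranks include 0, and log(0) is -inf, which breaks a least-squares fit.
z0 = np.arange(n)
with np.errstate(divide="ignore"):
    log_z0 = np.log(z0)
has_inf = bool(np.isinf(log_z0).any())

# One-based ranks are all >= 1, so the logarithm is finite everywhere.
z1 = np.arange(1, n + 1)
log_z1 = np.log(z1)

# The x*log(x) term from the question is not the same quantity as log(x).
x_log_x = z1 * np.log(z1)

print(has_inf)
print(bool(np.isfinite(log_z1).all()))
print(bool(np.allclose(x_log_x, log_z1)))
```

Starting the ranks at 1 removes the need for a special case, and using plain log(z) keeps the x values consistent with the equation being fitted.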

This code should work for you:

fig=plt.figure()
ax = fig.add_subplot(111)

z=np.arange(1, len(x)+1) #start at 1, to avoid error from log(0)

logA = np.log(z) #no need for list comprehension since all z values >= 1
logB = np.log(y)

m, c = np.polyfit(logA, logB, 1) # fit log(y) = m*log(x) + c
y_fit = np.exp(m*logA + c) # calculate the fitted values of y 

plt.plot(z, y, color = 'r')
plt.plot(z, y_fit, ':')

ax.set_yscale('symlog')
ax.set_xscale('symlog')
#slope, intercept = np.polyfit(logA, logB, 1)
plt.xlabel("Pre_referer")
plt.ylabel("Popularity")
ax.set_title('Pre Referral URL Popularity distribution')
plt.show()

When I run this code, I get the following graph:

Log-log graph with fitted line


+8

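Here is an alternative. It takes log10 of the ranks and frequencies and fits the line with scipy.stats.linregress.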

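from scipy import stats  # provides stats.linregress, used below

fig=plt.figure()
ax = fig.add_subplot(111)
z=np.arange(len(x)) + 1  # ranks start at 1 so that log10 is defined
print z
print y
# work directly in log10 space instead of using symlog axes
rank = [np.log10(i) for i in z]
freq = [np.log10(i) for i in y]
m, b, r_value, p_value, std_err = stats.linregress(rank, freq)
print "slope: ", m
print "r-squared: ", r_value**2
print "intercept:", b
plt.plot(rank, freq, 'o',color = 'r')
abline_values = [m * i + b for i in rank]  # fitted line in log10 space
plt.plot(rank, abline_values)
plt.show()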
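The slope and r-squared reported by the code above can be cross-checked with NumPy alone. This is only a sketch under assumptions: the frequency values are illustrative (the first four echo the sample rows in the question, the rest are made up), and np.polyfit with np.corrcoef stands in for stats.linregress.

```python
import numpy as np

# Illustrative frequencies in descending order (partly made up, not the full data set).
y = np.array([123, 118, 114, 109, 60, 41, 30, 22, 17, 12], dtype=float)
z = np.arange(1, len(y) + 1)  # ranks start at 1

# Vectorised log10 instead of list comprehensions.
rank = np.log10(z)
freq = np.log10(y)

# Slope and intercept of the straight line in log10 space.
m, b = np.polyfit(rank, freq, 1)

# r-squared from the correlation coefficient.
r = np.corrcoef(rank, freq)[0, 1]
r_squared = r ** 2

# r-squared from the residuals, as a consistency check.
fitted = m * rank + b
ss_res = np.sum((freq - fitted) ** 2)
ss_tot = np.sum((freq - freq.mean()) ** 2)
r_squared_resid = 1.0 - ss_res / ss_tot

print("slope: %.4f" % m)
print("intercept: %.4f" % b)
print("r-squared: %.4f" % r_squared)
```

For a straight-line fit with an intercept, the two r-squared values agree, so either form can be used to judge how closely the data follow a power law.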


0

Source: https://habr.com/ru/post/1676615/

