I have a list of websites for every member of the U.S. Congress that I programmatically scan to extract addresses. Many of the sites vary in their underlying markup, but that wasn't a problem at first, until I noticed that hundreds of the sites were not producing the expected results for the script I wrote.
After spending some time evaluating the potential causes, I found that calling strip_tags() on the result of file_get_contents() was repeatedly deleting most of the page source! It wasn't removing only the HTML, but also the non-HTML content that I wanted to extract!
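For reference, here is a minimal sketch of the approach described above; the URL and variable names are hypothetical, not taken from my actual script:

```php
<?php
// Hypothetical member page; the real script loops over a list of sites.
$url  = 'https://www.example.gov/member-page';
$html = file_get_contents($url);   // fetch the raw page source

// strip_tags() does not validate the markup. On broken HTML (e.g. a stray
// "<" in the text), it can treat a long run of real content as part of an
// unterminated tag and delete it along with the markup.
$text = strip_tags($html);

echo $text;
```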
So I removed the strip_tags() call, replaced it with a call that strips out all non-alphanumeric characters, and gave the process another run. It turned up some results, but still far fewer than expected. This time it was because my regular expressions weren't matching the desired patterns. Looking at the returned code, I realized that leftovers from HTML attributes were interspersed throughout the text, breaking my patterns.
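The second attempt looked roughly like the following sketch (the replacement pattern and the address regex are illustrative assumptions, not my exact code):

```php
<?php
$html = file_get_contents('https://www.example.gov/member-page');

// Replace anything that is not a letter, digit, or basic address
// punctuation with a space. Attribute fragments such as class names and
// style values survive as ordinary words mixed into the text.
$text = preg_replace('/[^A-Za-z0-9.,#\- ]+/', ' ', $html);

// Hypothetical address pattern; leftover attribute words sitting between
// the parts of an address prevent matches like this one.
if (preg_match('/\d+\s+[A-Za-z]+\s+(Street|Ave|Avenue|Blvd)/', $text, $m)) {
    echo $m[0];
}
```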
Is there any way around this? Is it the result of malformed HTML? Is there anything I can do about it?