Removing unwanted markup

Revision as of 07:52, 16 July 2007 by Saji

Not all of the markup in an HTML page is needed, so the unwanted elements have to be removed first. The following is based on the markup used in Informl.


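# Function: scrape_page.rb
require 'hpricot'

def scrape_the_page(pagePath, oFile, hFile)
  # CSS-style selectors for the elements that are not part of the article
  items_to_remove = [
    "#menus",          # menus notice
    "div.markedup",
    "div.navigation",
    "head",            # table of contents
    "hr"
  ]

  # pagePath is assumed to be a local file; a URL would also need open-uri
  doc = Hpricot(open(pagePath))
  @article = (doc/"#container").each do |content|
    # remove unnecessary content and edit links
    items_to_remove.each { |x| (content/x).remove }
  end
  # oFile and hFile are not used in this part of the function
end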
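
To try the removal step without Hpricot installed, the same idea can be sketched with REXML, which ships with standard Ruby installs. This is a swapped-in technique, not the code used for Informl. The sample page, the XPath translations of the selectors, and the names PAGE_HTML, XPATHS_TO_REMOVE and strip_unwanted are all assumptions made for illustration. REXML also expects well-formed XML, so it is stricter than Hpricot about real-world HTML.

```ruby
require 'rexml/document'

# Hypothetical page fragment; the ids and classes mirror the selector list above.
PAGE_HTML = <<~PAGE
  <html>
    <body>
      <div id="container">
        <div id="menus">Menu</div>
        <div class="navigation">Prev | Next</div>
        <h1>Title</h1>
        <hr/>
        <p>Body text.</p>
      </div>
    </body>
  </html>
PAGE

# Rough XPath equivalents of the CSS-style selectors (an assumption of this
# sketch; @class='x' matches only an exact class attribute, unlike div.x).
XPATHS_TO_REMOVE = [
  ".//*[@id='menus']",
  ".//div[@class='markedup']",
  ".//div[@class='navigation']",
  ".//head",
  ".//hr"
].freeze

# Parse the text, find the #container element and delete every match inside it.
def strip_unwanted(xml_text)
  doc = REXML::Document.new(xml_text)
  container = REXML::XPath.first(doc, "//*[@id='container']")
  XPATHS_TO_REMOVE.each do |path|
    # XPath.match returns an Array, so removing nodes while iterating is safe.
    REXML::XPath.match(container, path).each { |node| node.remove }
  end
  container
end

puts strip_unwanted(PAGE_HTML)
```

The structure is the same as in scrape_the_page: select the container, then loop over a list of selectors and remove whatever each one matches. Keeping the selectors in a single list means a new unwanted element only needs one extra entry.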