Click and navigate to chapters and sections
Jump to navigation
Jump to search
In this example, we created new pages for chapters and sections so that each part of the document could be authored by a different person. In Informl new pages are indicated by the CSS class name "existingWikiWord" as shown in the following figure.
<p> <a class="existingWikiWord" href="http://localhost:3010/AnnRep07/pages/APCC+Research+and+Development+Projects"> APCC Research and Development Projects </a> </p>
Knowing this, I have used the following 'hpricot' code to click on chapter and section links to retrieve their contents.
chapters= (doc/"p/a.existingWikiWord") # we need to navigate one more level into the web page # let us discover the links for that chapters.each do |ch| chap_link = ch.attributes['href'] # using inner_html we can create subdirectories chap_name = ch.inner_html.gsub(/\s*/,"") chap_name_org = ch.inner_html # We create chapter directories system("mkdir -p #{chap_name}") fil.write "\\input #{chap_name} \n" chapFil="#{chap_name}.tex" `rm #{chapFil}` cFil=File.new(chapFil,"a") cFil.write "\\chapter{ #{chap_name_org} } \n"
# We navigate to sections now doc2=Hpricot(open(chap_link)) sections= (doc2/"p/a.existingWikiWord") sections.each do |sc| sec_link = sc.attributes['href'] sec_name = sc.inner_html.gsub(/\s*/,"") secFil="#{chap_name}/#{sec_name}.tex" `rm #{secFil}` sFil=File.new(secFil,"a") sechFil="#{chap_name}/#{sec_name}.html" `rm #{sechFil}` shFil=File.new(sechFil,"a")
After navigating to sections (h1 elements in HTML) retrieve their contents and send it to the ruby function "scrape_page.rb" for filtering.
# scrape_the_page(sec_link,"#{chap_name}/#{sec_name}") scrape_the_page(sec_link,sFil,shFil) cFil.write "\\input #{chap_name}/#{sec_name} \n" end end fil.write "\\stoptext \n"