SEO services Sydney

SEO services Sydney

Keyword research frameworks

Google Business Profile categories"Choosing the right categories in your Google Business Profile helps ensure your business appears in relevant search results. Best SEO Agency Sydney Australia. Accurate categories enable Google to match your profile with users searching for your specific services, increasing the likelihood of attracting qualified leads."

Google Business Profile citations"Citations refer to mentions of your business information (name, address, phone number) across the web. Consistent citations that match your Google Business Profile details improve local search visibility and establish trustworthiness with both customers and search engines."

Google Business Profile competitive advantage"Leveraging your Google Business Profile gives you a competitive advantage in local search.

SEO services Sydney - Keyword research frameworks

  • Google ranking factors
  • Search visibility improvements
By optimizing your profile, responding to reviews, and sharing engaging content, you can stand out from competitors and attract more customers."

Best SEO Sydney Agency.

Google Business Profile completeness"A complete Google Business Profile, with all sections filled out and regularly updated, signals to Google that your business is reliable and active. This thoroughness can boost your search rankings and increase customer trust."

Google Business Profile contact details"Including clear, accurate contact details on your Google Business Profile helps customers reach you easily. Visible phone numbers, email addresses, and website links improve trust and increase engagement."

Google Business Profile content strategy"Developing a content strategy for your Google Business Profile ensures youre consistently sharing valuable updates, promotions, and events. Best Search Engine Optimisation Services. A strong strategy increases visibility, builds customer trust, and drives more traffic to your business."

Citations and other Useful links

Search engine optimisation consultants

Google Business Profile enhancements"Enhancements to your Google Business Profile, such as adding photos, updating services, and publishing posts, improve your listings quality and relevance. Regular enhancements keep your profile fresh, engaging, and more likely to rank higher."

Google Business Profile highlightsAdding highlightssuch as unique selling points or special offersto your Google Business Profile makes your listing stand out. These highlights capture user attention and encourage potential customers to choose your business over competitors.

Google Business Profile holiday hours"Updating your holiday hours on your Google Business Profile prevents customer confusion and ensures people know when youre open. Accurate holiday hours improve trust, maintain customer satisfaction, and encourage more visits during busy seasons."

 

World Wide Web
Abbreviation WWW
Year started 1989; 36 years ago (1989) by Tim Berners-Lee
Organization
  • CERN (1989–1994)
  • W3C (1994–current)
A web page from Wikipedia displayed in Google Chrome

The World Wide Web (WWW or simply the Web) is an information system that enables content sharing over the Internet through user-friendly ways meant to appeal to users beyond IT specialists and hobbyists.[1] It allows documents and other web resources to be accessed over the Internet according to specific rules of the Hypertext Transfer Protocol (HTTP).[2]

The Web was invented by English computer scientist Tim Berners-Lee while at CERN in 1989 and opened to the public in 1993. It was conceived as a "universal linked information system".[3][4][5] Documents and other media content are made available to the network through web servers and can be accessed by programs such as web browsers. Servers and resources on the World Wide Web are identified and located through character strings called uniform resource locators (URLs).

The original and still very common document type is a web page formatted in Hypertext Markup Language (HTML). This markup language supports plain text, images, embedded video and audio contents, and scripts (short programs) that implement complex user interaction. The HTML language also supports hyperlinks (embedded URLs) which provide immediate access to other web resources. Web navigation, or web surfing, is the common practice of following such hyperlinks across multiple websites. Web applications are web pages that function as application software. The information in the Web is transferred across the Internet using HTTP. Multiple web resources with a common theme and usually a common domain name make up a website. A single web server may provide multiple websites, while some websites, especially the most popular ones, may be provided by multiple servers. Website content is provided by a myriad of companies, organizations, government agencies, and individual users; and comprises an enormous amount of educational, entertainment, commercial, and government information.

The Web has become the world's dominant information systems platform.[6][7][8][9] It is the primary tool that billions of people worldwide use to interact with the Internet.[2]

History

[edit]
This NeXT Computer was used by Sir Tim Berners-Lee at CERN and became the world's first Web server.

The Web was invented by English computer scientist Tim Berners-Lee while working at CERN.[10][11] He was motivated by the problem of storing, updating, and finding documents and data files in that large and constantly changing organization, as well as distributing them to collaborators outside CERN. In his design, Berners-Lee dismissed the common tree structure approach, used for instance in the existing CERNDOC documentation system and in the Unix filesystem, as well as approaches that relied in tagging files with keywords, as in the VAX/NOTES system. Instead he adopted concepts he had put into practice with his private ENQUIRE system (1980) built at CERN. When he became aware of Ted Nelson's hypertext model (1965), in which documents can be linked in unconstrained ways through hyperlinks associated with "hot spots" embedded in the text, it helped to confirm the validity of his concept.[12][13]

The historic World Wide Web logo, designed by Robert Cailliau. Currently, there is no widely accepted logo in use for the WWW.

The model was later popularized by Apple's HyperCard system. Unlike Hypercard, Berners-Lee's new system from the outset was meant to support links between multiple databases on independent computers, and to allow simultaneous access by many users from any computer on the Internet. He also specified that the system should eventually handle other media besides text, such as graphics, speech, and video. Links could refer to mutable data files, or even fire up programs on their server computer. He also conceived "gateways" that would allow access through the new system to documents organized in other ways (such as traditional computer file systems or the Usenet). Finally, he insisted that the system should be decentralized, without any central control or coordination over the creation of links.[4][14][10][11]

Berners-Lee submitted a proposal to CERN in May 1989, without giving the system a name.[4] He got a working system implemented by the end of 1990, including a browser called WorldWideWeb (which became the name of the project and of the network) and an HTTP server running at CERN. As part of that development he defined the first version of the HTTP protocol, the basic URL syntax, and implicitly made HTML the primary document format.[15] The technology was released outside CERN to other research institutions starting in January 1991, and then to the whole Internet on 23 August 1991. The Web was a success at CERN, and began to spread to other scientific and academic institutions. Within the next two years, there were 50 websites created.[16][17]

CERN made the Web protocol and code available royalty free in 1993, enabling its widespread use.[18][19] After the NCSA released the Mosaic web browser later that year, the Web's popularity grew rapidly as thousands of websites sprang up in less than a year.[20][21] Mosaic was a graphical browser that could display inline images and submit forms that were processed by the HTTPd server.[22][23] Marc Andreessen and Jim Clark founded Netscape the following year and released the Navigator browser, which introduced Java and JavaScript to the Web. It quickly became the dominant browser. Netscape became a public company in 1995 which triggered a frenzy for the Web and started the dot-com bubble.[24] Microsoft responded by developing its own browser, Internet Explorer, starting the browser wars. By bundling it with Windows, it became the dominant browser for 14 years.[25]

Berners-Lee founded the World Wide Web Consortium (W3C) which created XML in 1996 and recommended replacing HTML with stricter XHTML.[26] In the meantime, developers began exploiting an IE feature called XMLHttpRequest to make Ajax applications and launched the Web 2.0 revolution. Mozilla, Opera, and Apple rejected XHTML and created the WHATWG which developed HTML5.[27] In 2009, the W3C conceded and abandoned XHTML.[28] In 2019, it ceded control of the HTML specification to the WHATWG.[29]

The World Wide Web has been central to the development of the Information Age and is the primary tool billions of people use to interact on the Internet.[30][31][32][9]

Nomenclature

[edit]

Tim Berners-Lee states that World Wide Web is officially spelled as three separate words, each capitalised, with no intervening hyphens.[33] Nonetheless, it is often called simply the Web, and also often the web; see Capitalization of Internet for details. In Mandarin Chinese, World Wide Web is commonly translated via a phono-semantic matching to wàn wéi wÇŽng (万维网), which satisfies www and literally means "10,000-dimensional net", a translation that reflects the design concept and proliferation of the World Wide Web.

Use of the www prefix has been declining, especially when web applications sought to brand their domain names and make them easily pronounceable. As the mobile Web grew in popularity,[citation needed] services like Gmail.com, Outlook.com, Myspace.com, Facebook.com and Twitter.com are most often mentioned without adding "www." (or, indeed, ".com") to the domain.[34]

In English, www is usually read as double-u double-u double-u.[35] Some users pronounce it dub-dub-dub, particularly in New Zealand.[36] Stephen Fry, in his "Podgrams" series of podcasts, pronounces it wuh wuh wuh.[37] The English writer Douglas Adams once quipped in The Independent on Sunday (1999): "The World Wide Web is the only thing I know of whose shortened form takes three times longer to say than what it's short for".[38]

Function

[edit]
The World Wide Web functions as an application layer protocol that is run "on top of" (figuratively) the Internet, helping to make it more functional. The advent of the Mosaic web browser helped to make the web much more usable, to include the display of images and moving images (GIFs).

The terms Internet and World Wide Web are often used without much distinction. However, the two terms do not mean the same thing. The Internet is a global system of computer networks interconnected through telecommunications and optical networking. In contrast, the World Wide Web is a global collection of documents and other resources, linked by hyperlinks and URIs. Web resources are accessed using HTTP or HTTPS, which are application-level Internet protocols that use the Internet transport protocols.[2]

Viewing a web page on the World Wide Web normally begins either by typing the URL of the page into a web browser or by following a hyperlink to that page or resource. The web browser then initiates a series of background communication messages to fetch and display the requested page. In the 1990s, using a browser to view web pages—and to move from one web page to another through hyperlinks—came to be known as 'browsing,' 'web surfing' (after channel surfing), or 'navigating the Web'. Early studies of this new behaviour investigated user patterns in using web browsers. One study, for example, found five user patterns: exploratory surfing, window surfing, evolved surfing, bounded navigation and targeted navigation.[39]

The following example demonstrates the functioning of a web browser when accessing a page at the URL http://example.org/home.html. The browser resolves the server name of the URL (example.org) into an Internet Protocol address using the globally distributed Domain Name System (DNS). This lookup returns an IP address such as 203.0.113.4 or 2001:db8:2e::7334. The browser then requests the resource by sending an HTTP request across the Internet to the computer at that address. It requests service from a specific TCP port number that is well known for the HTTP service so that the receiving host can distinguish an HTTP request from other network protocols it may be servicing. HTTP normally uses port number 80 and for HTTPS it normally uses port number 443. The content of the HTTP request can be as simple as two lines of text:

GET /home.html HTTP/1.1
Host: example.org

The computer receiving the HTTP request delivers it to web server software listening for requests on port 80. If the web server can fulfil the request it sends an HTTP response back to the browser indicating success:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8

followed by the content of the requested page. Hypertext Markup Language (HTML) for a basic web page might look like this:

<html>
  <head>
    <title>Example.org – The World Wide Web</title>
  </head>
  <body>
    <p>The World Wide Web, abbreviated as WWW and commonly known ...</p>
  </body>
</html>

The web browser parses the HTML and interprets the markup (<title>, <p> for paragraph, and such) that surrounds the words to format the text on the screen. Many web pages use HTML to reference the URLs of other resources such as images, other embedded media, scripts that affect page behaviour, and Cascading Style Sheets that affect page layout. The browser makes additional HTTP requests to the web server for these other Internet media types. As it receives their content from the web server, the browser progressively renders the page onto the screen as specified by its HTML and these additional resources.

HTML

[edit]

Hypertext Markup Language (HTML) is the standard markup language for creating web pages and web applications. With Cascading Style Sheets (CSS) and JavaScript, it forms a triad of cornerstone technologies for the World Wide Web.[40]

Web browsers receive HTML documents from a web server or from local storage and render the documents into multimedia web pages. HTML describes the structure of a web page semantically and originally included cues for the appearance of the document.

HTML elements are the building blocks of HTML pages. With HTML constructs, images and other objects such as interactive forms may be embedded into the rendered page. HTML provides a means to create structured documents by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes and other items. HTML elements are delineated by tags, written using angle brackets. Tags such as <img /> and <input /> directly introduce content into the page. Other tags such as <p> surround and provide information about document text and may include other tags as sub-elements. Browsers do not display the HTML tags, but use them to interpret the content of the page.

HTML can embed programs written in a scripting language such as JavaScript, which affects the behaviour and content of web pages. Inclusion of CSS defines the look and layout of content. The World Wide Web Consortium (W3C), maintainer of both the HTML and the CSS standards, has encouraged the use of CSS over explicit presentational HTML since 1997.[41]

Linking

[edit]

Most web pages contain hyperlinks to other related pages and perhaps to downloadable files, source documents, definitions and other web resources. In the underlying HTML, a hyperlink looks like this: <a href="http://example.org/home.html">Example.org Homepage</a>.

Graphic representation of a minute fraction of the WWW, demonstrating hyperlinks

Such a collection of useful, related resources, interconnected via hypertext links is dubbed a web of information. Publication on the Internet created what Tim Berners-Lee first called the WorldWideWeb (in its original CamelCase, which was subsequently discarded) in November 1990.[42]

The hyperlink structure of the web is described by the webgraph: the nodes of the web graph correspond to the web pages (or URLs) the directed edges between them to the hyperlinks. Over time, many web resources pointed to by hyperlinks disappear, relocate, or are replaced with different content. This makes hyperlinks obsolete, a phenomenon referred to in some circles as link rot, and the hyperlinks affected by it are often called "dead" links. The ephemeral nature of the Web has prompted many efforts to archive websites. The Internet Archive, active since 1996, is the best known of such efforts.

WWW prefix

[edit]

Many hostnames used for the World Wide Web begin with www because of the long-standing practice of naming Internet hosts according to the services they provide. The hostname of a web server is often www, in the same way that it may be ftp for an FTP server, and news or nntp for a Usenet news server. These hostnames appear as Domain Name System (DNS) or subdomain names, as in www.example.com. The use of www is not required by any technical or policy standard and many websites do not use it; the first web server was nxoc01.cern.ch.[43] According to Paolo Palazzi, who worked at CERN along with Tim Berners-Lee, the popular use of www as subdomain was accidental; the World Wide Web project page was intended to be published at www.cern.ch while info.cern.ch was intended to be the CERN home page; however the DNS records were never switched, and the practice of prepending www to an institution's website domain name was subsequently copied.[44][better source needed] Many established websites still use the prefix, or they employ other subdomain names such as www2, secure or en for special purposes. Many such web servers are set up so that both the main domain name (e.g., example.com) and the www subdomain (e.g., www.example.com) refer to the same site; others require one form or the other, or they may map to different web sites. The use of a subdomain name is useful for load balancing incoming web traffic by creating a CNAME record that points to a cluster of web servers. Since, currently[as of?], only a subdomain can be used in a CNAME, the same result cannot be achieved by using the bare domain root.[45][dubiousdiscuss]

When a user submits an incomplete domain name to a web browser in its address bar input field, some web browsers automatically try adding the prefix "www" to the beginning of it and possibly ".com", ".org" and ".net" at the end, depending on what might be missing. For example, entering "microsoft" may be transformed to http://www.microsoft.com/ and "openoffice" to http://www.openoffice.org. This feature started appearing in early versions of Firefox, when it still had the working title 'Firebird' in early 2003, from an earlier practice in browsers such as Lynx.[46] [unreliable source?] It is reported that Microsoft was granted a US patent for the same idea in 2008, but only for mobile devices.[47]

Scheme specifiers

[edit]

The scheme specifiers http:// and https:// at the start of a web URI refer to Hypertext Transfer Protocol or HTTP Secure, respectively. They specify the communication protocol to use for the request and response. The HTTP protocol is fundamental to the operation of the World Wide Web, and the added encryption layer in HTTPS is essential when browsers send or retrieve confidential data, such as passwords or banking information. Web browsers usually automatically prepend http:// to user-entered URIs, if omitted.

Pages

[edit]
A screenshot of the home page of Wikimedia Commons

A web page (also written as webpage) is a document that is suitable for the World Wide Web and web browsers. A web browser displays a web page on a monitor or mobile device.

The term web page usually refers to what is visible, but may also refer to the contents of the computer file itself, which is usually a text file containing hypertext written in HTML or a comparable markup language. Typical web pages provide hypertext for browsing to other web pages via hyperlinks, often referred to as links. Web browsers will frequently have to access multiple web resource elements, such as reading style sheets, scripts, and images, while presenting each web page.

On a network, a web browser can retrieve a web page from a remote web server. The web server may restrict access to a private network such as a corporate intranet. The web browser uses the Hypertext Transfer Protocol (HTTP) to make such requests to the web server.

A static web page is delivered exactly as stored, as web content in the web server's file system. In contrast, a dynamic web page is generated by a web application, usually driven by server-side software. Dynamic web pages are used when each user may require completely different information, for example, bank websites, web email etc.

Static page

[edit]

A static web page (sometimes called a flat page/stationary page) is a web page that is delivered to the user exactly as stored, in contrast to dynamic web pages which are generated by a web application.

Consequently, a static web page displays the same information for all users, from all contexts, subject to modern capabilities of a web server to negotiate content-type or language of the document where such versions are available and the server is configured to do so.

Dynamic pages

[edit]
Dynamic web page: example of server-side scripting (PHP and MySQL)

A server-side dynamic web page is a web page whose construction is controlled by an application server processing server-side scripts. In server-side scripting, parameters determine how the assembly of every new web page proceeds, including the setting up of more client-side processing.

A client-side dynamic web page processes the web page using JavaScript running in the browser. JavaScript programs can interact with the document via Document Object Model, or DOM, to query page state and alter it. The same client-side techniques can then dynamically update or change the DOM in the same way.

A dynamic web page is then reloaded by the user or by a computer program to change some variable content. The updating information could come from the server, or from changes made to that page's DOM. This may or may not truncate the browsing history or create a saved version to go back to, but a dynamic web page update using Ajax technologies will neither create a page to go back to nor truncate the web browsing history forward of the displayed page. Using Ajax technologies the end user gets one dynamic page managed as a single page in the web browser while the actual web content rendered on that page can vary. The Ajax engine sits only on the browser requesting parts of its DOM, the DOM, for its client, from an application server.

Dynamic HTML, or DHTML, is the umbrella term for technologies and methods used to create web pages that are not static web pages, though it has fallen out of common use since the popularization of AJAX, a term which is now itself rarely used.[citation needed] Client-side-scripting, server-side scripting, or a combination of these make for the dynamic web experience in a browser.

JavaScript is a scripting language that was initially developed in 1995 by Brendan Eich, then of Netscape, for use within web pages.[48] The standardised version is ECMAScript.[48] To make web pages more interactive, some web applications also use JavaScript techniques such as Ajax (asynchronous JavaScript and XML). Client-side script is delivered with the page that can make additional HTTP requests to the server, either in response to user actions such as mouse movements or clicks, or based on elapsed time. The server's responses are used to modify the current page rather than creating a new page with each response, so the server needs only to provide limited, incremental information. Multiple Ajax requests can be handled at the same time, and users can interact with the page while data is retrieved. Web pages may also regularly poll the server to check whether new information is available.[49]

Website

[edit]
The usap.gov website

A website[50] is a collection of related web resources including web pages, multimedia content, typically identified with a common domain name, and published on at least one web server. Notable examples are wikipedia.org, google.com, and amazon.com.

A website may be accessible via a public Internet Protocol (IP) network, such as the Internet, or a private local area network (LAN), by referencing a uniform resource locator (URL) that identifies the site.

Websites can have many functions and can be used in various fashions; a website can be a personal website, a corporate website for a company, a government website, an organization website, etc. Websites are typically dedicated to a particular topic or purpose, ranging from entertainment and social networking to providing news and education. All publicly accessible websites collectively constitute the World Wide Web, while private websites, such as a company's website for its employees, are typically a part of an intranet.

Web pages, which are the building blocks of websites, are documents, typically composed in plain text interspersed with formatting instructions of Hypertext Markup Language (HTML, XHTML). They may incorporate elements from other websites with suitable markup anchors. Web pages are accessed and transported with the Hypertext Transfer Protocol (HTTP), which may optionally employ encryption (HTTP Secure, HTTPS) to provide security and privacy for the user. The user's application, often a web browser, renders the page content according to its HTML markup instructions onto a display terminal.

Hyperlinking between web pages conveys to the reader the site structure and guides the navigation of the site, which often starts with a home page containing a directory of the site web content. Some websites require user registration or subscription to access content. Examples of subscription websites include many business sites, news websites, academic journal websites, gaming websites, file-sharing websites, message boards, web-based email, social networking websites, websites providing real-time price quotations for different types of markets, as well as sites providing various other services. End users can access websites on a range of devices, including desktop and laptop computers, tablet computers, smartphones and smart TVs.

Browser

[edit]

A web browser (commonly referred to as a browser) is a software user agent for accessing information on the World Wide Web. To connect to a website's server and display its pages, a user needs to have a web browser program. This is the program that the user runs to download, format, and display a web page on the user's computer.

In addition to allowing users to find, display, and move between web pages, a web browser will usually have features like keeping bookmarks, recording history, managing cookies (see below), and home pages and may have facilities for recording passwords for logging into websites.

The most popular browsers are Chrome, Safari, Edge, Samsung Internet and Firefox.[51]

Server

[edit]
The inside and front of a Dell PowerEdge web server, a computer designed for rack mounting

A Web server is server software, or hardware dedicated to running said software, that can satisfy World Wide Web client requests. A web server can, in general, contain one or more websites. A web server processes incoming network requests over HTTP and several other related protocols.

The primary function of a web server is to store, process and deliver web pages to clients.[52] The communication between client and server takes place using the Hypertext Transfer Protocol (HTTP). Pages delivered are most frequently HTML documents, which may include images, style sheets and scripts in addition to the text content.

Multiple web servers may be used for a high traffic website; here, Dell servers are installed together to be used for the Wikimedia Foundation.

A user agent, commonly a web browser or web crawler, initiates communication by making a request for a specific resource using HTTP and the server responds with the content of that resource or an error message if unable to do so. The resource is typically a real file on the server's secondary storage, but this is not necessarily the case and depends on how the webserver is implemented.

While the primary function is to serve content, full implementation of HTTP also includes ways of receiving content from clients. This feature is used for submitting web forms, including uploading of files.

Many generic web servers also support server-side scripting using Active Server Pages (ASP), PHP (Hypertext Preprocessor), or other scripting languages. This means that the behaviour of the webserver can be scripted in separate files, while the actual server software remains unchanged. Usually, this function is used to generate HTML documents dynamically ("on-the-fly") as opposed to returning static documents. The former is primarily used for retrieving or modifying information from databases. The latter is typically much faster and more easily cached but cannot deliver dynamic content.

Web servers can also frequently be found embedded in devices such as printers, routers, webcams and serving only a local network. The web server may then be used as a part of a system for monitoring or administering the device in question. This usually means that no additional software has to be installed on the client computer since only a web browser is required (which now is included with most operating systems).

Optical Networking

[edit]

Optical networking is a sophisticated infrastructure that utilizes optical fiber to transmit data over long distances, connecting countries, cities, and even private residences. The technology uses optical microsystems like tunable lasers, filters, attenuators, switches, and wavelength-selective switches to manage and operate these networks.[53][54]

The large quantity of optical fiber installed throughout the world at the end of the twentieth century set the foundation of the Internet as it’s used today. The information highway relies heavily on optical networking, a method of sending messages encoded in light to relay information in various telecommunication networks.[55]

The Advanced Research Projects Agency Network (ARPANET) was one of the first iterations of the Internet, created in collaboration with universities and researchers 1969.[56][57][58][59] However, access to the ARPANET was limited to researchers, and in 1985, the National Science Foundation founded the National Science Foundation Network (NSFNET), a program that provided supercomputer access to researchers.[59]

Limited public access to the Internet led to pressure from consumers and corporations to privatize the network. In 1993, the US passed the National Information Infrastructure Act, which dictated that the National Science Foundation must hand over control of the optical capabilities to commercial operators.[60][61]

The privatization of the Internet and the release of the World Wide Web to the public in 1993 led to an increased demand for Internet capabilities. This spurred developers to seek solutions to reduce the time and cost of laying new fiber and increase the amount of information that can be sent on a single fiber, in order to meet the growing needs of the public.[62][63][64][65]

In 1994, Pirelli S.p.A.'s optical components division introduced a wavelength-division multiplexing (WDM) system to meet growing demand for increased data transmission. This four-channel WDM technology allowed more information to be sent simultaneously over a single optical fiber, effectively boosting network capacity.[66][67]

Pirelli wasn't the only company that developed a WDM system; another company, the Ciena Corporation (Ciena), created its own technology to transmit data more efficiently. David Huber, an optical networking engineer and entrepreneur Kevin Kimberlin founded Ciena in 1992.[68][69][70] Drawing on laser technology from Gordon Gould and William Culver of Optelecom, Inc., the company focused on utilizing optical amplifiers to transmit data via light.[71][72][73] Under chief executive officer Pat Nettles, Ciena developed a dual-stage optical amplifier for dense wavelength-division multiplexing (DWDM), patented in 1997 and deployed on the Sprint network in 1996.[74][75][76][77][78]

[edit]

An HTTP cookie (also called web cookie, Internet cookie, browser cookie, or simply cookie) is a small piece of data sent from a website and stored on the user's computer by the user's web browser while the user is browsing. Cookies were designed to be a reliable mechanism for websites to remember stateful information (such as items added in the shopping cart in an online store) or to record the user's browsing activity (including clicking particular buttons, logging in, or recording which pages were visited in the past). They can also be used to remember arbitrary pieces of information that the user previously entered into form fields such as names, addresses, passwords, and credit card numbers.

Cookies perform essential functions in the modern web. Perhaps most importantly, authentication cookies are the most common method used by web servers to know whether the user is logged in or not, and which account they are logged in with. Without such a mechanism, the site would not know whether to send a page containing sensitive information or require the user to authenticate themselves by logging in. The security of an authentication cookie generally depends on the security of the issuing website and the user's web browser, and on whether the cookie data is encrypted. Security vulnerabilities may allow a cookie's data to be read by a hacker, used to gain access to user data, or used to gain access (with the user's credentials) to the website to which the cookie belongs (see cross-site scripting and cross-site request forgery for examples).[79]

Tracking cookies, and especially third-party tracking cookies, are commonly used as ways to compile long-term records of individuals' browsing histories – a potential privacy concern that prompted European[80] and U.S. lawmakers to take action in 2011.[81][82] European law requires that all websites targeting European Union member states gain "informed consent" from users before storing non-essential cookies on their device.

Google Project Zero researcher Jann Horn describes ways cookies can be read by intermediaries, like Wi-Fi hotspot providers. When in such circumstances, he recommends using the browser in private browsing mode (widely known as Incognito mode in Google Chrome).[83]

Search engine

[edit]
The results of a search for the term "lunar eclipse" in a web-based image search engine

A web search engine or Internet search engine is a software system that is designed to carry out web search (Internet search), which means to search the World Wide Web in a systematic way for particular information specified in a web search query. The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, videos, infographics, articles, research papers, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler. Internet content that is not capable of being searched by a web search engine is generally described as the deep web.

In 1990, Archie, the world’s first search engine, was released. The technology was originally an index of File Transfer Protocol (FTP) sites, which was a method for moving files between a client and a server network.[84][85] This early search tool was superseded by more advanced engines like Yahoo! in 1995 and Google in 1998.[86][87]

Deep web

[edit]
Deep web diagram
Deep web vs surface web
Surface Web & Deep Web

The deep web,[88] invisible web,[89] or hidden web[90] are parts of the World Wide Web whose contents are not indexed by standard web search engines. The opposite term to the deep web is the surface web, which is accessible to anyone using the Internet.[91] Computer scientist Michael K. Bergman is credited with coining the term deep web in 2001 as a search indexing term.[92]

The content of the deep web is hidden behind HTTP forms,[93][94] and includes many very common uses such as web mail, online banking, and services that users must pay for, and which is protected by a paywall, such as video on demand, some online magazines and newspapers, among others.

The content of the deep web can be located and accessed by a direct URL or IP address and may require a password or other security access past the public website page.

Caching

[edit]

A web cache is a server computer located either on the public Internet or within an enterprise that stores recently accessed web pages to improve response time for users when the same content is requested within a certain time after the original request. Most web browsers also implement a browser cache by writing recently obtained data to a local data storage device. HTTP requests by a browser may ask only for data that has changed since the last access. Web pages and resources may contain expiration information to control caching to secure sensitive data, such as in online banking, or to facilitate frequently updated sites, such as news media. Even sites with highly dynamic content may permit basic resources to be refreshed only occasionally. Web site designers find it worthwhile to collate resources such as CSS data and JavaScript into a few site-wide files so that they can be cached efficiently. Enterprise firewalls often cache Web resources requested by one user for the benefit of many users. Some search engines store cached content of frequently accessed websites.

Security

[edit]

For criminals, the Web has become a venue to spread malware and engage in a range of cybercrime, including (but not limited to) identity theft, fraud, espionage, and intelligence gathering.[95] Web-based vulnerabilities now outnumber traditional computer security concerns,[96][97] and as measured by Google, about one in ten web pages may contain malicious code.[98] Most web-based attacks take place on legitimate websites, and most, as measured by Sophos, are hosted in the United States, China and Russia.[99] The most common of all malware threats is SQL injection attacks against websites.[100] Through HTML and URIs, the Web was vulnerable to attacks like cross-site scripting (XSS) that came with the introduction of JavaScript[101] and were exacerbated to some degree by Web 2.0 and Ajax web design that favours the use of scripts.[102] In one 2007 estimate, 70% of all websites are open to XSS attacks on their users.[103] Phishing is another common threat to the Web. In February 2013, RSA (the security division of EMC) estimated the global losses from phishing at $1.5 billion in 2012.[104] Two of the well-known phishing methods are Covert Redirect and Open Redirect.

Proposed solutions vary. Large security companies like McAfee already design governance and compliance suites to meet post-9/11 regulations,[105] and some, like Finjan Holdings have recommended active real-time inspection of programming code and all content regardless of its source.[95] Some have argued that for enterprises to see Web security as a business opportunity rather than a cost centre,[106] while others call for "ubiquitous, always-on digital rights management" enforced in the infrastructure to replace the hundreds of companies that secure data and networks.[107] Jonathan Zittrain has said users sharing responsibility for computing safety is far preferable to locking down the Internet.[108]

Privacy

[edit]

Every time a client requests a web page, the server can identify the request's IP address. Web servers usually log IP addresses in a log file. Also, unless set not to do so, most web browsers record requested web pages in a viewable history feature, and usually cache much of the content locally. Unless the server-browser communication uses HTTPS encryption, web requests and responses travel in plain text across the Internet and can be viewed, recorded, and cached by intermediate systems. Another way to hide personally identifiable information is by using a virtual private network. A VPN encrypts traffic between the client and VPN server, and masks the original IP address, lowering the chance of user identification.

When a web page asks for, and the user supplies, personally identifiable information—such as their real name, address, e-mail address, etc. web-based entities can associate current web traffic with that individual. If the website uses HTTP cookies, username, and password authentication, or other tracking techniques, it can relate other web visits, before and after, to the identifiable information provided. In this way, a web-based organization can develop and build a profile of the individual people who use its site or sites. It may be able to build a record for an individual that includes information about their leisure activities, their shopping interests, their profession, and other aspects of their demographic profile. These profiles are of potential interest to marketers, advertisers, and others. Depending on the website's terms and conditions and the local laws that apply information from these profiles may be sold, shared, or passed to other organizations without the user being informed. For many ordinary people, this means little more than some unexpected emails in their inbox or some uncannily relevant advertising on a future web page. For others, it can mean that time spent indulging an unusual interest can result in a deluge of further targeted marketing that may be unwelcome. Law enforcement, counterterrorism, and espionage agencies can also identify, target, and track individuals based on their interests or proclivities on the Web.

Social networking sites usually try to get users to use their real names, interests, and locations, rather than pseudonyms, as their executives believe that this makes the social networking experience more engaging for users. On the other hand, uploaded photographs or unguarded statements can be identified to an individual, who may regret this exposure. Employers, schools, parents, and other relatives may be influenced by aspects of social networking profiles, such as text posts or digital photos, that the posting individual did not intend for these audiences. Online bullies may make use of personal information to harass or stalk users. Modern social networking websites allow fine-grained control of the privacy settings for each posting, but these can be complex and not easy to find or use, especially for beginners.[109] Photographs and videos posted onto websites have caused particular problems, as they can add a person's face to an online profile. With modern and potential facial recognition technology, it may then be possible to relate that face with other, previously anonymous, images, events, and scenarios that have been imaged elsewhere. Due to image caching, mirroring, and copying, it is difficult to remove an image from the World Wide Web.

Standards

[edit]

Web standards include many interdependent standards and specifications, some of which govern aspects of the Internet, not just the World Wide Web. Even when not web-focused, such standards directly or indirectly affect the development and administration of websites and web services. Considerations include the interoperability, accessibility and usability of web pages and web sites.

Web standards, in the broader sense, consist of the following:

Web standards are not fixed sets of rules but are constantly evolving sets of finalized technical specifications of web technologies.[116] Web standards are developed by standards organizations—groups of interested and often competing parties chartered with the task of standardization—not technologies developed and declared to be a standard by a single individual or company. It is crucial to distinguish those specifications that are under development from the ones that already reached the final development status (in the case of W3C specifications, the highest maturity level).

Accessibility

[edit]

There are methods for accessing the Web in alternative mediums and formats to facilitate use by individuals with disabilities. These disabilities may be visual, auditory, physical, speech-related, cognitive, neurological, or some combination. Accessibility features also help people with temporary disabilities, like a broken arm, or ageing users as their abilities change.[117] The Web is receiving information as well as providing information and interacting with society. The World Wide Web Consortium claims that it is essential that the Web be accessible, so it can provide equal access and equal opportunity to people with disabilities.[118] Tim Berners-Lee once noted, "The power of the Web is in its universality. Access by everyone regardless of disability is an essential aspect."[117] Many countries regulate web accessibility as a requirement for websites.[119] International co-operation in the W3C Web Accessibility Initiative led to simple guidelines that web content authors as well as software developers can use to make the Web accessible to persons who may or may not be using assistive technology.[117][120]

Internationalisation

[edit]
A global map of the Web Index for countries in 2014

The W3C Internationalisation Activity assures that web technology works in all languages, scripts, and cultures.[121] Beginning in 2004 or 2005, Unicode gained ground and eventually in December 2007 surpassed both ASCII and Western European as the Web's most frequently used character map.[122] Originally RFC 3986 allowed resources to be identified by URI in a subset of US-ASCII.

RFC 3987 allows more characters—any character in the Universal Character Set—and now a resource can be identified by IRI in any language.[123]

See also

[edit]

References

[edit]
  1. ^ Wright, Edmund, ed. (2006). The Desk Encyclopedia of World History. New York: Oxford University Press. p. 312. ISBN 978-0-7394-7809-7.
  2. ^ a b c "What is the difference between the Web and the Internet?". W3C Help and FAQ. W3C. 2009. Archived from the original on 9 July 2015. Retrieved 16 July 2015.
  3. ^ "World Wide Web (WWW) launches in the public domain | April 30, 1993". HISTORY. Retrieved 21 January 2025.
  4. ^ a b c Berners-Lee, Tim. "Information Management: A Proposal". w3.org. The World Wide Web Consortium. Archived from the original on 1 April 2010. Retrieved 12 February 2022.
  5. ^ "The World's First Web Site". HISTORY. 30 August 2018. Archived from the original on 19 August 2023. Retrieved 19 August 2023.
  6. ^ Bleigh, Michael (16 May 2014). "The Once And Future Web Platform". TechCrunch. Archived from the original on 5 December 2021. Retrieved 9 March 2022.
  7. ^ "World Wide Web Timeline". Pews Research Center. 11 March 2014. Archived from the original on 29 July 2015. Retrieved 1 August 2015.
  8. ^ Dewey, Caitlin (12 March 2014). "36 Ways The Web Has Changed Us". The Washington Post. Archived from the original on 9 September 2015. Retrieved 1 August 2015.
  9. ^ a b "Internet Live Stats". internetlivestats.com. Archived from the original on 2 July 2015. Retrieved 1 August 2015.
  10. ^ a b Quittner, Joshua (29 March 1999). "Network Designer Tim Berners-Lee". Time Magazine. Archived from the original on 15 August 2007. Retrieved 17 May 2010. He wove the World Wide Web and created a mass medium for the 21st century. The World Wide Web is Berners-Lee's alone. He designed it. He set it loose it on the world. And he more than anyone else has fought to keep it an open, non-proprietary and free.[page needed]
  11. ^ a b McPherson, Stephanie Sammartino (2009). Tim Berners-Lee: Inventor of the World Wide Web. Twenty-First Century Books. ISBN 978-0-8225-7273-2.
  12. ^ Rutter, Dorian (2005). From Diversity to Convergence: British Computer Networks and the Internet, 1970-1995 (PDF) (Computer Science thesis). The University of Warwick. Archived (PDF) from the original on 10 October 2022. Retrieved 27 December 2022.
  13. ^ Tim Berners-Lee (1999). Weaving the Web. Internet Archive. HarperSanFrancisco. pp. 5–6. ISBN 978-0-06-251586-5.
  14. ^ Berners-Lee, T.; Cailliau, R.; Groff, J.-F.; Pollermann, B. (1992). "World-Wide Web: The Information Universe". Electron. Netw. Res. Appl. Policy. 2: 52–58. doi:10.1108/eb047254. ISSN 1066-2243. Archived from the original on 27 December 2022. Retrieved 27 December 2022.
  15. ^ W3 (1991) Re: Qualifiers on Hypertext links Archived 7 December 2021 at the Wayback Machine
  16. ^ Hopgood, Bob. "History of the Web". w3.org. The World Wide Web Consortium. Archived from the original on 21 March 2022. Retrieved 12 February 2022.
  17. ^ "A short history of the Web". CERN. Archived from the original on 17 April 2022. Retrieved 15 April 2022.
  18. ^ "Software release of WWW into public domain". CERN Document Server. CERN. 30 January 1993. Archived from the original on 17 February 2022. Retrieved 17 February 2022.
  19. ^ "Ten Years Public Domain for the Original Web Software". Tenyears-www.web.cern.ch. 30 April 2003. Archived from the original on 13 August 2009. Retrieved 27 July 2009.
  20. ^ Calore, Michael (22 April 2010). "April 22, 1993: Mosaic Browser Lights Up Web With Color, Creativity". Wired. Archived from the original on 24 April 2018. Retrieved 12 February 2022.
  21. ^ Couldry, Nick (2012). Media, Society, World: Social Theory and Digital Media Practice. London: Polity Press. p. 2. ISBN 9780745639208. Archived from the original on 27 February 2024. Retrieved 11 December 2020.
  22. ^ Hoffman, Jay (21 April 1993). "The Origin of the IMG Tag". The History of the Web. Archived from the original on 13 February 2022. Retrieved 13 February 2022.
  23. ^ Clarke, Roger. "The Birth of Web Commerce". Roger Clarke's Web-Site. XAMAX. Archived from the original on 15 February 2022. Retrieved 15 February 2022.
  24. ^ McCullough, Brian. "20 YEARS ON: WHY NETSCAPE'S IPO WAS THE "BIG BANG" OF THE INTERNET ERA". www.internethistorypodcast.com. INTERNET HISTORY PODCAST. Archived from the original on 12 February 2022. Retrieved 12 February 2022.
  25. ^ Calore, Michael (28 September 2009). "Sept. 28, 1998: Internet Explorer Leaves Netscape in Its Wake". Wired. Archived from the original on 30 November 2021. Retrieved 14 February 2022.
  26. ^ Daly, Janet (26 January 2000). "World Wide Web Consortium Issues XHTML 1.0 as a Recommendation". W3C. Archived from the original on 20 June 2021. Retrieved 8 March 2022.
  27. ^ Hickson, Ian. "WHAT open mailing list announcement". whatwg.org. WHATWG. Archived from the original on 8 March 2022. Retrieved 16 February 2022.
  28. ^ Shankland, Stephen (9 July 2009). "An epitaph for the Web standard, XHTML 2". CNet. Archived from the original on 16 February 2022. Retrieved 17 February 2022.
  29. ^ "Memorandum of Understanding Between W3C and WHATWG". W3C. Archived from the original on 29 May 2019. Retrieved 16 February 2022.
  30. ^ In, Lee (30 June 2012). Electronic Commerce Management for Business Activities and Global Enterprises: Competitive Advantages: Competitive Advantages. IGI Global. ISBN 978-1-4666-1801-5. Archived from the original on 21 April 2024. Retrieved 27 September 2020.
  31. ^ Misiroglu, Gina (26 March 2015). American Countercultures: An Encyclopedia of Nonconformists, Alternative Lifestyles, and Radical Ideas in U.S. History: An Encyclopedia of Nonconformists, Alternative Lifestyles, and Radical Ideas in U.S. History. Routledge. ISBN 978-1-317-47729-7. Archived from the original on 21 April 2024. Retrieved 27 September 2020.
  32. ^ "World Wide Web Timeline". Pew Research Center. 11 March 2014. Archived from the original on 29 July 2015. Retrieved 1 August 2015.
  33. ^ "Frequently asked questions - Spelling of WWW". W3C. Archived from the original on 2 August 2009. Retrieved 27 July 2009.
  34. ^ Castelluccio, Michael (1 October 2010). "It's not your grandfather's Internet". Strategic Finance. Institute of Management Accountants. Archived from the original on 5 March 2016. Retrieved 7 February 2016 – via The Free Library.
  35. ^ "Audible pronunciation of 'WWW'". Oxford University Press. Archived from the original on 25 May 2014. Retrieved 25 May 2014.
  36. ^ Harvey, Charlie (18 August 2015). "How we pronounce WWW in English: a detailed but unscientific survey". charlieharvey.org.uk. Archived from the original on 19 November 2022. Retrieved 19 May 2022.
  37. ^ "Stephen Fry's pronunciation of 'WWW'". Podcasts.com. Archived from the original on 4 April 2017.
  38. ^ Simonite, Tom (22 July 2008). "Help us find a better way to pronounce www". newscientist.com. New Scientist, Technology. Archived from the original on 13 March 2016. Retrieved 7 February 2016.
  39. ^ Muylle, Steve; Moenaert, Rudy; Despont, Marc (1999). "A grounded theory of World Wide Web search behaviour". Journal of Marketing Communications. 5 (3): 143. doi:10.1080/135272699345644.
  40. ^ Flanagan, David. JavaScript – The definitive guide (6 ed.). p. 1. JavaScript is part of the triad of technologies that all Web developers must learn: HTML to specify the content of web pages, CSS to specify the presentation of web pages, and JavaScript to specify the behaviour of web pages.
  41. ^ "HTML 4.0 Specification – W3C Recommendation – Conformance: requirements and recommendations". World Wide Web Consortium. 18 December 1997. Archived from the original on 5 July 2015. Retrieved 6 July 2015.
  42. ^ Berners-Lee, Tim; Cailliau, Robert (12 November 1990). "WorldWideWeb: Proposal for a HyperText Project". Archived from the original on 2 May 2015. Retrieved 12 May 2015.
  43. ^ Berners-Lee, Tim. "Frequently asked questions by the Press". W3C. Archived from the original on 2 August 2009. Retrieved 27 July 2009.
  44. ^ Palazzi, P (2011). "The Early Days of the WWW at CERN". Archived from the original on 23 July 2012.
  45. ^ Fraser, Dominic (13 May 2018). "Why a domain's root can't be a CNAME – and other tidbits about the DNS". FreeCodeCamp. Archived from the original on 21 April 2024. Retrieved 12 March 2019.
  46. ^ "automatically adding www.___.com". mozillaZine. 16 May 2003. Archived from the original on 27 June 2009. Retrieved 27 May 2009.
  47. ^ Masnick, Mike (7 July 2008). "Microsoft Patents Adding 'www.' And '.com' To Text". Techdirt. Archived from the original on 27 June 2009. Retrieved 27 May 2009.
  48. ^ a b Hamilton, Naomi (31 July 2008). "The A-Z of Programming Languages: JavaScript". Computerworld. IDG. Archived from the original on 24 May 2009. Retrieved 12 May 2009.
  49. ^ Buntin, Seth (23 September 2008). "jQuery Polling plugin". Archived from the original on 13 August 2009. Retrieved 22 August 2009.
  50. ^ "website". TheFreeDictionary.com. Archived from the original on 7 May 2018. Retrieved 2 July 2011.
  51. ^ www.similarweb.com https://www.similarweb.com/browsers/. Retrieved 15 February 2025. cite web: Missing or empty |title= (help)
  52. ^ Patrick, Killelea (2002). Web performance tuning (2nd ed.). Beijing: O'Reilly. p. 264. ISBN 978-0596001728. OCLC 49502686.
  53. ^ Liu, Xiang (20 December 2019). "Evolution of Fiber-Optic Transmission and Networking toward the 5G Era". iScience. 22: 489–506. Bibcode:2019iSci...22..489L. doi:10.1016/j.isci.2019.11.026. ISSN 2589-0042. PMC 6920305. PMID 31838439.
  54. ^ Marom, Dan M. (1 January 2008), Gianchandani, Yogesh B.; Tabata, Osamu; Zappe, Hans (eds.), "3.07 - Optical Communications", Comprehensive Microsystems, Oxford: Elsevier, pp. 219–265, doi:10.1016/b978-044452190-3.00035-5, ISBN 978-0-444-52190-3, retrieved 17 January 2025
  55. ^ Chadha, Devi (2019). Optical WDM networks: from static to elastic networks. Hoboken, NJ: Wiley-IEEE Press. ISBN 978-1-119-39326-9.
  56. ^ "The Computer History Museum, SRI International, and BBN Celebrate the 40th Anniversary of First ARPANET Transmission, Precursor to Today's Internet | SRI International". 29 March 2019. Archived from the original on 29 March 2019. Retrieved 21 January 2025.
  57. ^ Markoff, John (24 January 1993). "Building the Electronic Superhighway". The New York Times. ISSN 0362-4331. Retrieved 21 January 2025.
  58. ^ Abbate, Janet (2000). Inventing the Internet. Inside technology (3rd printing ed.). Cambridge, Mass.: MIT Press. ISBN 978-0-262-51115-5.
  59. ^ a b www.merit.edu http://web.archive.org/web/20241106150721/https://www.merit.edu/wp-content/uploads/2019/06/NSFNET_final-1.pdf. Archived from the original (PDF) on 6 November 2024. Retrieved 21 January 2025. cite web: Missing or empty |title= (help)
  60. ^ Rep. Boucher, Rick [D-VA-9 (14 September 1993). "H.R.1757 - 103rd Congress (1993-1994): National Information Infrastructure Act of 1993". www.congress.gov. Retrieved 23 January 2025.cite web: CS1 maint: numeric names: authors list (link)
  61. ^ "NSF Shapes the Internet's Evolution | NSF - National Science Foundation". new.nsf.gov. 25 July 2003. Retrieved 23 January 2025.
  62. ^ Radu, Roxana (7 March 2019), Radu, Roxana (ed.), "Privatization and Globalization of the Internet", Negotiating Internet Governance, Oxford University Press, pp. 75–112, doi:10.1093/oso/9780198833079.003.0004, ISBN 978-0-19-883307-9, retrieved 23 January 2025
  63. ^ "Birth of the Commercial Internet - NSF Impacts | NSF - National Science Foundation". new.nsf.gov. Retrieved 23 January 2025.
  64. ^ Markoff, John (3 March 1997). "Fiber-Optic Technology Draws Record Stock Value". The New York Times. ISSN 0362-4331. Retrieved 23 January 2025.
  65. ^ Paul Korzeniowski, “Record Growth Spurs Demand for Dense WDM -- Infrastructure Bandwidth Gears up for next Wave,” CommunicationsWeek, no. 666 (June 2, 1997): T.40.
  66. ^ Hecht, Jeff (1999). City of light: the story of fiber optics. The Sloan technology series. New York: Oxford University Press. ISBN 978-0-19-510818-7.
  67. ^ "Cisco to Acquire Pirelli DWDM Unit for $2.15 Billion". www.fiberopticsonline.com. Retrieved 31 January 2025.
  68. ^ Hirsch, Stacey (February 2, 2006). "Huber steps down as CEO of Broadwing". The Baltimore Sun.
  69. ^ "Dr. David Huber". History of the Internet. Retrieved 3 February 2025.
  70. ^ "Internet Commercialization History". History of the Internet. Retrieved 3 February 2025.
  71. ^ "May 17, 1993, page 76 - The Baltimore Sun at Baltimore Sun". Newspapers.com. Retrieved 3 February 2025.
  72. ^ Hall, Carla. “Inventor Beams over Laser Patents : After 30 Years, Gordon Gould Gets Credit He Deserves.” Los Angeles Times, Los Angeles Times, 17 Dec. 1987.
  73. ^ Chang, Kenneth (20 September 2005). "Gordon Gould, 85, Figure in Invention of the Laser, Dies". The New York Times. ISSN 0362-4331. Retrieved 3 February 2025.
  74. ^ Carroll, Jim (12 December 2024). "Patrick Nettles Steps Down as Executive Chair of Ciena". Converge Digest. Retrieved 3 February 2025.
  75. ^ US5696615A, Alexander, Stephen B., "Wavelength division multiplexed optical communication systems employing uniform gain optical amplifiers", issued 1997-12-09 
  76. ^ Hecht, Jeff (2004). City of light: the story of fiber optics. The Sloan technology series (Rev. and expanded ed., 1. paperback [ed.] ed.). Oxford: Oxford Univ. Press. ISBN 978-0-19-510818-7.
  77. ^ "Optica Publishing Group". opg.optica.org. Retrieved 3 February 2025.
  78. ^ "Sprint boots some users off 'Net - ProQuest". www.proquest.com. ProQuest 215944575. Retrieved 3 February 2025.
  79. ^ Vamosi, Robert (14 April 2008). "Gmail cookie stolen via Google Spreadsheets". News.cnet.com. Archived from the original on 9 December 2013. Retrieved 19 October 2017.
  80. ^ "What about the "EU Cookie Directive"?". WebCookies.org. 2013. Archived from the original on 11 October 2017. Retrieved 19 October 2017.
  81. ^ "New net rules set to make cookies crumble". BBC. 8 March 2011. Archived from the original on 10 August 2018. Retrieved 18 February 2019.
  82. ^ "Sen. Rockefeller: Get Ready for a Real Do-Not-Track Bill for Online Advertising". Adage.com. 6 May 2011. Archived from the original on 24 August 2011. Retrieved 18 February 2019.
  83. ^ Want to use my wifi? Archived 4 January 2018 at the Wayback Machine, Jann Horn accessed 5 January 2018.
  84. ^ Nguyen, Jennimai (10 September 2020). "Archie, the very first search engine, was released 30 years ago today". Mashable. Retrieved 4 February 2025.
  85. ^ "What is File Transfer Protocol (FTP) meaning". Fortinet. Retrieved 4 February 2025.
  86. ^ "Britannica Money". www.britannica.com. 4 February 2025. Retrieved 4 February 2025.
  87. ^ Clark, Andrew (1 February 2008). "How Jerry's guide to the world wide web became Yahoo". The Guardian. ISSN 0261-3077. Retrieved 4 February 2025.
  88. ^ Hamilton, Nigel (13 May 2024). "The Mechanics of a Deep Net Metasearch Engine". IADIS Digital Library: 1034–1036. ISBN 978-972-98947-0-1.
  89. ^ Devine, Jane; Egger-Sider, Francine (July 2004). "Beyond google: the invisible web in the academic library". The Journal of Academic Librarianship. 30 (4): 265–269. doi:10.1016/j.acalib.2004.04.010.
  90. ^ Raghavan, Sriram; Garcia-Molina, Hector (11–14 September 2001). "Crawling the Hidden Web". 27th International Conference on Very Large Data Bases. Archived from the original on 17 August 2019. Retrieved 18 February 2019.
  91. ^ "Surface Web". Computer Hope. Archived from the original on 5 May 2020. Retrieved 20 June 2018.
  92. ^ Wright, Alex (22 February 2009). "Exploring a 'Deep Web' That Google Can't Grasp". The New York Times. Archived from the original on 1 March 2020. Retrieved 23 February 2009.
  93. ^ Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., & Halevy, A. (2008). Google's deep web crawl. Proceedings of the VLDB Endowment, 1(2), 1241–52.
  94. ^ Shedden, Sam (8 June 2014). "How Do You Want Me to Do It? Does It Have to Look like an Accident? – an Assassin Selling a Hit on the Net; Revealed Inside the Deep Web". Sunday Mail. Archived from the original on 1 March 2020. Retrieved 5 May 2017.
  95. ^ a b Ben-Itzhak, Yuval (18 April 2008). "Infosecurity 2008 – New defence strategy in battle against e-crime". ComputerWeekly. Reed Business Information. Archived from the original on 4 June 2008. Retrieved 20 April 2008.
  96. ^ Christey, Steve & Martin, Robert A. (22 May 2007). "Vulnerability Type Distributions in CVE (version 1.1)". MITRE Corporation. Archived from the original on 17 March 2013. Retrieved 7 June 2008.
  97. ^ "Symantec Internet Security Threat Report: Trends for July–December 2007 (Executive Summary)" (PDF). Symantec Internet Security Threat Report. XIII. Symantec Corp.: 1–2 April 2008. Archived from the original (PDF) on 25 June 2008. Retrieved 11 May 2008.
  98. ^ "Google searches web's dark side". BBC News. 11 May 2007. Archived from the original on 7 March 2008. Retrieved 26 April 2008.
  99. ^ "Security Threat Report (Q1 2008)" (PDF). Sophos. Archived (PDF) from the original on 31 December 2013. Retrieved 24 April 2008.
  100. ^ "Security threat report" (PDF). Sophos. July 2008. Archived (PDF) from the original on 31 December 2013. Retrieved 24 August 2008.
  101. ^ Jeremiah Grossman; Robert "RSnake" Hansen; Petko "pdp" D. Petkov; Anton Rager; Seth Fogie (2007). Cross Site Scripting Attacks: XSS Exploits and Defense (PDF). Syngress, Elsevier Science & Technology. pp. 68–69, 127. ISBN 978-1-59749-154-9. Archived (PDF) from the original on 15 November 2024. Retrieved 23 January 2025.
  102. ^ O'Reilly, Tim (30 September 2005). "What Is Web 2.0". O'Reilly Media. pp. 4–5. Archived from the original on 28 June 2012. Retrieved 4 June 2008. and AJAX web applications can introduce security vulnerabilities like "client-side security controls, increased attack surfaces, and new possibilities for Cross-Site Scripting (XSS)", in Ritchie, Paul (March 2007). "The security risks of AJAX/web 2.0 applications" (PDF). Infosecurity. Archived from the original (PDF) on 25 June 2008. Retrieved 6 June 2008. which cites Hayre, Jaswinder S. & Kelath, Jayasankar (22 June 2006). "Ajax Security Basics". SecurityFocus. Archived from the original on 15 May 2008. Retrieved 6 June 2008.
  103. ^ Berinato, Scott (1 January 2007). "Software Vulnerability Disclosure: The Chilling Effect". CSO. CXO Media. p. 7. Archived from the original on 18 April 2008. Retrieved 7 June 2008.
  104. ^ "2012 Global Losses From phishing Estimated At $1.5 Bn". FirstPost. 20 February 2013. Archived from the original on 21 December 2014. Retrieved 25 January 2019.
  105. ^ Prince, Brian (9 April 2008). "McAfee Governance, Risk and Compliance Business Unit". eWEEK. Ziff Davis Enterprise Holdings. Archived from the original on 21 April 2024. Retrieved 25 April 2008.
  106. ^ Preston, Rob (12 April 2008). "Down To Business: It's Past Time To Elevate The Infosec Conversation". InformationWeek. United Business Media. Archived from the original on 14 April 2008. Retrieved 25 April 2008.
  107. ^ Claburn, Thomas (6 February 2007). "RSA's Coviello Predicts Security Consolidation". InformationWeek. United Business Media. Archived from the original on 7 February 2009. Retrieved 25 April 2008.
  108. ^ Duffy Marsan, Carolyn (9 April 2008). "How the iPhone is killing the 'Net". Network World. IDG. Archived from the original on 14 April 2008. Retrieved 17 April 2008.
  109. ^ boyd, danah; Hargittai, Eszter (July 2010). "Facebook privacy settings: Who cares?". First Monday. 15 (8). doi:10.5210/fm.v15i8.3086.
  110. ^ "W3C Technical Reports and Publications". W3C. Archived from the original on 15 July 2018. Retrieved 19 January 2009.
  111. ^ "IETF RFC page". IETF. Archived from the original on 2 February 2009. Retrieved 19 January 2009.
  112. ^ "Search for World Wide Web in ISO standards". ISO. Archived from the original on 4 March 2016. Retrieved 19 January 2009.
  113. ^ "Ecma formal publications". Ecma. Archived from the original on 27 December 2017. Retrieved 19 January 2009.
  114. ^ "Unicode Technical Reports". Unicode Consortium. Archived from the original on 2 January 2022. Retrieved 19 January 2009.
  115. ^ "IANA home page". IANA. Archived from the original on 24 February 2011. Retrieved 19 January 2009.
  116. ^ Sikos, Leslie (2011). Web standards – Mastering HTML5, CSS3, and XML. Apress. ISBN 978-1-4302-4041-9. Archived from the original on 2 April 2015. Retrieved 12 March 2019.
  117. ^ a b c "Web Accessibility Initiative (WAI)". World Wide Web Consortium. Archived from the original on 2 April 2009. Retrieved 7 April 2009.
  118. ^ "Developing a Web Accessibility Business Case for Your Organization: Overview". World Wide Web Consortium. Archived from the original on 14 April 2009. Retrieved 7 April 2009.
  119. ^ "Legal and Policy Factors in Developing a Web Accessibility Business Case for Your Organization". World Wide Web Consortium. Archived from the original on 5 April 2009. Retrieved 7 April 2009.
  120. ^ "Web Content Accessibility Guidelines (WCAG) Overview". World Wide Web Consortium. Archived from the original on 1 April 2009. Retrieved 7 April 2009.
  121. ^ "Internationalization (I18n) Activity". World Wide Web Consortium. Archived from the original on 16 April 2009. Retrieved 10 April 2009.
  122. ^ Davis, Mark (5 April 2008). "Moving to Unicode 5.1". Archived from the original on 21 May 2009. Retrieved 10 April 2009.
  123. ^ "World Wide Web Consortium Supports the IETF URI Standard and IRI Proposed Standard" (Press release). World Wide Web Consortium. 26 January 2005. Archived from the original on 7 February 2009. Retrieved 10 April 2009.

Further reading

[edit]
  • Berners-Lee, Tim; Bray, Tim; Connolly, Dan; Cotton, Paul; Fielding, Roy; Jeckle, Mario; Lilley, Chris; Mendelsohn, Noah; Orchard, David; Walsh, Norman; Williams, Stuart (15 December 2004). "Architecture of the World Wide Web, Volume One". W3C. Version 20041215.
  • Berners-Lee, Tim (August 1996). "The World Wide Web: Past, Present and Future". W3C.
  • Brügger, Niels, ed, Web25: Histories from the first 25 years of the World Wide Web (Peter Lang, 2017).
  • Fielding, R.; Gettys, J.; Mogul, J.; Frystyk, H.; Masinter, L.; Leach, P.; Berners-Lee, T. (June 1999). "Hypertext Transfer Protocol – HTTP/1.1". Request For Comments 2616. Information Sciences Institute.
  • Niels Brügger, ed. Web History (2010) 362 pages; Historical perspective on the World Wide Web, including issues of culture, content, and preservation.
  • Polo, Luciano (2003). "World Wide Web Technology Architecture: A Conceptual Analysis". New Devices.
  • Skau, H.O. (March 1990). "The World Wide Web and Health Information". New Devices.
[edit]

 

 

Architecture of a Web crawler

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).[1]

Web search engines and some other websites use Web crawling or spidering software to update their web content or indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.

Crawlers consume resources on visited systems and often visit sites unprompted. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For example, including a robots.txt file can request bots to index only parts of a website, or nothing at all.

The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today, relevant results are given almost instantly.

Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping and data-driven programming.

Nomenclature

[edit]

A web crawler is also known as a spider,[2] an ant, an automatic indexer,[3] or (in the FOAF software context) a Web scutter.[4]

Overview

[edit]

A Web crawler starts with a list of URLs to visit. Those first URLs are called the seeds. As the crawler visits these URLs, by communicating with web servers that respond to those URLs, it identifies all the hyperlinks in the retrieved web pages and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. If the crawler is performing archiving of websites (or web archiving), it copies and saves the information as it goes. The archives are usually stored in such a way they can be viewed, read and navigated as if they were on the live web, but are preserved as 'snapshots'.[5]

The archive is known as the repository and is designed to store and manage the collection of web pages. The repository only stores HTML pages and these pages are stored as distinct files. A repository is similar to any other system that stores data, like a modern-day database. The only difference is that a repository does not need all the functionality offered by a database system. The repository stores the most recent version of the web page retrieved by the crawler.[citation needed]

The large volume implies the crawler can only download a limited number of the Web pages within a given time, so it needs to prioritize its downloads. The high rate of change can imply the pages might have already been updated or even deleted.

The number of possible URLs crawled being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained."[6] A crawler must carefully choose at each step which pages to visit next.

Crawling policy

[edit]

The behavior of a Web crawler is the outcome of a combination of policies:[7]

  • a selection policy which states the pages to download,
  • a re-visit policy which states when to check for changes to the pages,
  • a politeness policy that states how to avoid overloading websites.
  • a parallelization policy that states how to coordinate distributed web crawlers.

Selection policy

[edit]

Given the current size of the Web, even large search engines cover only a portion of the publicly available part. A 2009 study showed even large-scale search engines index no more than 40–70% of the indexable Web;[8] a previous study by Steve Lawrence and Lee Giles showed that no search engine indexed more than 16% of the Web in 1999.[9] As a crawler always downloads just a fraction of the Web pages, it is highly desirable for the downloaded fraction to contain the most relevant pages and not just a random sample of the Web.

This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Junghoo Cho et al. made the first study on policies for crawling scheduling. Their data set was a 180,000-pages crawl from the stanford.edu domain, in which a crawling simulation was done with different strategies.[10] The ordering metrics tested were breadth-first, backlink count and partial PageRank calculations. One of the conclusions was that if the crawler wants to download pages with high Pagerank early during the crawling process, then the partial Pagerank strategy is the better, followed by breadth-first and backlink-count. However, these results are for just a single domain. Cho also wrote his PhD dissertation at Stanford on web crawling.[11]

Najork and Wiener performed an actual crawl on 328 million pages, using breadth-first ordering.[12] They found that a breadth-first crawl captures pages with high Pagerank early in the crawl (but they did not compare this strategy against other strategies). The explanation given by the authors for this result is that "the most important pages have many links to them from numerous hosts, and those links will be found early, regardless of on which host or page the crawl originates."

Abiteboul designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation).[13] In OPIC, each page is given an initial sum of "cash" that is distributed equally among the pages it points to. It is similar to a PageRank computation, but it is faster and is only done in one step. An OPIC-driven crawler downloads first the pages in the crawling frontier with higher amounts of "cash". Experiments were carried in a 100,000-pages synthetic graph with a power-law distribution of in-links. However, there was no comparison with other strategies nor experiments in the real Web.

Boldi et al. used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against depth-first, random ordering and an omniscient strategy. The comparison was based on how well PageRank computed on a partial crawl approximates the true PageRank value. Some visits that accumulate PageRank very quickly (most notably, breadth-first and the omniscient visit) provide very poor progressive approximations.[14][15]

Baeza-Yates et al. used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domain, testing several crawling strategies.[16] They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one.

Daneshpajouh et al. designed a community based algorithm for discovering good seeds.[17] Their method crawls web pages with high PageRank from different communities in less iteration in comparison with crawl starting from random seeds. One can extract good seed from a previously-crawled-Web graph using this new method. Using these seeds, a new crawl can be very effective.

[edit]

A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may examine the URL and only request a resource if the URL ends with certain characters such as .html, .htm, .asp, .aspx, .php, .jsp, .jspx or a slash. This strategy may cause numerous HTML Web resources to be unintentionally skipped.

Some crawlers may also avoid requesting any resources that have a "?" in them (are dynamically produced) in order to avoid spider traps that may cause the crawler to download an infinite number of URLs from a Web site. This strategy is unreliable if the site uses URL rewriting to simplify its URLs.

URL normalization

[edit]

Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. There are several types of normalization that may be performed including conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component.[18]

Path-ascending crawling

[edit]

Some crawlers intend to download/upload as many resources as possible from a particular web site. So path-ascending crawler was introduced that would ascend to every path in each URL that it intends to crawl.[19] For example, when given a seed URL of http://llama.org/hamster/monkey/page.html, it will attempt to crawl /hamster/monkey/, /hamster/, and /. Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling.

Focused crawling

[edit]

The importance of a page for a crawler can also be expressed as a function of the similarity of a page to a given query. Web crawlers that attempt to download pages that are similar to each other are called focused crawler or topical crawlers. The concepts of topical and focused crawling were first introduced by Filippo Menczer[20][21] and by Soumen Chakrabarti et al.[22]

The main problem in focused crawling is that in the context of a Web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton[23] in the first web crawler of the early days of the Web. Diligenti et al.[24] propose using the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of a focused crawling depends mostly on the richness of links in the specific topic being searched, and a focused crawling usually relies on a general Web search engine for providing starting points.

Academic focused crawler
[edit]

An example of the focused crawlers are academic crawlers, which crawls free-access academic related documents, such as the citeseerxbot, which is the crawler of CiteSeerX search engine. Other academic search engines are Google Scholar and Microsoft Academic Search etc. Because most academic papers are published in PDF formats, such kind of crawler is particularly interested in crawling PDF, PostScript files, Microsoft Word including their zipped formats. Because of this, general open-source crawlers, such as Heritrix, must be customized to filter out other MIME types, or a middleware is used to extract these documents out and import them to the focused crawl database and repository.[25] Identifying whether these documents are academic or not is challenging and can add a significant overhead to the crawling process, so this is performed as a post crawling process using machine learning or regular expression algorithms. These academic documents are usually obtained from home pages of faculties and students or from publication page of research institutes. Because academic documents make up only a small fraction of all web pages, a good seed selection is important in boosting the efficiencies of these web crawlers.[26] Other academic crawlers may download plain text and HTML files, that contains metadata of academic papers, such as titles, papers, and abstracts. This increases the overall number of papers, but a significant fraction may not provide free PDF downloads.

Semantic focused crawler
[edit]

Another type of focused crawlers is semantic focused crawler, which makes use of domain ontologies to represent topical maps and link Web pages with relevant ontological concepts for the selection and categorization purposes.[27] In addition, ontologies can be automatically updated in the crawling process. Dong et al.[28] introduced such an ontology-learning-based crawler using a support-vector machine to update the content of ontological concepts when crawling Web pages.

Re-visit policy

[edit]

The Web has a very dynamic nature, and crawling a fraction of the Web can take weeks or months. By the time a Web crawler has finished its crawl, many events could have happened, including creations, updates, and deletions.

From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most-used cost functions are freshness and age.[29]

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository, at time t is defined as:

Coffman et al. worked with a definition of the objective of a Web crawler that is equivalent to freshness, but use a different wording: they propose that a crawler must minimize the fraction of time pages remain outdated. They also noted that the problem of Web crawling can be modeled as a multiple-queue, single-server polling system, on which the Web crawler is the server and the Web sites are the queues. Page modifications are the arrival of the customers, and switch-over times are the interval between page accesses to a single Web site. Under this model, mean waiting time for a customer in the polling system is equivalent to the average age for the Web crawler.[30]

The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are outdated, while in the second case, the crawler is concerned with how old the local copies of pages are.

Evolution of Freshness and Age in a web crawler

Two simple re-visiting policies were studied by Cho and Garcia-Molina:[31]

  • Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
  • Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.

In both cases, the repeated crawling order of pages can be done either in a random or a fixed order.

Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. Intuitively, the reasoning is that, as web crawlers have a limit to how many pages they can crawl in a given time frame, (1) they will allocate too many new crawls to rapidly changing pages at the expense of less frequently updating pages, and (2) the freshness of rapidly changing pages lasts for shorter period than that of less frequently changing pages. In other words, a proportional policy allocates more resources to crawling frequently updating pages, but experiences less overall freshness time from them.

To improve freshness, the crawler should penalize the elements that change too often.[32] The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. In both cases, the optimal is closer to the uniform policy than to the proportional policy: as Coffman et al. note, "in order to minimize the expected obsolescence time, the accesses to any particular page should be kept as evenly spaced as possible".[30] Explicit formulas for the re-visit policy are not attainable in general, but they are obtained numerically, as they depend on the distribution of page changes. Cho and Garcia-Molina show that the exponential distribution is a good fit for describing page changes,[32] while Ipeirotis et al. show how to use statistical tools to discover parameters that affect this distribution.[33] The re-visiting policies considered here regard all pages as homogeneous in terms of quality ("all pages on the Web are worth the same"), something that is not a realistic scenario, so further information about the Web page quality should be included to achieve a better crawling policy.

Politeness policy

[edit]

Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. If a single crawler is performing multiple requests per second and/or downloading large files, a server can have a hard time keeping up with requests from multiple crawlers.

As noted by Koster, the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community.[34] The costs of using Web crawlers include:

  • network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time;
  • server overload, especially if the frequency of accesses to a given server is too high;
  • poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle; and
  • personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol that is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers.[35] This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Recently commercial search engines like Google, Ask Jeeves, MSN and Yahoo! Search are able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests.

The first proposed interval between successive pageloads was 60 seconds.[36] However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download only that entire Web site; also, only a fraction of the resources from that Web server would be used.

Cho uses 10 seconds as an interval for accesses,[31] and the WIRE crawler uses 15 seconds as the default.[37] The MercatorWeb crawler follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page.[38] Dill et al. use 1 second.[39]

For those using Web crawlers for research purposes, a more detailed cost-benefit analysis is needed and ethical considerations should be taken into account when deciding where to crawl and how fast to crawl.[40]

Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes. It is worth noticing that even when being very polite, and taking all the safeguards to avoid overloading Web servers, some complaints from Web server administrators are received. Sergey Brin and Larry Page noted in 1998, "... running a crawler which connects to more than half a million servers ... generates a fair amount of e-mail and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen."[41]

Parallelization policy

[edit]

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.

Architectures

[edit]
High-level architecture of a standard Web crawler

A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture.

Shkapenyuk and Suel noted that:[42]

While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.

Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.

Security

[edit]

While most of the website owners are keen to have their pages indexed as broadly as possible to have strong presence in search engines, web crawling can also have unintended consequences and lead to a compromise or data breach if a search engine indexes resources that should not be publicly available, or pages revealing potentially vulnerable versions of software.

Apart from standard web application security recommendations website owners can reduce their exposure to opportunistic hacking by only allowing search engines to index the public parts of their websites (with robots.txt) and explicitly blocking them from indexing transactional parts (login pages, private pages, etc.).

Crawler identification

[edit]

Web crawlers typically identify themselves to a Web server by using the User-agent field of an HTTP request. Web site administrators typically examine their Web servers' log and use the user agent field to determine which crawlers have visited the web server and how often. The user agent field may include a URL where the Web site administrator may find out more information about the crawler. Examining Web server log is tedious task, and therefore some administrators use tools to identify, track and verify Web crawlers. Spambots and other malicious Web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or other well-known crawler.

Web site administrators prefer Web crawlers to identify themselves so that they can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap or they may be overloading a Web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators that are interested in knowing when they may expect their Web pages to be indexed by a particular search engine.

Crawling the deep web

[edit]

A vast amount of web pages lie in the deep or invisible web.[43] These pages are typically only accessible by submitting queries to a database, and regular crawlers are unable to find these pages if there are no links that point to them. Google's Sitemaps protocol and mod oai[44] are intended to allow discovery of these deep-Web resources.

Deep web crawling also multiplies the number of web links to be crawled. Some crawlers only take some of the URLs in <a href="URL"> form. In some cases, such as the Googlebot, Web crawling is done on all text contained inside the hypertext content, tags, or text.

Strategic approaches may be taken to target deep Web content. With a technique called screen scraping, specialized software may be customized to automatically and repeatedly query a given Web form with the intention of aggregating the resulting data. Such software can be used to span multiple Web forms across multiple Websites. Data extracted from the results of one Web form submission can be taken and applied as input to another Web form thus establishing continuity across the Deep Web in a way not possible with traditional web crawlers.[45]

Pages built on AJAX are among those causing problems to web crawlers. Google has proposed a format of AJAX calls that their bot can recognize and index.[46]

Visual vs programmatic crawlers

[edit]

There are a number of "visual web scraper/crawler" products available on the web which will crawl pages and structure data into columns and rows based on the users requirements. One of the main difference between a classic and a visual crawler is the level of programming ability required to set up a crawler. The latest generation of "visual scrapers" remove the majority of the programming skill needed to be able to program and start a crawl to scrape web data.

The visual scraping/crawling method relies on the user "teaching" a piece of crawler technology, which then follows patterns in semi-structured data sources. The dominant method for teaching a visual crawler is by highlighting data in a browser and training columns and rows. While the technology is not new, for example it was the basis of Needlebase which has been bought by Google (as part of a larger acquisition of ITA Labs[47]), there is continued growth and investment in this area by investors and end-users.[citation needed]

List of web crawlers

[edit]

The following is a list of published crawler architectures for general-purpose crawlers (excluding focused web crawlers), with a brief description that includes the names given to the different components and outstanding features:

Historical web crawlers

[edit]
  • WolfBot was a massively multi threaded crawler built in 2001 by Mani Singh a Civil Engineering graduate from the University of California at Davis.
  • World Wide Web Worm was a crawler used to build a simple index of document titles and URLs. The index could be searched by using the grep Unix command.
  • Yahoo! Slurp was the name of the Yahoo! Search crawler until Yahoo! contracted with Microsoft to use Bingbot instead.

In-house web crawlers

[edit]
  • Applebot is Apple's web crawler. It supports Siri and other products.[48]
  • Bingbot is the name of Microsoft's Bing webcrawler. It replaced Msnbot.
  • Baiduspider is Baidu's web crawler.
  • DuckDuckBot is DuckDuckGo's web crawler.
  • Googlebot is described in some detail, but the reference is only about an early version of its architecture, which was written in C++ and Python. The crawler was integrated with the indexing process, because text parsing was done for full-text indexing and also for URL extraction. There is a URL server that sends lists of URLs to be fetched by several crawling processes. During parsing, the URLs found were passed to a URL server that checked if the URL have been previously seen. If not, the URL was added to the queue of the URL server.
  • WebCrawler was used to build the first publicly available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.
  • WebFountain is a distributed, modular crawler similar to Mercator but written in C++.
  • Xenon is a web crawler used by government tax authorities to detect fraud.[49][50]

Commercial web crawlers

[edit]

The following web crawlers are available, for a price::

Open-source crawlers

[edit]
  • Apache Nutch is a highly extensible and scalable web crawler written in Java and released under an Apache License. It is based on Apache Hadoop and can be used with Apache Solr or Elasticsearch.
  • Grub was an open source distributed search crawler that Wikia Search used to crawl the web.
  • Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
  • ht://Dig includes a Web crawler in its indexing engine.
  • HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
  • Norconex Web Crawler is a highly extensible Web Crawler written in Java and released under an Apache License. It can be used with many repositories such as Apache Solr, Elasticsearch, Microsoft Azure Cognitive Search, Amazon CloudSearch and more.
  • mnoGoSearch is a crawler, indexer and a search engine written in C and licensed under the GPL (*NIX machines only)
  • Open Search Server is a search engine and web crawler software release under the GPL.
  • Scrapy, an open source webcrawler framework, written in python (licensed under BSD).
  • Seeks, a free distributed search engine (licensed under AGPL).
  • StormCrawler, a collection of resources for building low-latency, scalable web crawlers on Apache Storm (Apache License).
  • tkWWW Robot, a crawler based on the tkWWW web browser (licensed under GPL).
  • GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
  • YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).

See also

[edit]

References

[edit]
  1. ^ "Web Crawlers: Browsing the Web". Archived from the original on 6 December 2021.
  2. ^ Spetka, Scott. "The TkWWW Robot: Beyond Browsing". NCSA. Archived from the original on 3 September 2004. Retrieved 21 November 2010.
  3. ^ Kobayashi, M. & Takeda, K. (2000). "Information retrieval on the web". ACM Computing Surveys. 32 (2): 144–173. CiteSeerX 10.1.1.126.6094. doi:10.1145/358923.358934. S2CID 3710903.
  4. ^ See definition of scutter on FOAF Project's wiki Archived 13 December 2009 at the Wayback Machine
  5. ^ Masanès, Julien (15 February 2007). Web Archiving. Springer. p. 1. ISBN 978-3-54046332-0. Retrieved 24 April 2014.
  6. ^ Edwards, J.; McCurley, K. S.; and Tomlin, J. A. (2001). "An adaptive model for optimizing performance of an incremental web crawler". Proceedings of the 10th international conference on World Wide Web. pp. 106–113. CiteSeerX 10.1.1.1018.1506. doi:10.1145/371920.371960. ISBN 978-1581133486. S2CID 10316730. Archived from the original on 25 June 2014. Retrieved 25 January 2007.cite book: CS1 maint: multiple names: authors list (link)
  7. ^ Castillo, Carlos (2004). Effective Web Crawling (PhD thesis). University of Chile. Retrieved 3 August 2010.
  8. ^ Gulls, A.; A. Signori (2005). "The indexable web is more than 11.5 billion pages". Special interest tracks and posters of the 14th international conference on World Wide Web. ACM Press. pp. 902–903. doi:10.1145/1062745.1062789.
  9. ^ Lawrence, Steve; C. Lee Giles (8 July 1999). "Accessibility of information on the web". Nature. 400 (6740): 107–9. Bibcode:1999Natur.400..107L. doi:10.1038/21987. PMID 10428673. S2CID 4347646.
  10. ^ Cho, J.; Garcia-Molina, H.; Page, L. (April 1998). "Efficient Crawling Through URL Ordering". Seventh International World-Wide Web Conference. Brisbane, Australia. doi:10.1142/3725. ISBN 978-981-02-3400-3. Retrieved 23 March 2009.
  11. ^ Cho, Junghoo, "Crawling the Web: Discovery and Maintenance of a Large-Scale Web Data", PhD dissertation, Department of Computer Science, Stanford University, November 2001.
  12. ^ Najork, Marc and Janet L. Wiener. "Breadth-first crawling yields high-quality pages". Archived 24 December 2017 at the Wayback Machine In: Proceedings of the Tenth Conference on World Wide Web, pages 114–118, Hong Kong, May 2001. Elsevier Science.
  13. ^ Abiteboul, Serge; Mihai Preda; Gregory Cobena (2003). "Adaptive on-line page importance computation". Proceedings of the 12th international conference on World Wide Web. Budapest, Hungary: ACM. pp. 280–290. doi:10.1145/775152.775192. ISBN 1-58113-680-3. Retrieved 22 March 2009.
  14. ^ Boldi, Paolo; Bruno Codenotti; Massimo Santini; Sebastiano Vigna (2004). "UbiCrawler: a scalable fully distributed Web crawler" (PDF). Software: Practice and Experience. 34 (8): 711–726. CiteSeerX 10.1.1.2.5538. doi:10.1002/spe.587. S2CID 325714. Archived from the original (PDF) on 20 March 2009. Retrieved 23 March 2009.
  15. ^ Boldi, Paolo; Massimo Santini; Sebastiano Vigna (2004). "Do Your Worst to Make the Best: Paradoxical Effects in PageRank Incremental Computations" (PDF). Algorithms and Models for the Web-Graph. Lecture Notes in Computer Science. Vol. 3243. pp. 168–180. doi:10.1007/978-3-540-30216-2_14. ISBN 978-3-540-23427-2. Archived from the original (PDF) on 1 October 2005. Retrieved 23 March 2009.
  16. ^ Baeza-Yates, R.; Castillo, C.; Marin, M. and Rodriguez, A. (2005). "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering." In: Proceedings of the Industrial and Practical Experience track of the 14th conference on World Wide Web, pages 864–872, Chiba, Japan. ACM Press.
  17. ^ Shervin Daneshpajouh, Mojtaba Mohammadi Nasiri, Mohammad Ghodsi, A Fast Community Based Algorithm for Generating Crawler Seeds Set. In: Proceedings of 4th International Conference on Web Information Systems and Technologies (Webist-2008), Funchal, Portugal, May 2008.
  18. ^ Pant, Gautam; Srinivasan, Padmini; Menczer, Filippo (2004). "Crawling the Web" (PDF). In Levene, Mark; Poulovassilis, Alexandra (eds.). Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Springer. pp. 153–178. ISBN 978-3-540-40676-1. Archived from the original (PDF) on 20 March 2009. Retrieved 9 May 2006.
  19. ^ Cothey, Viv (2004). "Web-crawling reliability" (PDF). Journal of the American Society for Information Science and Technology. 55 (14): 1228–1238. CiteSeerX 10.1.1.117.185. doi:10.1002/asi.20078.
  20. ^ Menczer, F. (1997). ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery Archived 21 December 2012 at the Wayback Machine. In D. Fisher, ed., Machine Learning: Proceedings of the 14th International Conference (ICML97). Morgan Kaufmann
  21. ^ Menczer, F. and Belew, R.K. (1998). Adaptive Information Agents in Distributed Textual Environments Archived 21 December 2012 at the Wayback Machine. In K. Sycara and M. Wooldridge (eds.) Proc. 2nd Intl. Conf. on Autonomous Agents (Agents '98). ACM Press
  22. ^ Chakrabarti, Soumen; Van Den Berg, Martin; Dom, Byron (1999). "Focused crawling: A new approach to topic-specific Web resource discovery" (PDF). Computer Networks. 31 (11–16): 1623–1640. doi:10.1016/s1389-1286(99)00052-3. Archived from the original (PDF) on 17 March 2004.
  23. ^ Pinkerton, B. (1994). Finding what people want: Experiences with the WebCrawler. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland.
  24. ^ Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). Focused crawling using context graphs. In Proceedings of 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt.
  25. ^ Wu, Jian; Teregowda, Pradeep; Khabsa, Madian; Carman, Stephen; Jordan, Douglas; San Pedro Wandelmer, Jose; Lu, Xin; Mitra, Prasenjit; Giles, C. Lee (2012). "Web crawler middleware for search engine digital libraries". Proceedings of the twelfth international workshop on Web information and data management - WIDM '12. p. 57. doi:10.1145/2389936.2389949. ISBN 9781450317207. S2CID 18513666.
  26. ^ Wu, Jian; Teregowda, Pradeep; Ramírez, Juan Pablo Fernández; Mitra, Prasenjit; Zheng, Shuyi; Giles, C. Lee (2012). "The evolution of a crawling strategy for an academic document search engine". Proceedings of the 3rd Annual ACM Web Science Conference on - Web Sci '12. pp. 340–343. doi:10.1145/2380718.2380762. ISBN 9781450312288. S2CID 16718130.
  27. ^ Dong, Hai; Hussain, Farookh Khadeer; Chang, Elizabeth (2009). "State of the Art in Semantic Focused Crawlers". Computational Science and Its Applications – ICCSA 2009. Lecture Notes in Computer Science. Vol. 5593. pp. 910–924. doi:10.1007/978-3-642-02457-3_74. hdl:20.500.11937/48288. ISBN 978-3-642-02456-6.
  28. ^ Dong, Hai; Hussain, Farookh Khadeer (2013). "SOF: A semi-supervised ontology-learning-based focused crawler". Concurrency and Computation: Practice and Experience. 25 (12): 1755–1770. doi:10.1002/cpe.2980. S2CID 205690364.
  29. ^ Junghoo Cho; Hector Garcia-Molina (2000). "Synchronizing a database to improve freshness" (PDF). Proceedings of the 2000 ACM SIGMOD international conference on Management of data. Dallas, Texas, United States: ACM. pp. 117–128. doi:10.1145/342009.335391. ISBN 1-58113-217-4. Retrieved 23 March 2009.
  30. ^ a b E. G. Coffman Jr; Zhen Liu; Richard R. Weber (1998). "Optimal robot scheduling for Web search engines". Journal of Scheduling. 1 (1): 15–29. CiteSeerX 10.1.1.36.6087. doi:10.1002/(SICI)1099-1425(199806)1:1<15::AID-JOS3>3.0.CO;2-K.
  31. ^ a b Cho, Junghoo; Garcia-Molina, Hector (2003). "Effective page refresh policies for Web crawlers". ACM Transactions on Database Systems. 28 (4): 390–426. doi:10.1145/958942.958945. S2CID 147958.
  32. ^ a b Junghoo Cho; Hector Garcia-Molina (2003). "Estimating frequency of change". ACM Transactions on Internet Technology. 3 (3): 256–290. CiteSeerX 10.1.1.59.5877. doi:10.1145/857166.857170. S2CID 9362566.
  33. ^ Ipeirotis, P., Ntoulas, A., Cho, J., Gravano, L. (2005) Modeling and managing content changes in text databases Archived 5 September 2005 at the Wayback Machine. In Proceedings of the 21st IEEE International Conference on Data Engineering, pages 606-617, April 2005, Tokyo.
  34. ^ Koster, M. (1995). Robots in the web: threat or treat? ConneXions, 9(4).
  35. ^ Koster, M. (1996). A standard for robot exclusion Archived 7 November 2007 at the Wayback Machine.
  36. ^ Koster, M. (1993). Guidelines for robots writers Archived 22 April 2005 at the Wayback Machine.
  37. ^ Baeza-Yates, R. and Castillo, C. (2002). Balancing volume, quality and freshness in Web crawling. In Soft Computing Systems – Design, Management and Applications, pages 565–572, Santiago, Chile. IOS Press Amsterdam.
  38. ^ Heydon, Allan; Najork, Marc (26 June 1999). "Mercator: A Scalable, Extensible Web Crawler" (PDF). Archived from the original (PDF) on 19 February 2006. Retrieved 22 March 2009. cite journal: Cite journal requires |journal= (help)
  39. ^ Dill, S.; Kumar, R.; Mccurley, K. S.; Rajagopalan, S.; Sivakumar, D.; Tomkins, A. (2002). "Self-similarity in the web" (PDF). ACM Transactions on Internet Technology. 2 (3): 205–223. doi:10.1145/572326.572328. S2CID 6416041.
  40. ^ M. Thelwall; D. Stuart (2006). "Web crawling ethics revisited: Cost, privacy and denial of service". Journal of the American Society for Information Science and Technology. 57 (13): 1771–1779. doi:10.1002/asi.20388.
  41. ^ Brin, Sergey; Page, Lawrence (1998). "The anatomy of a large-scale hypertextual Web search engine". Computer Networks and ISDN Systems. 30 (1–7): 107–117. doi:10.1016/s0169-7552(98)00110-x. S2CID 7587743.
  42. ^ Shkapenyuk, V. and Suel, T. (2002). Design and implementation of a high performance distributed web crawler. In Proceedings of the 18th International Conference on Data Engineering (ICDE), pages 357-368, San Jose, California. IEEE CS Press.
  43. ^ Shestakov, Denis (2008). Search Interfaces on the Web: Querying and Characterizing Archived 6 July 2014 at the Wayback Machine. TUCS Doctoral Dissertations 104, University of Turku
  44. ^ Michael L Nelson; Herbert Van de Sompel; Xiaoming Liu; Terry L Harrison; Nathan McFarland (24 March 2005). "mod_oai: An Apache Module for Metadata Harvesting": cs/0503069. arXiv:cs/0503069. Bibcode:2005cs........3069N. cite journal: Cite journal requires |journal= (help)
  45. ^ Shestakov, Denis; Bhowmick, Sourav S.; Lim, Ee-Peng (2005). "DEQUE: Querying the Deep Web" (PDF). Data & Knowledge Engineering. 52 (3): 273–311. doi:10.1016/s0169-023x(04)00107-7.
  46. ^ "AJAX crawling: Guide for webmasters and developers". Retrieved 17 March 2013.
  47. ^ ITA Labs "ITA Labs Acquisition" Archived 18 March 2014 at the Wayback Machine 20 April 2011 1:28 AM
  48. ^ "About Applebot". Apple Inc. Retrieved 18 October 2021.
  49. ^ Norton, Quinn (25 January 2007). "Tax takers send in the spiders". Business. Wired. Archived from the original on 22 December 2016. Retrieved 13 October 2017.
  50. ^ "Xenon web crawling initiative: privacy impact assessment (PIA) summary". Ottawa: Government of Canada. 11 April 2017. Archived from the original on 25 September 2017. Retrieved 13 October 2017.

Further reading

[edit]

 

Frequently Asked Questions

Content marketing and SEO work hand-in-hand. High-quality, relevant content attracts readers, earns backlinks, and encourages longer time spent on your site'factors that all contribute to better search engine rankings. Engaging, well-optimized content also improves user experience and helps convert visitors into customers.

A content agency in Sydney focuses on creating high-quality, SEO-optimized content that resonates with your target audience. Their services typically include blog writing, website copy, video production, and other forms of media designed to attract traffic and improve search rankings.

SEO consulting involves analyzing a website's current performance, identifying areas for improvement, and recommending strategies to boost search rankings. Consultants provide insights on keyword selection, on-page and technical optimization, content development, and link-building tactics.