Web crawler developer

Düsseldorf - On-site
This project is archived and unfortunately no longer active.
You can find open projects here in our project marketplace.

Description

We are currently looking for a consultant to develop a web crawler (SharePoint / OneNote environment) for our fast-moving consumer goods client.

Start: 02.06.2020
End: 12.07.2020
Duration: 6 weeks
Location: Düsseldorf / remote
Volume: Fulltime

Background:
The purpose of the project is to build a crawler that periodically, or on demand, fetches content from certain websites on the client's SharePoint server and dumps a structured image of the fetched content to a given server location, such that this dumped image can be used as a mirror of the original content.
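As a rough illustration of the fetch step (not part of the brief itself), the sketch below builds the standard SharePoint REST endpoint for enumerating a document library. The site URL and library title are placeholder assumptions; in the real crawler this URL would be requested periodically or on demand with an authenticated HTTP client and an `Accept: application/json` header, and the response dumped to the mirror location.

```python
# Minimal sketch, assuming a SharePoint site URL and a library title.
# The endpoint is the standard SharePoint REST list-items call; the
# OData query expands the File entity so file metadata comes back too.

def list_items_url(site_url: str, library_title: str) -> str:
    """Return the SharePoint REST endpoint that enumerates the items
    of a list/library, including basic file metadata fields."""
    base = site_url.rstrip("/")
    return (f"{base}/_api/web/lists/getbytitle('{library_title}')/items"
            "?$expand=File&$select=File/Name,File/ServerRelativeUrl,"
            "File/TimeLastModified")

# Placeholder site and library for illustration only.
url = list_items_url("https://contoso.sharepoint.com/sites/demo", "Documents")
print(url)
```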

Tasks:
- Crawling file lists (Site Libraries and lists/Documents). Most of the SharePoint websites are simply lists of files, such as MS Office Word, Excel, PDF, Access, and so on.
- Download the files
- Provide simple metadata about them:
> those asserted in SharePoint (author, date, etc.);
> most importantly, the URLs of the files, such that the provided URLs uniquely identify the crawled content (as far as possible; this can be discussed) and can be used to access/download the files from their origin SharePoint site or to explore/view them in a standard web browser (e.g. the URL of the main SharePoint site alone would be useless).
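To make the metadata requirement concrete, the sketch below extracts author, modification date, and a unique, browser-usable URL from one SharePoint list item. The item shape loosely follows the SharePoint REST response format, but the sample payload itself is invented for illustration.

```python
from urllib.parse import quote

def extract_metadata(item: dict, site_root: str) -> dict:
    """Pull the required metadata out of one SharePoint list item.

    The server-relative URL uniquely identifies the file within the
    tenant; prefixing the host yields a link usable in a browser.
    """
    f = item["File"]
    return {
        "name": f["Name"],
        "author": item.get("Author", {}).get("Title"),
        "modified": f["TimeLastModified"],
        # Percent-encode so spaces etc. survive in a browser URL.
        "url": site_root.rstrip("/") + quote(f["ServerRelativeUrl"]),
    }

# Invented sample item, shaped like a simplified REST list entry.
sample = {
    "Author": {"Title": "J. Doe"},
    "File": {
        "Name": "report.xlsx",
        "ServerRelativeUrl": "/sites/demo/Shared Documents/report.xlsx",
        "TimeLastModified": "2020-05-04T10:15:00Z",
    },
}
meta = extract_metadata(sample, "https://contoso.sharepoint.com")
print(meta["url"])
```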
- Crawling pages of SharePoint team and communication sites. If a SharePoint site contains rich content pages, the crawler shall be able to:
> read SharePoint communication/team websites;
> identify logical segments such as web pages and their related content (including attachments to them), and convert and dump them in HTML format (so that the dump can be used as a mirror of the original content).
- All identified segments shall be assigned a set of metadata, of which a working URL is mandatory.
- Reading OneNote files and generating HTML pages (OneNote files are a special case that requires further processing):
> Logical segments, such as sections/subsections, in each notebook shall be recognized and then converted into HTML file(s);
> Each logical segment shall carry the metadata available in OneNote, e.g. title, author, and date, as well as a URL that can be used in a standard browser to access the identified page/segment directly;
> Attachments in the identified segments, such as Excel, PDF, and so on, shall be downloaded and repackaged properly together with the main HTML files;
> The generated HTML files shall contain proper links to the above-mentioned attachments;
> The generated HTML file and its attachments shall be self-contained and complete in the sense that they mirror the content of their origin OneNote segment.
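One way to meet the self-containment requirement is sketched below: given HTML for a OneNote segment, rewrite remote attachment/resource references to local filenames and collect the original URLs for download. The HTML sample and the `attachments/` naming scheme are assumptions for illustration, not a prescribed design.

```python
import re

def localize_attachments(html: str) -> tuple[str, list[str]]:
    """Rewrite remote src=/data= references in a segment's HTML to
    local filenames so the dumped page is self-contained. Returns the
    rewritten HTML and the list of resource URLs that would still
    need to be downloaded next to it."""
    to_download: list[str] = []

    def repl(m: re.Match) -> str:
        to_download.append(m.group(2))
        # Hypothetical local layout: attachments/ next to the HTML file.
        return f'{m.group(1)}"attachments/file{len(to_download)}"'

    # Matches src="..."/data="..." attributes pointing at remote resources.
    rewritten = re.sub(r'\b(src=|data=)"(https?://[^"]+)"', repl, html)
    return rewritten, to_download

# Invented sample segment with an embedded PDF and an image.
page = ('<p>Budget</p>'
        '<object data="https://example.com/res/1/content" type="application/pdf"></object>'
        '<img src="https://example.com/res/2/content"/>')
new_html, urls = localize_attachments(page)
print(new_html)
```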

Skills:
- Experience in developing web crawlers
- Experience with OneNote
- Experience with SharePoint
- Experience with HTML
- Experience with Metadata

Michael Bailey International is acting as an Employment Business in relation to this vacancy.
Start
06/2020
Duration
6 weeks
From
Michael Bailey Associates
Posted
15.05.2020
Project ID:
1928602
Contract type
Freelance