Web crawler developer

Düsseldorf - On-site
This project is archived and unfortunately no longer active.
You can find open projects here in our project marketplace.

Description

We are currently looking for a consultant to develop a web crawler (SharePoint / OneNote environment) for our fast-moving consumer goods client.

Start: 02.06.2020
End: 12.07.2020
Duration: 6 weeks
Location: Düsseldorf / remote
Volume: Fulltime

Background:
The purpose of the project is to build a crawler that periodically, or on demand, fetches content from certain websites on the client's SharePoint server and dumps a structured image of the fetched content to a given server location, such that this dumped image can be used as a mirror of the original content.
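As a rough illustration of the fetch step (not part of the brief itself), the sketch below builds the standard SharePoint REST endpoint for enumerating a document library. The site URL and library title are placeholder assumptions; in the real crawler this URL would be requested periodically or on demand with an authenticated HTTP client and an `Accept: application/json` header, and the response dumped to the mirror location.

```python
# Minimal sketch, assuming a SharePoint site URL and a library title.
# The endpoint is the standard SharePoint REST list-items call; the
# OData query expands the File entity so file metadata comes back too.

def list_items_url(site_url: str, library_title: str) -> str:
    """Return the SharePoint REST endpoint that enumerates the items
    of a list/library, including basic file metadata fields."""
    base = site_url.rstrip("/")
    return (f"{base}/_api/web/lists/getbytitle('{library_title}')/items"
            "?$expand=File&$select=File/Name,File/ServerRelativeUrl,"
            "File/TimeLastModified")

# Placeholder site and library for illustration only.
url = list_items_url("https://contoso.sharepoint.com/sites/demo", "Documents")
print(url)
```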

Tasks:
- Crawling file lists (Site Libraries and lists/Documents). Most of the SharePoint websites are simply lists of files, such as MS Office Word, Excel, PDF, Access, and so on.
- Download the files
- Provide simple metadata about them:
> those asserted in SharePoint (author, date, etc.);
> most importantly, the URLs of the files, such that the provided URLs uniquely identify the crawled content (as far as possible; this can be discussed) and can be used to access/download the files from their origin SharePoint site or to explore/view them in a standard web browser (e.g. the URL of the main SharePoint site alone would be useless).
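To make the metadata requirement concrete, the sketch below extracts author, modification date, and a unique, browser-usable URL from one SharePoint list item. The item shape loosely follows the SharePoint REST response format, but the sample payload itself is invented for illustration.

```python
from urllib.parse import quote

def extract_metadata(item: dict, site_root: str) -> dict:
    """Pull the required metadata out of one SharePoint list item.

    The server-relative URL uniquely identifies the file within the
    tenant; prefixing the host yields a link usable in a browser.
    """
    f = item["File"]
    return {
        "name": f["Name"],
        "author": item.get("Author", {}).get("Title"),
        "modified": f["TimeLastModified"],
        # Percent-encode so spaces etc. survive in a browser URL.
        "url": site_root.rstrip("/") + quote(f["ServerRelativeUrl"]),
    }

# Invented sample item, shaped like a simplified REST list entry.
sample = {
    "Author": {"Title": "J. Doe"},
    "File": {
        "Name": "report.xlsx",
        "ServerRelativeUrl": "/sites/demo/Shared Documents/report.xlsx",
        "TimeLastModified": "2020-05-04T10:15:00Z",
    },
}
meta = extract_metadata(sample, "https://contoso.sharepoint.com")
print(meta["url"])
```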
- Crawling pages of SharePoint team and communication sites. If a SharePoint site contains rich content pages, the crawler shall be able to:
> read SharePoint communication/team websites;
> identify logical segments such as web pages and their related content (including attachments to them), and convert and dump them in HTML format (so that the dump can be used as a mirror of the original content).
- All identified segments shall be assigned a set of metadata, of which a working URL is mandatory.
- Reading OneNote files and generating HTML pages (OneNote files are a special case that requires further processing):
> Logical segments, such as sections/subsections, in each notebook shall be recognized and then converted into HTML file(s);
> Each logical segment shall carry the metadata available in OneNote, e.g. title, author, and date, as well as a URL that can be used in a standard browser to access the identified page/segment directly;
> Attachments in the identified segments, such as Excel, PDF, and so on, shall be downloaded and repackaged properly together with the main HTML files;
> The generated HTML files shall contain proper links to the above-mentioned attachments;
> The generated HTML file and its attachments shall be self-contained and complete in the sense that they mirror the content of their origin OneNote segment.
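One way to meet the self-containment requirement is sketched below: given HTML for a OneNote segment, rewrite remote attachment/resource references to local filenames and collect the original URLs for download. The HTML sample and the `attachments/` naming scheme are assumptions for illustration, not a prescribed design.

```python
import re

def localize_attachments(html: str) -> tuple[str, list[str]]:
    """Rewrite remote src=/data= references in a segment's HTML to
    local filenames so the dumped page is self-contained. Returns the
    rewritten HTML and the list of resource URLs that would still
    need to be downloaded next to it."""
    to_download: list[str] = []

    def repl(m: re.Match) -> str:
        to_download.append(m.group(2))
        # Hypothetical local layout: attachments/ next to the HTML file.
        return f'{m.group(1)}"attachments/file{len(to_download)}"'

    # Matches src="..."/data="..." attributes pointing at remote resources.
    rewritten = re.sub(r'\b(src=|data=)"(https?://[^"]+)"', repl, html)
    return rewritten, to_download

# Invented sample segment with an embedded PDF and an image.
page = ('<p>Budget</p>'
        '<object data="https://example.com/res/1/content" type="application/pdf"></object>'
        '<img src="https://example.com/res/2/content"/>')
new_html, urls = localize_attachments(page)
print(new_html)
```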

Skills:
- Experience in developing web crawlers
- Experience with OneNote
- Experience with SharePoint
- Experience with HTML
- Experience with Metadata

Michael Bailey International is acting as an Employment Business in relation to this vacancy.
Start
06/2020
Duration
6 weeks
From
Michael Bailey Associates
Posted
15.05.2020
Project ID:
1928602
Contract type
Freelance