StrepHit

= Welcome to StrepHit's documentation! =

StrepHit is an intelligent reading agent that understands text and translates it into Wikidata statements.

More specifically, it is a Natural Language Processing pipeline that extracts facts from text and produces Wikidata statements with references. Its final objective is to enhance the data quality of Wikidata by suggesting references to validate statements.

StrepHit was born in January 2016 and is funded by a Wikimedia Foundation Individual Engagement Grant (IEG).

This page contains the technical documentation.

= strephit =
 * strephit package


 * Subpackages


 * strephit.annotation package


 * Submodules


 * strephit.annotation.cli module


 * strephit.annotation.create_crowdflower_input module


 * strephit.annotation.generate_cml module


 * strephit.annotation.post_job module


 * strephit.annotation.pull_results module


 * Module contents


 * strephit.commons package


 * Submodules


 * strephit.commons.cache module


 * strephit.commons.cli module


 * strephit.commons.datetime module


 * strephit.commons.download module


 * strephit.commons.entity_linking module


 * strephit.commons.io module


 * strephit.commons.logging module


 * strephit.commons.parallel module


 * strephit.commons.pos_tag module


 * strephit.commons.secret_keys module


 * strephit.commons.secrets module


 * strephit.commons.split_sentences module


 * strephit.commons.text module


 * strephit.commons.tokenize module


 * strephit.commons.wikidata module


 * Module contents


 * strephit.corpus_analysis package


 * Submodules


 * strephit.corpus_analysis.cli module


 * strephit.corpus_analysis.compute_lu_distribution module


 * strephit.corpus_analysis.extract_framenet_frames module


 * strephit.corpus_analysis.rank_verbs module


 * strephit.corpus_analysis.statistics module


 * strephit.corpus_analysis.test_pos_taggers module


 * Module contents


 * strephit.extraction package


 * Submodules


 * strephit.extraction.cli module


 * strephit.extraction.extract_sentences module


 * strephit.extraction.process_semistructured module


 * Module contents


 * strephit.web_sources_corpus package


 * Subpackages


 * Submodules


 * strephit.web_sources_corpus.archive_org module


 * strephit.web_sources_corpus.britishmuseum_org module


 * strephit.web_sources_corpus.cli module


 * strephit.web_sources_corpus.items module


 * strephit.web_sources_corpus.pipelines module


 * strephit.web_sources_corpus.preprocess_corpus module


 * strephit.web_sources_corpus.run_all module


 * strephit.web_sources_corpus.settings module


 * Module contents


 * Submodules


 * strephit.cli module


 * Module contents

= strephit package =

Subpackages

 * strephit.annotation package


 * Submodules


 * strephit.annotation.cli module


 * strephit.annotation.create_crowdflower_input module


 * strephit.annotation.generate_cml module


 * strephit.annotation.post_job module


 * strephit.annotation.pull_results module


 * Module contents


 * strephit.commons package


 * Submodules


 * strephit.commons.cache module


 * strephit.commons.cli module


 * strephit.commons.datetime module


 * strephit.commons.download module


 * strephit.commons.entity_linking module


 * strephit.commons.io module


 * strephit.commons.logging module


 * strephit.commons.parallel module


 * strephit.commons.pos_tag module


 * strephit.commons.secret_keys module


 * strephit.commons.secrets module


 * strephit.commons.split_sentences module


 * strephit.commons.text module


 * strephit.commons.tokenize module


 * strephit.commons.wikidata module


 * Module contents


 * strephit.corpus_analysis package


 * Submodules


 * strephit.corpus_analysis.cli module


 * strephit.corpus_analysis.compute_lu_distribution module


 * strephit.corpus_analysis.extract_framenet_frames module


 * strephit.corpus_analysis.rank_verbs module


 * strephit.corpus_analysis.statistics module


 * strephit.corpus_analysis.test_pos_taggers module


 * Module contents


 * strephit.extraction package


 * Submodules


 * strephit.extraction.cli module


 * strephit.extraction.extract_sentences module


 * strephit.extraction.process_semistructured module


 * Module contents


 * strephit.web_sources_corpus package


 * Subpackages


 * strephit.web_sources_corpus.spiders package


 * Submodules


 * strephit.web_sources_corpus.spiders.BaseSpider module


 * strephit.web_sources_corpus.spiders.academia_net module


 * strephit.web_sources_corpus.spiders.american_bio module


 * strephit.web_sources_corpus.spiders.australasian_bio module


 * strephit.web_sources_corpus.spiders.australian_dictionary_of_biography module


 * strephit.web_sources_corpus.spiders.bbc_co_uk module


 * strephit.web_sources_corpus.spiders.bio_english_lit module


 * strephit.web_sources_corpus.spiders.bishops module


 * strephit.web_sources_corpus.spiders.brown_edu module


 * strephit.web_sources_corpus.spiders.catholic_encyclopedia module


 * strephit.web_sources_corpus.spiders.cesar_org_uk module


 * strephit.web_sources_corpus.spiders.chinese_bio module


 * strephit.web_sources_corpus.spiders.christian_bio module


 * strephit.web_sources_corpus.spiders.cooperhewitt_org module


 * strephit.web_sources_corpus.spiders.design_and_art_australia_online module


 * strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org module


 * strephit.web_sources_corpus.spiders.dnb module


 * strephit.web_sources_corpus.spiders.dsi module


 * strephit.web_sources_corpus.spiders.english_artists module


 * strephit.web_sources_corpus.spiders.freethinkers module


 * strephit.web_sources_corpus.spiders.gameo_org module


 * strephit.web_sources_corpus.spiders.genealogics module


 * strephit.web_sources_corpus.spiders.greek_roman_bio_myth module


 * strephit.web_sources_corpus.spiders.indian_bio module


 * strephit.web_sources_corpus.spiders.irish_officers module


 * strephit.web_sources_corpus.spiders.medical_bio module


 * strephit.web_sources_corpus.spiders.men_at_the_bar module


 * strephit.web_sources_corpus.spiders.men_of_time module


 * strephit.web_sources_corpus.spiders.metal_archives_com module


 * strephit.web_sources_corpus.spiders.modern_english_bio module


 * strephit.web_sources_corpus.spiders.munksroll module


 * strephit.web_sources_corpus.spiders.museothyssen_org module


 * strephit.web_sources_corpus.spiders.musicians module


 * strephit.web_sources_corpus.spiders.national_bio module


 * strephit.web_sources_corpus.spiders.naval_bio module


 * strephit.web_sources_corpus.spiders.newulsterbiography_co_uk module


 * strephit.web_sources_corpus.spiders.nndb_com module


 * strephit.web_sources_corpus.spiders.parliament_uk module


 * strephit.web_sources_corpus.spiders.portraits_and_sketches module


 * strephit.web_sources_corpus.spiders.rkd_nl module


 * strephit.web_sources_corpus.spiders.royalsociety_org module


 * strephit.web_sources_corpus.spiders.sculpture_uk module


 * strephit.web_sources_corpus.spiders.structurae_net module


 * strephit.web_sources_corpus.spiders.vocab_getty_edu module


 * strephit.web_sources_corpus.spiders.wga_hu module


 * strephit.web_sources_corpus.spiders.who_is_who_america module


 * strephit.web_sources_corpus.spiders.who_is_who_in_china module


 * strephit.web_sources_corpus.spiders.yba_llgc_org_uk module


 * Module contents


 * Submodules


 * strephit.web_sources_corpus.archive_org module


 * strephit.web_sources_corpus.britishmuseum_org module


 * strephit.web_sources_corpus.cli module


 * strephit.web_sources_corpus.items module


 * strephit.web_sources_corpus.pipelines module


 * strephit.web_sources_corpus.preprocess_corpus module


 * strephit.web_sources_corpus.run_all module


 * strephit.web_sources_corpus.settings module


 * Module contents

Module contents
= strephit.annotation package =

strephit.annotation.create_crowdflower_input module
strephit.annotation.create_crowdflower_input.prepare_crowdflower_input(sentences, frame_data)

strephit.annotation.create_crowdflower_input.write_input_spreadsheet(data_units, outfile)

strephit.annotation.generate_cml module
strephit.annotation.generate_cml.generate_crowdflower_interface_template(input_csv, output_html)


 * Generate the CrowdFlower interface template based on the input data spreadsheet


 * Parameters:
 * input_csv (file) -- CSV file with the input data


 * output_html (file) -- File in which to write the output


 * Returns:
 * 0 on success

strephit.annotation.post_job module
strephit.annotation.post_job.activate_gold(job_id)


 * Activate gold units in the given job. Corresponds to the 'Convert Uploaded Test Questions' UI button.


 * Parameters:
 * job_id (str) -- job ID registered in CrowdFlower


 * Returns:
 * True on success


 * Return type:
 * boolean

strephit.annotation.post_job.config_job(job_id)


 * Set up a given CrowdFlower job with default settings. See the JOB_SETTINGS constant.


 * Parameters:
 * job_id (str) -- job ID registered in CrowdFlower


 * Returns:
 * the uploaded job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message


 * Return type:
 * dict

strephit.annotation.post_job.create_job(title, instructions, cml, custom_js)


 * Create an empty CrowdFlower job with the specified title and instructions. Raises any HTTP error that may occur.


 * Parameters:
 * title (str) -- plain text title


 * instructions (str) -- instructions, can contain HTML


 * cml (str) -- worker interface CML template. See https://success.crowdflower.com/hc/en-us/articles/202817989-CML-CrowdFlower-Markup-Language-Overview


 * custom_js (str) -- JavaScript code to be injected into the job


 * Returns:
 * the created job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message


 * Return type:
 * dict

strephit.annotation.post_job.tag_job(job_id, tags)


 * Tag a given job.


 * Parameters:
 * job_id (str) -- job ID registered in CrowdFlower


 * tags (list) -- list of tags


 * Returns:
 * True on success


 * Return type:
 * boolean

strephit.annotation.post_job.upload_units(job_id, csv_data)


 * Upload the job data units to the given job. Raises any HTTP error that may occur.


 * Parameters:
 * job_id (str) -- job ID registered in CrowdFlower


 * csv_data (file) -- file handle pointing to the data units CSV


 * Returns:
 * the uploaded job response object, as per https://success.crowdflower.com/hc/en-us/articles/201856229-CrowdFlower-API-API-Responses-and-Messaging#job_response on success, or an error message


 * Return type:
 * dict

strephit.annotation.pull_results module
strephit.annotation.pull_results.download_full_report(job_id)


 * Download the full CSV report of the given job (see https://success.crowdflower.com/hc/en-us/articles/202703075-Guide-to-Reports-Page-and-Settings-Page#full_report). Raises any HTTP error that may occur.


 * Parameters:
 * job_id (str) -- job ID registered in CrowdFlower

strephit.annotation.pull_results.get_latest_job_id


 * Get the ID of the most recent job.


 * Returns:
 * the latest job ID


 * Return type:
 * str

Module contents
= strephit.commons package =

strephit.commons.cache module
strephit.commons.cache.cached(function)


 * Decorator to cache function results based on its arguments

strephit.commons.cache.get(key, default=None)

strephit.commons.cache.set(key, value, overwrite=True)
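
To illustrate the idea behind a decorator such as cached, here is a minimal in-memory memoizing decorator. It is only a sketch and not StrepHit's implementation: the store `_store`, the way the key is built, and the names `memoize` and `slow_square` are assumptions made for this example.

```python
import functools

_store = {}  # hypothetical in-memory store; the real module may persist data elsewhere


def memoize(function):
    """Cache the results of `function`, keyed by its name and arguments."""
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        # Arguments must be hashable for this simple key to work
        key = (function.__name__, args, tuple(sorted(kwargs.items())))
        if key not in _store:
            _store[key] = function(*args, **kwargs)
        return _store[key]
    return wrapper


calls = []


@memoize
def slow_square(x):
    calls.append(x)  # record every real invocation
    return x * x


first = slow_square(4)
second = slow_square(4)  # served from the store, the body is not run again
```

The second call returns the stored value, so `calls` records a single invocation.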

strephit.commons.datetime module
strephit.commons.datetime.parse(string)


 * Try to parse a date expressed in natural language.


 * Returns:
 * dictionary with year, month, day

strephit.commons.entity_linking module
strephit.commons.entity_linking.extract_entities(response_json)


 * Extract the list of entities from the Dandelion Entity Extraction API JSON response.


 * Parameters:
 * response_json (dict) -- JSON response returned by Dandelion


 * Returns:
 * The extracted entities, with the surface form, start and end indices, URI, and ontology types


 * Return type:
 * list

strephit.commons.io module
strephit.commons.io.dump_corpus(corpus, dump_file_handle)


 * Dump a loaded corpus to a file with one JSON object per line.

strephit.commons.io.get_and_cache(url, use_cache=True, **kwargs)


 * Perform an HTTP GET request to the given URL and optionally cache the result in the file system. The cached content will be used in subsequent requests. Raises all HTTP errors.


 * Parameters:
 * url -- URL of the page to retrieve


 * use_cache -- Whether to use cache


 * **kwargs -- keyword arguments to pass to *requests.get*


 * Returns:
 * The content of the page at the given URL, as unicode

strephit.commons.io.load_corpus(location, document_key, text_only=False)


 * Load an input corpus from a directory with scraped items, in a memory-efficient way. Each input file must contain one JSON object per line.


 * Parameters:
 * document_key (str) -- a scraped item dictionary key holding textual documents

strephit.commons.io.load_dumped_corpus(dump_file_handle, document_key, text_only=False)


 * Load a previously dumped corpus file, in a memory-efficient way.
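
The "one JSON object per line" layout used by dump_corpus and the loaders can be sketched with the standard library alone. The helper names `dump_json_lines` and `load_json_lines`, and the choice to skip items that lack the document key, are assumptions made for this example and not the library's actual behavior.

```python
import io
import json


def dump_json_lines(items, handle):
    """Write each item as one JSON object per line."""
    for item in items:
        handle.write(json.dumps(item) + '\n')


def load_json_lines(handle, document_key, text_only=False):
    """Lazily yield items (or only their text) that contain `document_key`."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        item = json.loads(line)
        if document_key not in item:
            continue  # assumption: items without the key are skipped
        yield item[document_key] if text_only else item


buffer = io.StringIO()
dump_json_lines([{'name': 'Ada', 'bio': 'Born in London.'}, {'name': 'Bob'}], buffer)
buffer.seek(0)
texts = list(load_json_lines(buffer, 'bio', text_only=True))
```

Because the loader is a generator, only one line is held in memory at a time, which is what makes this layout suitable for large corpora.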

strephit.commons.io.load_scraped_items(location)


 * Loads all the items from a directory or file.


 * Parameters:
 * location -- Location of the corpus. If it is a directory, all files with the jsonlines extension will be loaded. If it is a file, it can be either a jsonlines file or a tar compressed file.

strephit.commons.logging module
strephit.commons.logging.log_request_data(http_response, logger)


 * Send a debug log message with basic information of the HTTP request that was sent for the given HTTP response.


 * Parameters:
 * http_response (requests.models.Response) -- HTTP response object

strephit.commons.logging.setLogLevel(module, level)


 * Sets the log level used to log messages from the given module

strephit.commons.logging.setup

strephit.commons.parallel module
strephit.commons.parallel.execute(processes=0, *specs)


 * Execute the given functions in parallel


 * Parameters:
 * processes -- Number of functions to execute at the same time


 * specs -- a sequence of functions, each followed by its arguments (arguments as a tuple or list)


 * Returns:
 * the results that the functions returned


 * Return type:
 * list

strephit.commons.parallel.map(function, iterable, processes=0, flatten=False, raise_exc=True)


 * Applies the given function to each element of the iterable in parallel. *None* values are not allowed in the iterable nor as return values; they will simply be discarded. Can be "safely" stopped with a keyboard interrupt.


 * Parameters:
 * function -- the function used to transform the elements of the iterable


 * processes -- how many items to process in parallel. Use zero or a negative number to use all the available processors. No additional processes will be used if the value is 1.


 * flatten -- If the mapping function returns an iterable, flatten the resulting iterables into a single one.


 * raise_exc -- Only when *processes* equals 1, controls whether to propagate the exceptions raised by the mapping function to the caller or simply to log them and carry on with the computation. When *processes* is different from 1, this parameter is not used.


 * Returns:
 * iterable with the results. Order is not guaranteed to be preserved
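
The semantics described above (discarding *None* values and optionally flattening the results) can be sketched as follows. This example uses a thread pool from the standard library rather than separate processes, and the names `parallel_map`, `workers` and `halves` are hypothetical; it is not the StrepHit implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(function, iterable, workers=4, flatten=False):
    """Apply `function` to every non-None item, dropping None results."""
    items = [item for item in iterable if item is not None]
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(function, items):
            if result is None:
                continue
            if flatten:
                results.extend(result)  # merge the returned iterable
            else:
                results.append(result)
    return results


def halves(n):
    # Return None for odd numbers so that they are discarded
    return [n // 2, n // 2] if n % 2 == 0 else None


flat = parallel_map(halves, [2, None, 3, 4], flatten=True)
nested = parallel_map(halves, [2, None, 3, 4])
```

With `flatten=True` the lists returned for 2 and 4 are merged into a single list; without it, each returned list is kept as one result.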

strephit.commons.pos_tag module
class strephit.commons.pos_tag.NLTKPosTagger(language)


 * Bases: "object"


 * part-of-speech tagger implemented using the NLTK library

tag_many(documents, tagset=None, **kwargs)


 * POS-Tag many documents.

tag_one(text, tagset, **kwargs)


 * POS-Tags the given text

class strephit.commons.pos_tag.TTPosTagger(language, tt_home=None, **kwargs)


 * Bases: "object"


 * part-of-speech tagger implemented using tree tagger and treetaggerwrapper

tag_many(items, document_key, pos_tag_key, batch_size=10000, **kwargs)


 * POS-Tags many text documents of the given items. Use this for massive text tagging


 * Parameters:
 * items -- Iterable of items to tag. Generator preferred


 * document_key -- Where to find the text to tag inside each item


 * pos_tag_key -- Where to put pos tagged text

tag_one(text, skip_unknown=True, **kwargs)


 * POS-Tags the given text, optionally skipping unknown lemmas

strephit.commons.pos_tag.get_pos_tagger(language, **kwargs)


 * Returns an initialized instance of the preferred POS tagger for the given language

strephit.commons.split_sentences module
class strephit.commons.split_sentences.PunktSentenceSplitter(language)


 * Bases: "object"


 * Sentence splitting splits a natural language text into sentences

model_path = 'tokenizers/punkt/%s.pickle'

split(text)


 * Split the given text into sentences. Leading and trailing spaces are stripped. Newline characters are first interpreted as sentence boundaries. Then, the sentence splitter is run.


 * Parameters:
 * text (str) -- Text to be split


 * Returns:
 * the sentences in the text


 * Return type:
 * generator
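
The two-stage behavior of split (newlines first, then the sentence splitter) can be sketched as below. The function `naive_sentence_splitter` is a hypothetical regex stand-in for a trained Punkt model, used here only so that the example runs without NLTK data.

```python
import re


def naive_sentence_splitter(text):
    # Stand-in for a trained model: split after '.', '!' or '?' followed by whitespace
    return re.split(r'(?<=[.!?])\s+', text)


def split(text):
    """Yield sentences: newlines are boundaries, then each line is split further."""
    for line in text.strip().split('\n'):
        line = line.strip()
        if not line:
            continue
        for sentence in naive_sentence_splitter(line):
            if sentence:
                yield sentence


sentences = list(split('  She was born in 1815. She died in 1852.\nA title line\n'))
```

Treating newlines as boundaries keeps headings and list entries, which often lack final punctuation, from being merged into the following sentence.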

split_tokens(tokens)


 * Splits the given text into sentences.


 * Parameters:
 * tokens (list) -- the tokens of the text


 * Returns:
 * the sentences in the text


 * Return type:
 * generator

supported_models = {'el': 'tokenizers/punkt/greek.pickle', 'fr': 'tokenizers/punkt/french.pickle', 'en': 'tokenizers/punkt/english.pickle', 'nl': 'tokenizers/punkt/dutch.pickle', 'pt': 'tokenizers/punkt/portuguese.pickle', 'no': 'tokenizers/punkt/norwegian.pickle', 'sv': 'tokenizers/punkt/swedish.pickle', 'de': 'tokenizers/punkt/german.pickle', 'tr': 'tokenizers/punkt/turkish.pickle', 'it': 'tokenizers/punkt/italian.pickle', 'da': 'tokenizers/punkt/danish.pickle', 'cz': 'tokenizers/punkt/czech.pickle', 'es': 'tokenizers/punkt/spanish.pickle', 'fi': 'tokenizers/punkt/finnish.pickle', 'et': 'tokenizers/punkt/estonian.pickle', 'sl': 'tokenizers/punkt/slovene.pickle', 'pl': 'tokenizers/punkt/polish.pickle'}

strephit.commons.text module
strephit.commons.text.clean(s, unicode=True)

strephit.commons.text.clean_extract(sel, path, path_type='xpath', limit_from=None, limit_to=None, sep='\n', unicode=True)

strephit.commons.text.extract_dict(response, keys_selector, values_selector, keys_extractor='.//text', values_extractor='.//text', **kwargs)


 * Extracts a dictionary given the selectors for the keys and the values. The selectors should point to the elements containing the text and not the text itself.


 * Parameters:
 * response -- The response object. The methods xpath or css are used


 * keys_selector -- Selector pointing to the elements containing the keys, starting with the type *xpath:* or *css:* followed by the selector itself


 * values_selector -- Selector pointing to the elements containing the values, starting with the type *xpath:* or *css:* followed by the selector itself


 * keys_extractor -- Selector used to actually extract the value of the key from each key element. xpath only


 * values_extractor -- Selector used to extract the actual value from each value element. xpath only


 * **kwargs -- Other parameters to pass to *clean_extract*. Nothing good will come by passing *path_type='css'*, you have been warned.

strephit.commons.text.fix_name(name)


 * Tries to normalize a name so that it can be searched with the Wikidata APIs


 * Parameters:
 * name -- The name to normalize


 * Returns:
 * a tuple with the normalized name and a list of honorifics

strephit.commons.text.parse_birth_death(string)


 * Parses birth and death dates from a string.


 * Parameters:
 * string -- String with the dates. Can be 'd. ' to indicate the year of death, 'b. ' to indicate the year of birth, - to indicate both birth and death year. Can optionally include 'c.' or 'ca.' before years to indicate approximation (ignored by the return value). If only the century is specified, birth is the first year of the century and death is the last one, e.g. '19th century' will be parsed as *('1801', '1900')*


 * Returns:
 * tuple *(birth_year, death_year)*, both strings as appearing in the original string. If the string cannot be parsed *(None, None)* is returned.
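
The rules listed above can be sketched with a few regular expressions. This is an illustrative reimplementation under stated assumptions, not the library's code: the exact patterns, the `YEAR` helper and the strictness of the matching are all guesses about inputs such as 'b. 1850' or 'c. 1850-1900'.

```python
import re

# A year with an optional 'c.' or 'ca.' prefix; only the digits are captured
YEAR = r'(?:c\.|ca\.)?\s*(\d{1,4})'


def parse_birth_death(string):
    """Return (birth_year, death_year) as strings, or (None, None)."""
    string = string.strip()

    century = re.match(r'^(\d{1,2})(?:st|nd|rd|th) century$', string)
    if century:
        n = int(century.group(1))
        # First and last year of the n-th century
        return str((n - 1) * 100 + 1), str(n * 100)

    both = re.match(r'^' + YEAR + r'\s*-\s*' + YEAR + r'$', string)
    if both:
        return both.group(1), both.group(2)

    born = re.match(r'^b\.\s*' + YEAR + r'$', string)
    if born:
        return born.group(1), None

    died = re.match(r'^d\.\s*' + YEAR + r'$', string)
    if died:
        return None, died.group(1)

    return None, None


examples = [parse_birth_death(s) for s in ('b. 1850', 'd. 1900', 'c. 1850-1900')]
```

The approximation markers are matched but not captured, which mirrors the statement that they are ignored by the return value.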

strephit.commons.text.split_at(content, delimiters)


 * Splits content using the given delimiters, following their order.



strephit.commons.text.strip_honorifics(name)


 * Removes honorifics from the name


 * Parameters:
 * name -- The name


 * Returns:
 * a tuple with the name without honorifics and a list of honorifics

strephit.commons.tokenize module
class strephit.commons.tokenize.Tokenizer(language)


 * Tokenization splits a natural language utterance into words (tokens)

tokenization_regexps = {'en': u'[^\\p{L}\\p{N}]+'}

tokenize(sentence)


 * Tokenize the given sentence. You can also pass a generic text, but you will lose the sentence segmentation.


 * Parameters:
 * sentence (str) -- a natural language sentence or text to be tokenized


 * Returns:
 * the list of tokens


 * Return type:
 * list
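
The English pattern above treats every run of characters that are neither letters nor digits as a separator. The `\p{L}` syntax is not supported by the standard `re` module, so the sketch below uses `[\W_]+` as a close approximation; the name `SEPARATORS` and the filtering of empty tokens are assumptions made for this example.

```python
import re

# Approximation of [^\p{L}\p{N}]+ with the standard library: \W matches any
# character that is not a word character (letter, digit or underscore), so
# adding '_' gives "runs of characters that are neither letters nor digits".
SEPARATORS = re.compile(r'[\W_]+', re.UNICODE)


def tokenize(sentence):
    """Split on separator runs and drop the empty strings at the edges."""
    return [token for token in SEPARATORS.split(sentence) if token]


tokens = tokenize("Ada Lovelace (1815-1852) was born in London.")
```

Punctuation and whitespace are discarded altogether, so a date range such as 1815-1852 becomes two separate tokens.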

strephit.commons.wikidata module
strephit.commons.wikidata.call_api(action, cache=True, **kwargs)


 * Invoke the given method of the Wikidata API with the given parameters

strephit.commons.wikidata.finalize_statement(subject, property, value, language, url=None, resolve_property=True, resolve_value=True, **kwargs)


 * Given the components of a statement, convert it into a quick statement.


 * Parameters:
 * subject -- Subject of the statement (its Wikidata ID)


 * property -- Property of the statement


 * value -- Value of the statement (to be resolved)


 * language -- Language used to resolve the value


 * url -- Source of the statement (corresponds to S854)


 * resolve_property -- Whether *property* is already a Wikidata ID or needs to be resolved


 * resolve_value -- Whether *value* can be inserted into the statement as-is or needs to be resolved


 * kwargs -- additional information used to resolve *value*

strephit.commons.wikidata.format_date(year, month, day)


 * Formats a date according to Wikidata syntax. Assumes that the date is mostly correct. The allowed values of the parameters are shown in the following truth table




 * Parameters:
 * year -- year of the date


 * month -- month of the date. Only positive values allowed


 * day -- day of the date. Only positive values allowed

strephit.commons.wikidata.get_entities(ids, batch)


 * Retrieve Wikidata entities metadata.


 * Parameters:
 * ids (list) -- list of Wikidata entity IDs


 * batch (int) -- number of IDs per call, to serve as paging for the API.


 * Returns:
 * dict of Wikidata entities with metadata


 * Return type:
 * dict
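
The role of the batch parameter as a paging mechanism can be sketched without any network access. The helpers `batches`, `fetch_all` and `fake_fetch` are hypothetical; `fake_fetch` stands in for a single API call and does not reflect the real response format.

```python
def batches(ids, batch):
    """Yield consecutive slices of `ids`, each holding at most `batch` IDs."""
    for start in range(0, len(ids), batch):
        yield ids[start:start + batch]


def fetch_all(ids, batch, fetch):
    """Merge the dicts returned by `fetch` for every batch of IDs."""
    entities = {}
    for chunk in batches(ids, batch):
        entities.update(fetch(chunk))
    return entities


def fake_fetch(chunk):
    # Hypothetical stand-in for one API call; no network access
    return {entity_id: {'id': entity_id} for entity_id in chunk}


pages = list(batches(['Q1', 'Q2', 'Q3', 'Q4', 'Q5'], 2))
entities = fetch_all(['Q1', 'Q2', 'Q3'], 2, fake_fetch)
```

Each batch corresponds to one request, so a smaller batch value means more, but lighter, calls.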

strephit.commons.wikidata.get_labels_and_aliases(entities, language_code)


 * Extract language-specific label and aliases from a list of Wikidata entities metadata.


 * Parameters:
 * entities (list) -- list of Wikidata entities with metadata.


 * language_code (str) -- 2-letter language code, e.g., *en* for English


 * Returns:
 * dict of entities, with label and aliases only


 * Return type:
 * dict

strephit.commons.wikidata.get_property_ids(batch)


 * Get the full list of Wikidata property IDs (pids).


 * Parameters:
 * batch (int) -- number of pids per call, to serve as paging for the API.


 * Returns:
 * list of all pids


 * Return type:
 * list

strephit.commons.wikidata.honorifics_resolver(property, value, language, **kwargs)

strephit.commons.wikidata.identity_resolver(property, value, language, **kwargs)


 * Default resolver, converts to unicode and surrounds with double quotes

strephit.commons.wikidata.parse_date(date, precision=None)


 * Tries to parse a date serialized according to the wikidata format into its components year, month and day


 * Returns:
 * dict (year, month, day)

strephit.commons.wikidata.resolve(property, value, language, **kwargs)


 * Tries to resolve the Wikidata ID of an object given its string representation


 * Parameters:
 * property -- Wikidata ID of the property to resolve


 * value -- String value


 * language -- Search only this language


 * kwargs -- Additional info that might be useful to help the resolver

strephit.commons.wikidata.resolver(*properties)


 * Decorator to register a function as resolver for the given properties.

strephit.commons.wikidata.search(term, language, type_=None)


 * Uses the Wikidata APIs to search for a term. Can optionally specify a type (corresponding to the 'instance of' P31 Wikidata property). If no type is specified, simply returns all the items containing *term* in *label*


 * Parameters:
 * term -- The term to look for


 * language -- Search in this language


 * type_ -- Type of the entity to look for, Wikidata numeric ID (i.e. without the leading Q). Can be an int or anything iterable


 * Returns:
 * List of dicts with details (which details depend on *type_*)

Module contents
= strephit.corpus_analysis package =

strephit.corpus_analysis.compute_lu_distribution module
strephit.corpus_analysis.compute_lu_distribution.worker_with_sentences(bio)

strephit.corpus_analysis.compute_lu_distribution.worker_with_sub_sentences(bio)

strephit.corpus_analysis.extract_framenet_frames module
strephit.corpus_analysis.extract_framenet_frames.extract_top_corpus_tokens(enriched_lemmas, all_lemma_tokens)


 * Extract the subset of corpus lemmas with tokens given the set of top lemmas


 * Parameters:
 * enriched_lemmas (dict) -- Dict returned by "intersect_lemmas_with_framenet"


 * all_lemma_tokens (dict) -- Dict of all corpus lemmas with tokens


 * Returns:
 * the top lemmas with tokens dict


 * Return type:
 * dict

strephit.corpus_analysis.extract_framenet_frames.get_top_n_lus(ranked_lus, n)


 * Extract the top N Lexical Units (LUs) from a ranking.


 * Parameters:
 * ranked_lus (dict) -- LUs ranking, as returned by "compute_ranking"


 * n (int) -- Number of top LUs to return


 * Returns:
 * the top N LUs with their ranking scores


 * Return type:
 * dict
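
Selecting the top N entries of a ranking can be sketched as follows, assuming that the ranking is a plain dict from LU to score and that higher scores are better. Both points, and the name `top_n`, are assumptions and are not confirmed by the documentation.

```python
def top_n(ranked, n):
    """Return the `n` entries of `ranked` with the highest scores."""
    best = sorted(ranked.items(), key=lambda pair: pair[1], reverse=True)[:n]
    return dict(best)


ranking = {'bear': 0.9, 'die': 0.7, 'marry': 0.4, 'paint': 0.2}
top = top_n(ranking, 2)
```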

strephit.corpus_analysis.extract_framenet_frames.intersect_lemmas_with_framenet(corpus_lemmas, wikidata_properties)


 * Intersect verb lemmas extracted from the input corpus with FrameNet Lexical Units (LUs).


 * Parameters:
 * corpus_lemmas (dict) -- dict of verb lemmas with their ranking scores


 * wikidata_properties (dict) -- dict with all Wikidata properties


 * Returns:
 * a dictionary of corpus lemmas enriched with FrameNet LUs data (dicts)


 * Return type:
 * dict

strephit.corpus_analysis.rank_verbs module
class strephit.corpus_analysis.rank_verbs.PopularityRanking(corpus_path, pos_tag_key)

find_ranking(processes=0, bulk_size=10000, normalize=True)

static score_from_tokens(tokens)

class strephit.corpus_analysis.rank_verbs.TFIDFRanking(vectorizer, verbs, tfidf_matrix)

find_ranking(processes=0)

score_lemma(lemma)

strephit.corpus_analysis.rank_verbs.compute_tf_idf_matrix(corpus_path, document_key)


 * Computes the TF-IDF matrix of the corpus


 * Parameters:
 * corpus_path -- path of the corpus


 * document_key -- where the textual content is in the corpus


 * Returns:
 * a vectorizer and the computed matrix


 * Return type:
 * tuple

strephit.corpus_analysis.rank_verbs.get_similarity_scores(verb_token, vectorizer, tf_idf_matrix)


 * Compute the cosine similarity score of a given verb token against the input corpus TF/IDF matrix.


 * Parameters:
 * verb_token (str) -- Surface form of a verb, e.g., born


 * Returns:
 * cosine similarity score


 * Return type:
 * ndarray

strephit.corpus_analysis.rank_verbs.harmonic_ranking(*rankings)


 * Combines individual rankings with a harmonic mean to obtain a final ranking


 * Parameters:
 * rankings -- dictionary of individual rankings


 * Returns:
 * the new, combined ranking
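
For reference, the harmonic mean of n positive scores is n divided by the sum of their reciprocals. The sketch below combines rankings that way, assuming each ranking is a dict from lemma to score and that only lemmas present in every ranking are kept. How the library treats missing lemmas or zero scores is not stated, so those choices, and the names `harmonic_mean` and `combine`, are assumptions.

```python
def harmonic_mean(values):
    """Harmonic mean of non-negative numbers; 0.0 if any value is zero."""
    if any(value == 0 for value in values):
        return 0.0
    return len(values) / sum(1.0 / value for value in values)


def combine(*rankings):
    """Combine rankings over the items that appear in all of them."""
    common = set(rankings[0]).intersection(*rankings[1:])
    return {item: harmonic_mean([r[item] for r in rankings]) for item in common}


popularity = {'bear': 1.0, 'die': 0.5, 'paint': 0.2}
tfidf = {'bear': 0.5, 'die': 0.5}
combined = combine(popularity, tfidf)
```

The harmonic mean penalizes items that score poorly in any single ranking more strongly than an arithmetic mean would.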

strephit.corpus_analysis.rank_verbs.produce_lemma_tokens(pos_tagged_path, pos_tag_key, language)


 * Extracts a map from lemma to all its tokens


 * Parameters:
 * pos_tagged_path -- path of the pos-tagged corpus


 * pos_tag_key -- where the pos tag data is in each item


 * language -- language of the corpus


 * Returns:
 * mapping from lemma to tokens


 * Return type:
 * dict

strephit.corpus_analysis.statistics module
strephit.corpus_analysis.statistics.bulkenize(iterable, size)

strephit.corpus_analysis.test_pos_taggers module
strephit.corpus_analysis.test_pos_taggers.tag(text, tt_home)

Module contents
= strephit.extraction package =

strephit.extraction.extract_sentences module
class strephit.extraction.extract_sentences.GrammarExtractor(corpus, pos_tag_key, document_key, sentences_key, language, lemma_to_token, match_base_form)


 * Bases: "strephit.extraction.extract_sentences.SentenceExtractor"


 * Grammar-based extraction strategy: pick sentences that comply with a pre-defined grammar.

extract_from_item(item)

grammars = {'en': '\n               NOPH: {???*+?}\n                CHUNK: {+?+?+}\n               ', 'it': '\n                SN: {?*?+?<ADJ|VER:pper>*}\n                CHUNK: {<SN><VER.*>+<SN>}\n               '}

parser = None

setup_extractor

splitter = None

class strephit.extraction.extract_sentences.ManyToManyExtractor(corpus, pos_tag_key, document_key, sentences_key, language, lemma_to_token, match_base_form)


 * Bases: "strephit.extraction.extract_sentences.SentenceExtractor"


 * n2n extraction strategy: many sentences per many LUs. N.B.: the same sentence is likely to appear multiple times.

extract_from_item(item)

setup_extractor

splitter = None

class strephit.extraction.extract_sentences.OneToOneExtractor(corpus, pos_tag_key, document_key, sentences_key, language, lemma_to_token, match_base_form)


 * Bases: "strephit.extraction.extract_sentences.SentenceExtractor"


 * 121 extraction strategy: 1 sentence per 1 LU. N.B.: the same sentence will appear only once; the sentence is assigned to a RANDOM LU.

all_verb_tokens = None

extract_from_item(item)

setup_extractor

splitter = None

token_to_lemma = None

class strephit.extraction.extract_sentences.SentenceExtractor(corpus, pos_tag_key, document_key, sentences_key, language, lemma_to_token, match_base_form)


 * Base class for sentence extractors.

extract(processes=0)


 * Processes the corpus extracting sentences from each item and storing them in the item itself.

extract_from_item(item)


 * Extract sentences from an item. Relies on *setup_extractor* having been called


 * Parameters:
 * item (dict) -- Item from which to extract sentences


 * Returns:
 * The original item and list of extracted sentences


 * Return type:
 * tuple of dict, list

setup_extractor


 * Optional setup code, run before starting the extraction

teardown_extractor


 * Optional teardown code, run after the extraction
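
The life cycle described by these methods (setup, per-item extraction, teardown) follows the template method pattern. The sketch below shows that pattern in a sequential form; the class names `Extractor` and `KeywordExtractor`, the reduced constructor and the keyword strategy are hypothetical and much simpler than the real extractors.

```python
class Extractor(object):
    """Skeleton of the setup / extract_from_item / teardown life cycle."""

    def __init__(self, corpus, document_key, sentences_key):
        self.corpus = corpus
        self.document_key = document_key
        self.sentences_key = sentences_key

    def setup_extractor(self):
        pass  # optional hook, run before the extraction

    def teardown_extractor(self):
        pass  # optional hook, run after the extraction

    def extract_from_item(self, item):
        raise NotImplementedError()

    def extract(self):
        """Yield each item, updated with the sentences extracted from it."""
        self.setup_extractor()
        try:
            for item in self.corpus:
                item, sentences = self.extract_from_item(item)
                item[self.sentences_key] = sentences
                yield item
        finally:
            self.teardown_extractor()


class KeywordExtractor(Extractor):
    """Toy strategy: keep the sentences that contain a given keyword."""

    def setup_extractor(self):
        self.keyword = 'born'

    def extract_from_item(self, item):
        sentences = item[self.document_key].split('. ')
        return item, [s for s in sentences if self.keyword in s]


corpus = [{'bio': 'She was born in London. She wrote programs'}]
result = list(KeywordExtractor(corpus, 'bio', 'sentences').extract())
```

Subclasses only override the hooks they need, while the base class keeps control of the overall loop.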

class strephit.extraction.extract_sentences.SyntacticExtractor(corpus, pos_tag_key, document_key, sentences_key, language, lemma_to_token, match_base_form)


 * Bases: "strephit.extraction.extract_sentences.SentenceExtractor"


 * Tries to split sentences into sub-sentences so that each of them contains only one LU

all_verbs = None

extract_from_item(item)

find_sub_sentences(tree)

find_terminals(tree, label=None)

parser = None

setup_extractor

splitter = None

token_to_lemma = None

strephit.extraction.extract_sentences.extract_sentences(corpus, pos_tag_key, sentences_key, document_key, language, lemma_to_tokens, strategy, match_base_form, processes=0)


 * Extract sentences from the given corpus by matching tokens against a given set.


 * Parameters:
 * corpus -- Pos-tagged corpus, as an iterable of documents


 * sentences_key (str) -- dict key where to put extracted sentences


 * pos_tag_key (str) -- dict key where the pos-tagged text is


 * document_key (str) -- dict key where the textual document is


 * language (str) -- ISO 639-1 language code used for tokenization and sentence splitting


 * lemma_to_tokens (dict) -- Dict with corpus lemmas as keys and tokens to be matched as values


 * strategy (str) -- One of the 4 extraction strategies ['121', 'n2n', 'grammar', 'syntactic']


 * match_base_form (bool) -- Whether to match the base form of verbs


 * processes (int) -- How many concurrent processes to use


 * Returns:
 * The corpus, updated with the extracted sentences, and the number of extracted sentences


 * Return type:
 * generator of tuples
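The idea of "matching tokens against a given set" can be sketched as follows. This is a simplified stand-in, not StrepHit's code: `extract_matching_sentences` is a hypothetical name, sentences are split with a naive regular expression, and the POS tags, the `strategy` choice, `match_base_form` and `processes` are all ignored. The shape of each stored sentence (a dict with `text` and `lemmas`) and the `(item, count)` tuples are assumptions made for illustration.

```python
import re


def extract_matching_sentences(corpus, document_key, sentences_key, lemma_to_tokens):
    """Yield (item, count) pairs, storing matching sentences under sentences_key."""
    # Invert the mapping so each surface token points back to its lemma
    token_to_lemma = {
        token.lower(): lemma
        for lemma, tokens in lemma_to_tokens.items()
        for token in tokens
    }
    for item in corpus:
        extracted = []
        # Naive splitter: break after '.', '!' or '?' followed by whitespace
        for sentence in re.split(r'(?<=[.!?])\s+', item.get(document_key, '')):
            words = re.findall(r'\w+', sentence.lower())
            lemmas = sorted({token_to_lemma[w] for w in words if w in token_to_lemma})
            if lemmas:
                extracted.append({'text': sentence, 'lemmas': lemmas})
        item[sentences_key] = extracted
        yield item, len(extracted)


# Made-up corpus with one document and two lemmas to match
corpus = [{'bio': 'She was born in Rome. She studied law. He died in 1900.'}]
lemma_to_tokens = {'bear': ['born', 'bore'], 'die': ['died', 'dies']}
results = list(extract_matching_sentences(corpus, 'bio', 'sentences', lemma_to_tokens))
```

Here the first and third sentences are kept because they contain a token listed in `lemma_to_tokens`, while the second is dropped.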

strephit.extraction.process_semistructured module
strephit.extraction.process_semistructured.serialize_item


 * Converts an item to QuickStatements. Takes a single tuple as its parameter.

Module contents
= strephit.web_sources_corpus.spiders package =

strephit.web_sources_corpus.spiders.BaseSpider module
class strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"


 * Generic base spider, to abstract most of the work. Specify the selectors to suit the website to scrape. The spider first uses a list of selectors to reach a page containing the list of items to scrape. Another selector is used to extract urls pointing to detail pages, containing the details of the items to scrape. Finally a third selector is used to extract the url pointing to the next "list" page.




 * *list_page_selectors* is a list of selectors used to reach the page containing the items to scrape. Each selector is applied to the page(s) fetched by extracting the url from the previous page using the preceding selector.


 * *detail_page_selectors* extract the urls pointing to the detail pages. Can be a single selector or a list.


 * *next_page_selectors* extracts the url pointing to the next page.




 * Selectors starting with *css:* are CSS selectors, and those starting with *xpath:* are XPath selectors. All others should follow the syntax *method:selector*, where *method* is the name of a method of the spider and *selector* is another selector specified in the same way as above. The method is used to transform the result obtained by extracting the item pointed to by the selector. It should accept the response as its first parameter and, only if a selector is specified, the result of extracting the data pointed to by that selector as its second.


 * The spider provides a simple method to parse items. The item class is specified in *item_class* (it must inherit from *scrapy.Item*) and the item fields are specified in the dict *item_fields*, whose keys are field names and whose values are selectors following the syntax described above. Values can also be arbitrarily nested lists or dicts that ultimately contain selectors.


 * Each item can be processed and refined by the method *refine_item*.
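The selector syntax described above can be made concrete with a small parser. This is a minimal sketch, not the spider's actual code: `parse_selector` is a hypothetical helper, and it assumes that every selector string eventually ends in a *css:* or *xpath:* part. Method names are peeled off from the left until one of those two prefixes is found, so colons inside the XPath expression itself (for example in `following-sibling::p`) are left untouched.

```python
def parse_selector(spec):
    """Split a selector string into (methods, kind, expression).

    `methods` lists the method names from outermost to innermost,
    `kind` is 'css' or 'xpath', and `expression` is the raw selector.
    """
    methods = []
    while True:
        for kind in ('css', 'xpath'):
            prefix = kind + ':'
            if spec.startswith(prefix):
                return methods, kind, spec[len(prefix):]
        # Not a css:/xpath: prefix, so the leading part names a method
        method, sep, rest = spec.partition(':')
        if not sep:
            raise ValueError('selector has no css: or xpath: part: %r' % spec)
        methods.append(method)
        spec = rest


plain = parse_selector('css:div.bio p')
chained = parse_selector(
    'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()')
with_axis = parse_selector(
    'clean:xpath:.//p[@class="head"]/following-sibling::p[1]/strong/text()')
```

Because each method wraps the selector to its right, the methods would be applied in reverse order of appearance: in the `chained` example, `clean` runs on the extracted data first and `get_name_from_title` runs on its result.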

clean(response, strings, unicode=True)


 * Utility function to clean strings. Can be used within your selectors

'''detail_page_selectors = None

get_elements_from_selector(response, selector)

'''item_class = None

'''item_fields = {}

'''list_page_selectors = None

make_url_absolute(page_url, url)

'''next_page_selectors = None

parse(response)


 * First stage of the spider with the goal of reaching the list page.

parse_detail(response)


 * Third stage of the spider, parses the detail page to produce an item

parse_list(response)


 * Second stage of the spider implementing pagination

refine_item(response, item)


 * Applies any custom post-processing to the item; override if needed. Return None to discard the item.
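How nested *item_fields* might be filled in and then passed through *refine_item* can be sketched without Scrapy. Everything here is illustrative: `fill_fields` is a hypothetical helper, the "page" is a plain dict mapping selector strings to made-up values, and the `refine_item` shown is only an example override that mirrors the documented signature and the rule that returning None discards the item.

```python
def fill_fields(fields, extract):
    """Recursively replace every selector string in `fields` with extracted data."""
    if isinstance(fields, dict):
        return {key: fill_fields(value, extract) for key, value in fields.items()}
    if isinstance(fields, list):
        return [fill_fields(value, extract) for value in fields]
    # Leaf: a selector string, resolved by the supplied extract function
    return extract(fields)


def refine_item(response, item):
    """Example post-processing: drop items without a name, tidy the rest."""
    if not item.get('name'):
        return None
    item['name'] = item['name'].strip()
    return item


# Fake page: maps selector strings to the data they would extract (made-up values)
page = {
    'xpath:.//h1/text()': '  Jane Doe ',
    'xpath:.//p//text()': 'A made-up biography.',
    'xpath:.//span[@class="born"]/text()': '1900',
}
item_fields = {
    'name': 'xpath:.//h1/text()',
    'bio': 'xpath:.//p//text()',
    'other': {'born': 'xpath:.//span[@class="born"]/text()'},
}

item = refine_item(None, fill_fields(item_fields, page.get))
discarded = refine_item(None, fill_fields({'name': 'xpath:.//h2/text()'}, page.get))
```

The second call finds no data for its selector, so the example `refine_item` returns None and the item is dropped.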

strephit.web_sources_corpus.spiders.academia_net module
class strephit.web_sources_corpus.spiders.academia_net.AcademiaNetSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.academia-net.org']

'''detail_page_selectors = 'xpath:.//li[@class="profil"]/div[1]/a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'name': 'clean:xpath:.//h1[contains(@class, "profilname")]/text()'}

'''list_page_selectors = None

'''name = 'academia_net'

'''next_page_selectors = 'xpath:.//div[@class="jumplist"]/a[last()]/@href'

refine_item(response, item)

'''start_urls = ('http://www.academia-net.org/search/?sv%5Barea_id%5D%5B0%5D=1252&sv%5Br_rbs_fachgebiete%5D%5B0%5D=&_seite=1',)

strephit.web_sources_corpus.spiders.american_bio module
class strephit.web_sources_corpus.spiders.american_bio.AmericanBioSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[1]//tr[3]//a/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}

'''list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[2]//ul[1]/li/a/@href'

'''name = 'american_bio'

'''next_page_selectors = None

'''start_urls = ('https://en.wikisource.org/wiki/Appletons%27_Cyclop%C3%A6dia_of_American_Biography',)

strephit.web_sources_corpus.spiders.australasian_bio module
class strephit.web_sources_corpus.spiders.australasian_bio.AustralasianBioSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//tr[2]//a/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}

'''list_page_selectors = None

'''name = 'australasian_bio'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/The_Dictionary_of_Australasian_Biography',)

strephit.web_sources_corpus.spiders.australian_dictionary_of_biography module
class strephit.web_sources_corpus.spiders.australian_dictionary_of_biography.AustralianDictionaryOfBiographySpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"


 * A spider for the Australian Dictionary of Biography website

'''allowed_domains = ['adb.anu.edu.au']

'''name = 'australian_dictionary_of_biography'

parse(response)

parse_person(response)

'''start_urls = ['http://adb.anu.edu.au/biographies/name/']

strephit.web_sources_corpus.spiders.bbc_co_uk module
class strephit.web_sources_corpus.spiders.bbc_co_uk.BbcCoUkSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.bbc.co.uk']

'''detail_page_selectors = 'xpath:.//a[@class="artist"]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="info"]/div[@id="bio"]//text()', 'other': {'read-more': 'clean:xpath:.//div[@id="info"]//div[@id="read-more"]//text()', 'short-desc': 'xpath:.//div[@id="info"]/ul[@id="short-desc"]/li//text()', 'oup': 'clean:xpath:.//div[@id="info"]/div[@id="oup"]/p[1]/text()', 'how-to-cite': 'clean:xpath:.//div[@id="how-to-cite"]//text()'}, 'name': 'clean:xpath:.//div[@id="info"]/h1/text()'}

'''list_page_selectors = None

'''name = 'bbc_co_uk'

'''next_page_selectors = 'xpath:.//div[@class="topPagination"]//li[@class="next"]//a/@href'

refine_item(response, item)

start_requests()

strephit.web_sources_corpus.spiders.bio_english_lit module
class strephit.web_sources_corpus.spiders.bio_english_lit.BioEnglishLitSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li/a/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}

'''list_page_selectors = None

'''name = 'bio_english_lit'

'''next_page_selectors = None

'''start_urls = ('https://en.wikisource.org/wiki/A_Short_Biographical_Dictionary_of_English_Literature',)

strephit.web_sources_corpus.spiders.bishops module
class strephit.web_sources_corpus.spiders.bishops.BishopsSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.catholic-hierarchy.org']

clean_name(response, name)

'''detail_page_selectors = 'xpath:/html/body/ul/li/a[1]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'name': 'clean_name:clean:xpath:.//h1[@align="center"]//text()'}

'''list_page_selectors = 'xpath:.//a[starts-with(@href, "la")]/@href'

'''name = 'bishops'

'''next_page_selectors = None

parse_bio(response)

parse_microdata(response)

parse_other(response)

refine_item(response, item)

'''start_urls = ('http://www.catholic-hierarchy.org/bishop/la.html',)

strephit.web_sources_corpus.spiders.brown_edu module
class strephit.web_sources_corpus.spiders.brown_edu.BrownEduSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.brown.edu']

'''custom_settings = {'DOWNLOAD_DELAY': 0.5, 'RETRY_HTTP_CODES': ['403']}

'''detail_page_selectors = 'xpath:.//div[@class="index"]//a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@class="index"]//text()', 'other': {'credit': 'clean:xpath:.//div[@class="credit"]//text()'}, 'name': 'clean:xpath:.//p[@class="head"]/following-sibling::p[1]/strong/text()'}

'''list_page_selectors = None

'''name = 'brown_edu'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://www.brown.edu/Administration/News_Bureau/Databases/Encyclopedia/',)

strephit.web_sources_corpus.spiders.catholic_encyclopedia module
class strephit.web_sources_corpus.spiders.catholic_encyclopedia.CatholicEncyclopediaSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/table[1]//tr[4]//a/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}

'''list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[1]//a/@href'

'''name = 'catholic_encyclopedia'

'''next_page_selectors = None

'''start_urls = ('https://en.wikisource.org/wiki/Catholic_Encyclopedia_%281913%29',)

strephit.web_sources_corpus.spiders.cesar_org_uk module
class strephit.web_sources_corpus.spiders.cesar_org_uk.CesarOrgUkSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['cesar.org.uk']

'''detail_page_selectors = 'xpath:.//td[@id="keywordColumn"]//a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''list_page_selector = None

'''name = 'cesar_org_uk'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('http://cesar.org.uk/cesar2/people/people.php?fct=list&search=%25&listMaxRows=999999',)

strephit.web_sources_corpus.spiders.chinese_bio module
class strephit.web_sources_corpus.spiders.chinese_bio.ChineseBioSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@class="poem"]//a[not(@class="new")]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p//text()', 'name': 'clean:xpath://div[@id="headerContainer"]/following-sibling::div[1]//p/b[1]/text()'}

'''list_page_selectors = None

'''name = 'chinese_bio'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/A_Chinese_Biographical_Dictionary',)

strephit.web_sources_corpus.spiders.christian_bio module
class strephit.web_sources_corpus.spiders.christian_bio.ChristianBioSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''base_url = 'https://en.wikisource.org/wiki/Dictionary_of_Christian_Biography_and_Literature_to_the_End_of_the_Sixth_Century/'

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul//a/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}

'''list_page_selectors = None

'''name = 'christian_bio'

'''next_page_selectors = None

start_requests()

strephit.web_sources_corpus.spiders.cooperhewitt_org module
class strephit.web_sources_corpus.spiders.cooperhewitt_org.CooperhewittOrgSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['collection.cooperhewitt.org']

'''detail_page_selectors = 'get_detail_page:xpath:.//div[@class="row"]/div[2]/ul[@class="list-o-things"]//h1/a/@href'

get_detail_page(response, urls)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[contains(@class, "person-bio")]/p//text()', 'name': 'clean:xpath:.//div[@class="page-header"]/h1/a/text()'}

'''list_page_selectors = None

'''name = 'cooperhewitt_org'

'''next_page_selectors = 'xpath:.//ul[@class="pagination"]/li[last()]/a/@href'

refine_item(response, item)

'''start_urls = ('http://collection.cooperhewitt.org/people/page1',)

strephit.web_sources_corpus.spiders.design_and_art_australia_online module
class strephit.web_sources_corpus.spiders.design_and_art_australia_online.DesignAndArtAustraliaOnlineSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"


 * A spider for the Design & Art Australia Online website

'''allowed_domains = ['www.daao.org.au']

'''name = 'design_and_art_australia_online'

parse(response)

parse_bio(response)

parse_person(response)

'''start_urls = ['https://www.daao.org.au/search/?q&selected_facets=record_type_exact%3APerson&page=1&advanced=false&view=view_list&results_per_page=100']

strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org module
class strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org.DictionaryofarthistoriansOrgSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['dictionaryofarthistorians.org']

'''detail_page_selectors = 'xpath:.//div[@class="navigation-by-letter"]/following-sibling::p/a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@class="arthist-publish-profile__body"]/p//text()', 'death': 'clean:xpath:.//div[@class="arthist-publish-profile__deathdate"]/p//text()', 'name': 'clean:xpath:.//h1[@class="arthist-publish-profile__name"]//text()', 'birth': 'clean:xpath:.//div[@class="arthist-publish-profile__birthdate"]/p//text()'}

'''list_page_selectors = None

'''name = 'dictionaryofarthistorians_org'

'''next_page_selectors = None

start_requests()

strephit.web_sources_corpus.spiders.dnb module
class strephit.web_sources_corpus.spiders.dnb.DictionaryOfNationalBiographySpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"


 * A spider for the Dictionary of National Biography, in Wikisource

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//table//li/a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div//p//text()'}

'''list_page_selectors = 'xpath:.//dd/a/@href'

'''name = 'dnb'

'''next_page_selectors = 'xpath:.//span[@id="headernext"]/a/@href'

refine_item(response, item)

'''start_urls = ['https://en.wikisource.org/wiki/Dictionary_of_National_Biography,_1885-1900']

strephit.web_sources_corpus.spiders.dsi module
class strephit.web_sources_corpus.spiders.dsi.DsiSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.uni-stuttgart.de']

'''detail_page_selectors = 'xpath:.//a[contains(., "Detail page of this illustrator")]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''list_page_selectors = None

'''name = 'dsi'

'''next_page_selectors = 'xpath:.//a[contains(., ">")]/@href'

'''page_url = 'http://www.uni-stuttgart.de/hi/gnt/dsi2/index.php?table_name=dsi&function=search&where_clause=&order=lastname&order_type=ASC&page=%d'

refine_item(response, item)

start_requests()

strephit.web_sources_corpus.spiders.english_artists module
class strephit.web_sources_corpus.spiders.english_artists.EnglishArtistsSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"

'''allowed_domains = ['en.wikisource.org']

finalize(item)

'''name = 'english_artists'

parse(response)

parse_detail(response)

'''start_urls = ('https://en.wikisource.org/wiki/A_Dictionary_of_Artists_of_the_English_School',)

text_from_node(node)

strephit.web_sources_corpus.spiders.freethinkers module
class strephit.web_sources_corpus.spiders.freethinkers.FreethinkersSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"

'''allowed_domains = ['en.wikisource.org']

'''name = 'freethinkers'

parse(response)

'''start_urls = ('https://en.wikisource.org/wiki/A_Biographical_Dictionary_of_Ancient,_Medieval,_and_Modern_Freethinkers',)

strephit.web_sources_corpus.spiders.gameo_org module
class strephit.web_sources_corpus.spiders.gameo_org.GameoOrgSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['gameo.org']

'''detail_page_selectors = 'xpath:.//table[@class="mw-allpages-table-chunk"]//a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]/h1[1]/preceding-sibling::*//text()'}

'''list_page_selectors = None

'''name = 'gameo_org'

'''next_page_selectors = 'xpath:.//td[@class="mw-allpages-nav"]/a[3]/@href'

parse_title(title)

refine_item(response, item)

'''start_urls = ('http://gameo.org/index.php?title=Special:AllPages&from=108+Chapel+%28100+Mile+House%2C+British+Columbia%2C+Canada%29',)

strephit.web_sources_corpus.spiders.genealogics module
class strephit.web_sources_corpus.spiders.genealogics.GenealogicsSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"


 * A spider for Leo's Genealogics website

'''allowed_domains = ['www.genealogics.org']

'''name = 'genealogics'

parse(response)

parse_person(response)

'''start_urls = ['http://www.genealogics.org/search.php?mybool=AND&nr=200']

strephit.web_sources_corpus.spiders.greek_roman_bio_myth module
class strephit.web_sources_corpus.spiders.greek_roman_bio_myth.GreekRomanBioMythSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li/a[not(@class="new")]/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]/p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]/text()'}

'''list_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul/li[position()>2]/a/@href'

'''name = 'greek_roman_bio_myth'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/Dictionary_of_Greek_and_Roman_Biography_and_Mythology',)

strephit.web_sources_corpus.spiders.indian_bio module
class strephit.web_sources_corpus.spiders.indian_bio.IndianBioSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[position()>4]//a/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}

'''list_page_selectors = None

'''name = 'indian_bio'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/The_Indian_Biographical_Dictionary_(1915)',)

strephit.web_sources_corpus.spiders.irish_officers module
class strephit.web_sources_corpus.spiders.irish_officers.IrishOfficersSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"

'''allowed_domains = ['en.wikisource.org']

'''name = 'irish_officers'

parse(response)

parse_detail(response)

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/Chronicle_of_the_law_officers_of_Ireland',)

strephit.web_sources_corpus.spiders.medical_bio module
class strephit.web_sources_corpus.spiders.medical_bio.MedicalBioSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[position()>1]//text()', 'other': {'born_died': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[1]/text()'}, 'name': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//p[1]/b/text()'}

'''list_page_selectors = 'xpath:(.//div[@id="mw-content-text"]//ol)[2]//a/@href'

'''name = 'medical_bio'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/American_Medical_Biographies',)

strephit.web_sources_corpus.spiders.men_at_the_bar module
class strephit.web_sources_corpus.spiders.men_at_the_bar.MenAtTheBarSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''base_url = 'https://en.wikisource.org/wiki/Men-at-the-Bar/Names_'

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}

'''list_page_selectors = None

'''name = 'men_at_the_bar'

'''next_page_selectors = None

refine_item(response, item)

start_requests()

strephit.web_sources_corpus.spiders.men_of_time module
class strephit.web_sources_corpus.spiders.men_of_time.MenOfTimeSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//text()', 'name': 'clean:xpath:.//span[@id="header_section_text"]//text()'}

'''list_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//ul//a[not(@class="new")]/@href'

'''name = 'men_of_time'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/Men_of_the_Time,_eleventh_edition',)

strephit.web_sources_corpus.spiders.metal_archives_com module
class strephit.web_sources_corpus.spiders.metal_archives_com.MetalArchivesComSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"

'''allowed_domains = ['www.metal-archives.com']

'''base_url = 'http://www.metal-archives.com/search/ajax-artist-search/?field=alias&query=%2Aa%2A+OR+%2Ae%2A+OR+%2Ai%2A+OR+%2Ao%2A+OR+%2Au%2A&sEcho=1&iDisplayStart={}'

'''name = 'metal_archives_com'

parse(response)

parse_detail(response)

parse_extern(response)

'''start_urls = ('http://www.metal-archives.com/search/ajax-artist-search/?field=alias&query=%2Aa%2A+OR+%2Ae%2A+OR+%2Ai%2A+OR+%2Ao%2A+OR+%2Au%2A&sEcho=1&iDisplayStart=0',)

strephit.web_sources_corpus.spiders.modern_english_bio module
class strephit.web_sources_corpus.spiders.modern_english_bio.ModernEnglishBioSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"

'''allowed_domains = ['en.wikisource.org']

'''name = 'modern_english_bio'

parse(response)

parse_detail(response)

'''start_urls = ('https://en.wikisource.org/wiki/Modern_English_Biography',)

strephit.web_sources_corpus.spiders.munksroll module
class strephit.web_sources_corpus.spiders.munksroll.MunksrollSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['munksroll.rcplondon.ac.uk']

'''detail_page_selectors = 'xpath:.//div[@id="maincontent"]/table//a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="prose"]//text()', 'name': 'clean:xpath:.//h2[@class="PageTitle"]/text()'}

'''list_page_selectors = None

'''name = 'munksroll'

'''next_page_selectors = None

refine_item(response, item)

start_requests()

strephit.web_sources_corpus.spiders.museothyssen_org module
class strephit.web_sources_corpus.spiders.museothyssen_org.MuseothyssenOrgSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.museothyssen.org']

'''detail_page_selectors = 'xpath:.//ul[@id="autoresAZ"]/li/ul/li/a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//span[@id="contReader1"]//text()', 'other': {'born': 'clean:xpath:.//dl[@class="datosAutor"]/dt[contains(., "Born/Dead:")]/following-sibling::dd[1]//text()'}, 'name': 'clean:xpath:.//dl[@class="datosAutor"]/dt[contains(., "Author:")]/following-sibling::dd[1]//text()'}

'''list_page_selectors = None

'''name = 'museothyssen_org'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('http://www.museothyssen.org/en/thyssen/artistas',)

strephit.web_sources_corpus.spiders.musicians module
class strephit.web_sources_corpus.spiders.musicians.MusiciansSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//table[@id="multicol"]//a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''list_page_selectors = ['xpath:.//span[@class="mw-headline"]/parent::h2/following-sibling::ul//a/@href', 'xpath:.//span[.="Articles"]/parent::h2/following-sibling::ul//a/@href']

'''name = 'musicians'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/A_Dictionary_of_Music_and_Musicians',)

strephit.web_sources_corpus.spiders.national_bio module
class strephit.web_sources_corpus.spiders.national_bio.NationalBioSpider(year)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//table[@class="prettytable"]//tr[4]//a/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]/text()'}

'''list_page_selectors = None

'''name = 'national_bio'

'''next_page_selectors = None

strephit.web_sources_corpus.spiders.naval_bio module
class strephit.web_sources_corpus.spiders.naval_bio.NavalBioSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]/ul[position()>4]//a/@href'

get_name_from_title(response, title)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="mw-content-text"]//p[position()>1]//text()', 'name': 'get_name_from_title:clean:xpath:.//h1[@id="firstHeading"]//text()'}

'''list_page_selectors = None

'''name = 'naval_bio'

'''next_page_selectors = None

'''start_urls = ('https://en.wikisource.org/wiki/A_Naval_Biographical_Dictionary',)

strephit.web_sources_corpus.spiders.newulsterbiography_co_uk module
class strephit.web_sources_corpus.spiders.newulsterbiography_co_uk.NewulsterbiographyCoUkSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.newulsterbiography.co.uk']

'''detail_page_selectors = 'xpath:.//div[@id="search_results"]/p/a/@href'

get_bio(response, values)

get_name(response, values)

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'other': {'profession': 'xpath:.//span[@class="person_heading_profession"]//text()'}, 'bio': 'get_bio:xpath:.//div[@id="person_details"]/div/br[1]/preceding-sibling::*//text()', 'death': 'clean:xpath:.//div[@id="person_details"]/div/table[2]//tr[2]/td[2]/text()', 'name': 'get_name:xpath:.//h1[@class="person_heading"]/br/preceding-sibling::text()', 'birth': 'clean:xpath:.//div[@id="person_details"]/div/table[2]//tr[1]/td[2]/text()'}

'''list_page_selectors = None

'''name = 'newulsterbiography_co_uk'

'''next_page_selectors = None

'''start_urls = ('http://www.newulsterbiography.co.uk/index.php/home/browse/all',)

strephit.web_sources_corpus.spiders.nndb_com module
class strephit.web_sources_corpus.spiders.nndb_com.NndbComSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.nndb.com']

'''detail_page_selectors = 'xpath:.//a[contains(@href, "http://www.nndb.com/people/")]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'name': 'clean:xpath:.//td/font/b/text()'}

'''list_page_selectors = 'xpath:.//a[@class="newslink"]/@href'

'''name = 'nndb_com'

refine_item(response, item)

'''start_urls = ('http://www.nndb.com/',)

strephit.web_sources_corpus.spiders.parliament_uk module
class strephit.web_sources_corpus.spiders.parliament_uk.ParliamentUkSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.parliament.uk']

clean_name(response, name)

'''detail_page_selectors = 'xpath:.//table//tr/td/a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'name': 'clean_name:clean:xpath:.//div[@id="commons-biography-header"]/h1//text()'}

'''list_page_selectors = None

'''name = 'parliament_uk'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('http://www.parliament.uk/mps-lords-and-offices/mps/',)

strephit.web_sources_corpus.spiders.portraits_and_sketches module
class strephit.web_sources_corpus.spiders.portraits_and_sketches.PortraitsAndSketchesSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div[1]//text()', 'name': 'clean:xpath:(.//div[@class="tiInherit"]/p/span)[1]//text()'}

'''list_page_selectors = None

'''name = 'portraits_and_sketches'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/Cartoon_portraits_and_biographical_sketches_of_men_of_the_day',)

strephit.web_sources_corpus.spiders.rkd_nl module
class strephit.web_sources_corpus.spiders.rkd_nl.RKDArtistsSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"


 * A spider for RKD Netherlands Institute for Art History website

'''allowed_domains = ['rkd.nl']

'''detail_page_selectors = 'xpath:.//div[@class="header"]/a/@href'

extract_dl_key_value(dl_pairs, item)


 * Feed the item with key-value pairs extracted from <dl> tags

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'name': 'clean:xpath:.//h2/text()'}

'''list_page_selectors = None

'''name = 'rkd_nl'

'''next_page_selectors = 'xpath:.//a[@title="Next page"]/@href'

refine_item(response, item)

'''start_urls = ['https://rkd.nl/en/explore/artists']

strephit.web_sources_corpus.spiders.royalsociety_org module
class strephit.web_sources_corpus.spiders.royalsociety_org.RoyalsocietyOrgSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"

'''allowed_domains = ['royalsociety.org']

'''name = 'royalsociety_org'

parse(response)

parse_fellow(response)

start_requests()

'''start_urls = ('http://www.royalsociety.org/',)

strephit.web_sources_corpus.spiders.sculpture_uk module
class strephit.web_sources_corpus.spiders.sculpture_uk.SculptureUkSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['sculpture.gla.ac.uk']

'''detail_page_selectors = 'xpath:.//div[@class="featured"]/table//a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@class="featured"]/p[child::b][last()]/following-sibling::p//text()', 'death': 'clean:xpath:.//b[.="Died"]/following-sibling::text()[1]', 'name': 'clean:xpath:.//div[@class="featured"]/h1//text()', 'birth': 'clean:xpath:.//b[.="Born"]/following-sibling::text()[1]'}

'''list_page_selectors = 'xpath:.//div[@class="featuredpeople"]//a/@href'

'''name = 'sculpture_uk'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('http://sculpture.gla.ac.uk/browse/index.php',)

strephit.web_sources_corpus.spiders.structurae_net module
class strephit.web_sources_corpus.spiders.structurae_net.StructuraeNetSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['structurae.net']

'''detail_page_selectors = 'xpath:.//ol[@class="searchlist"]//a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'other': {'bibliography': 'xpath:.//div[@id="person-bibliography"]//li/a/@href', 'publications': 'xpath:.//div[@id="person-literature"]//li//a/@href', 'websites': 'xpath:.//div[@id="person-websites"]//li/a/@href', 'participated_in': 'xpath:.//div[@id="person-references"]//a/@href'}, 'name': 'clean:xpath:.//h1/span[@itemprop="name"]//text'}

'''list_page_selectors = 'xpath:.//ol[@class="commalist"]//a/@href'

'''name = 'structurae_net'

'''next_page_selectors = 'xpath:(.//div[@class="nextPageNav"])[1]//a[1]/@href'

refine_item(response, item)

'''start_urls = ('http://structurae.net/persons/',)

strephit.web_sources_corpus.spiders.vocab_getty_edu module
class strephit.web_sources_corpus.spiders.vocab_getty_edu.VocabGettyEduSpider(name=None, **kwargs)


 * Bases: "scrapy.spiders.Spider"

'''allowed_domains = ['vocab.getty.edu']

'''bio_query = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fbio2%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++skos%3AscopeNote+%3Fnote.%0D%0A+%3Fnote+rdf%3Avalue+%3Fbio2.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'

'''bio_query_2 = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FshortBio%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3Adescription+%3FshortBio.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'

'''birth_place_query = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FdeathPlace%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3AdeathPlace+%3Fdpf.%0D%0A+%3Fdp+foaf%3Afocus+%3Fdpf%3B%0D%0A++++++gvp%3AparentString+%3FdeathPlace.%0D%0A%7D&_implicit=false&implicit=true&_equivalent=false&_form=%2Fsparql'

'''birth_year_query = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fbirth%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+gvp%3AestStart+%3Fbirth.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'

'''completed_queries = set([])

'''db_connection = <sqlite3.Connection object>

'''death_place_query = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FbirthPlace%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3AbirthPlace+%3Fbpf.%0D%0A+%3Fbp+foaf%3Afocus+%3Fbpf%3B%0D%0A++++++gvp%3AparentString+%3FbirthPlace.%0D%0A%7D&_implicit=false&implicit=true&_equivalent=false&_form=%2Fsparql'

'''death_year_query = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fdeath%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+gvp%3AestEnd+%3Fdeath%3B%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'

finalize_data(table)


 * This method is called after *table* has been populated. Once all tables have been populated with data, it joins them and yields the polished items.

'''gender_query = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fgender%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3Agender+%3Fgender%3B%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'

load_into_db(table)

'''name = 'vocab_getty_edu'

'''name_query = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fname%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++gvp%3AprefLabelGVP+%3Flabel.%0D%0A%3Flabel+gvp%3Aterm+%3Fname%0D%0A%7D&_implicit=false&_equivalent=false&_form=%2Fsparql'

'''nationality_query = 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fnationality%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AnationalityPreferred+%3Fny.%0D%0A+%3Fny+gvp%3AprefLabelGVP+%3FlblNationality.%0D%0A+%3FlblNationality+gvp%3Aterm+%3Fnationality.+%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'

'''queries = [('name', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fname%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++gvp%3AprefLabelGVP+%3Flabel.%0D%0A%3Flabel+gvp%3Aterm+%3Fname%0D%0A%7D&_implicit=false&_equivalent=false&_form=%2Fsparql'), ('bio', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fbio2%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++skos%3AscopeNote+%3Fnote.%0D%0A+%3Fnote+rdf%3Avalue+%3Fbio2.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('bio2', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FshortBio%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3Adescription+%3FshortBio.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('nationality', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fnationality%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AnationalityPreferred+%3Fny.%0D%0A+%3Fny+gvp%3AprefLabelGVP+%3FlblNationality.%0D%0A+%3FlblNationality+gvp%3Aterm+%3Fnationality.+%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('birth_year', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fbirth%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+gvp%3AestStart+%3Fbirth.%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('birth_place', 
'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FdeathPlace%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3AdeathPlace+%3Fdpf.%0D%0A+%3Fdp+foaf%3Afocus+%3Fdpf%3B%0D%0A++++++gvp%3AparentString+%3FdeathPlace.%0D%0A%7D&_implicit=false&implicit=true&_equivalent=false&_form=%2Fsparql'), ('death_year', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fdeath%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+gvp%3AestEnd+%3Fdeath%3B%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql'), ('death_place', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3FbirthPlace%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3AbirthPlace+%3Fbpf.%0D%0A+%3Fbp+foaf%3Afocus+%3Fbpf%3B%0D%0A++++++gvp%3AparentString+%3FbirthPlace.%0D%0A%7D&_implicit=false&implicit=true&_equivalent=false&_form=%2Fsparql'), ('gender', 'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fgender%0D%0AWHERE+%7B%0D%0A%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++foaf%3Afocus+%3Ffocus.%0D%0A+%3Ffocus+gvp%3AbiographyPreferred+%3Fbio.%0D%0A+%3Fbio+schema%3Agender+%3Fgender%3B%0D%0A%7D&_implicit=false&_equivalent=false&equivalent=true&_form=%2Fsparql')]
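The query attributes above are URL-encoded requests to the Getty SPARQL endpoint, which makes them hard to read as listed. A minimal sketch, using only the Python standard library, that recovers the SPARQL text from one of them (`decode_sparql` is a hypothetical helper name; the URL is copied from `name_query`):

```python
from urllib.parse import parse_qs, urlsplit

# Copied verbatim from the name_query attribute.
NAME_QUERY = (
    'http://vocab.getty.edu/sparql.csv?query=SELECT+%3Fperson+%3Fname%0D%0AWHERE+%7B%0D%0A'
    '%3Fperson+rdf%3Atype+gvp%3APersonConcept%3B%0D%0A++++++++gvp%3AprefLabelGVP+%3Flabel.%0D%0A'
    '%3Flabel+gvp%3Aterm+%3Fname%0D%0A%7D&_implicit=false&_equivalent=false&_form=%2Fsparql'
)


def decode_sparql(url):
    """Return the decoded value of the 'query' parameter of a SPARQL URL."""
    params = parse_qs(urlsplit(url).query)  # decodes '+' and %XX escapes
    return params["query"][0].replace("\r\n", "\n")


print(decode_sparql(NAME_QUERY))
```

Decoding the other attributes the same way is a quick check that each query selects the variable its attribute name suggests.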

row_to_item(row)


 * Converts a single row, the result of joining all tables, into a finished item

start_requests()
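The descriptions of `load_into_db`, `finalize_data` and `row_to_item` suggest a load-then-join flow: each query result becomes a table, and once every table is present the tables are joined on the person identifier and each joined row becomes an item. The sketch below illustrates that flow with an in-memory SQLite database. The table layout, the sample rows, the `LEFT JOIN` strategy and the function signatures are all assumptions made for illustration, not the spider's actual implementation.

```python
import csv
import io
import sqlite3

# Hypothetical CSV payloads shaped like SPARQL CSV results (header + rows).
TABLES = {
    "name": "person,name\nhttp://example.org/p/1,Ada Example\nhttp://example.org/p/2,Bob Sample\n",
    "birth_year": "person,birth\nhttp://example.org/p/1,1815\n",
}


def load_into_db(conn, table, payload):
    """Store one two-column CSV result as a (person, value) table."""
    rows = list(csv.reader(io.StringIO(payload)))[1:]  # skip the header row
    conn.execute('CREATE TABLE "{}" (person TEXT PRIMARY KEY, value TEXT)'.format(table))
    conn.executemany('INSERT OR REPLACE INTO "{}" VALUES (?, ?)'.format(table), rows)


def row_to_item(row):
    """Turn a joined row into a plain dict, dropping missing values."""
    return {key: row[key] for key in row.keys() if row[key] is not None}


def finalize_data(conn, tables):
    """Join every table to the first one on 'person' and yield items."""
    first = tables[0]
    columns = ['"{0}".person'.format(first)]
    columns += ['"{0}".value AS "{0}"'.format(t) for t in tables]
    joins = ['LEFT JOIN "{0}" ON "{0}".person = "{1}".person'.format(t, first)
             for t in tables[1:]]
    sql = 'SELECT {} FROM "{}" {} ORDER BY "{}".person'.format(
        ", ".join(columns), first, " ".join(joins), first)
    for row in conn.execute(sql):
        yield row_to_item(row)


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets row_to_item read columns by name
for table, payload in TABLES.items():
    load_into_db(conn, table, payload)
items = list(finalize_data(conn, ["name", "birth_year"]))
print(items)
```

A left join keeps people for whom only some facts are known, which seems a reasonable default for sparse biographical data; whether the spider does the same is not stated here.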

strephit.web_sources_corpus.spiders.wga_hu module
class strephit.web_sources_corpus.spiders.wga_hu.WgaHuSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['www.wga.hu']

'''detail_page_selectors = ['xpath:.//table//td[@class="ARTISTLIST"]//a/@href', 'xpath:.//a[starts-with(@href, "/bio/")]/@href']

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//h3[.="Biography"]/following-sibling::p/text()', 'other': {'born-died': 'clean:xpath:.//div[@class="INDEX3"]//text()'}, 'name': 'clean:xpath:.//div[@class="INDEX2"]/text()'}

'''list_page_selectors = None

'''name = 'wga_hu'

'''next_page_selectors = None

refine_item(response, item)

start_requests()

strephit.web_sources_corpus.spiders.who_is_who_america module
class strephit.web_sources_corpus.spiders.who_is_who_america.WhoIsWhoAmericaSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//ul//a[not(@class="new")]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div//p[2]//text()', 'name': 'clean:xpath:.//div[@id="headerContainer"]/following-sibling::div//p/b/a/text()'}

'''list_page_selectors = 'xpath:.//table[@class="headertemplate"]//tr[3]//a[not(@class="new")]/@href'

'''name = 'who_is_who_america'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/Woman%27s_Who%27s_Who_of_America,_1914-15',)

strephit.web_sources_corpus.spiders.who_is_who_in_china module
class strephit.web_sources_corpus.spiders.who_is_who_in_china.WhoIsWhoInChinaSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['en.wikisource.org']

'''detail_page_selectors = 'xpath:.//div[@id="mw-content-text"]//table//a[not(@class="new")]/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean:xpath:.//div[@class="tiInherit"]/following-sibling::p//text()', 'name': 'clean:xpath:(.//p/b)[2]/text()'}

'''list_page_selectors = None

'''name = 'who_is_who_in_china'

'''next_page_selectors = None

refine_item(response, item)

'''start_urls = ('https://en.wikisource.org/wiki/Who%27s_Who_in_China_(3rd_edition)',)

strephit.web_sources_corpus.spiders.yba_llgc_org_uk module
class strephit.web_sources_corpus.spiders.yba_llgc_org_uk.YbaLlgcOrgUkSpider(name=None, **kwargs)


 * Bases: "strephit.web_sources_corpus.spiders.BaseSpider.BaseSpider"

'''allowed_domains = ['yba.llgc.org.uk']

clean_nu(response, strings)

'''detail_page_selectors = 'xpath:.//div[@id="text"]/p/a/@href'

'''item_class


 * alias of "WebSourcesCorpusItem"

'''item_fields = {'bio': 'clean_nu:xpath:.//div[@id="text"]//text()', 'other': {'sources': 'clean_nu:xpath:.//div[@id="text"]/div[@class="biog"]/ul/li[@class="bib_item"]//text()', 'contributer': 'clean_nu:xpath:.//div[@id="text"]/p[@class="contributer"]//text()', 'surname': 'clean_nu:xpath:.//div[@id="text"]/span[@class="article_header"]/b/span[@class="surname"]/text()', 'forename': 'clean_nu:xpath:.//div[@id="text"]/span[@class="article_header"]/b/span[@class="forename"]/text()'}}

'''list_page_selectors = None

'''name = 'yba_llgc_org_uk'

'''next_page_selectors = None

refine_item(response, item)

start_requests()

Module contents
= strephit.web_sources_corpus package =

Subpackages

 * strephit.web_sources_corpus.spiders package


 * Submodules


 * strephit.web_sources_corpus.spiders.BaseSpider module


 * strephit.web_sources_corpus.spiders.academia_net module


 * strephit.web_sources_corpus.spiders.american_bio module


 * strephit.web_sources_corpus.spiders.australasian_bio module


 * strephit.web_sources_corpus.spiders.australian_dictionary_of_biography module


 * strephit.web_sources_corpus.spiders.bbc_co_uk module


 * strephit.web_sources_corpus.spiders.bio_english_lit module


 * strephit.web_sources_corpus.spiders.bishops module


 * strephit.web_sources_corpus.spiders.brown_edu module


 * strephit.web_sources_corpus.spiders.catholic_encyclopedia module


 * strephit.web_sources_corpus.spiders.cesar_org_uk module


 * strephit.web_sources_corpus.spiders.chinese_bio module


 * strephit.web_sources_corpus.spiders.christian_bio module


 * strephit.web_sources_corpus.spiders.cooperhewitt_org module


 * strephit.web_sources_corpus.spiders.design_and_art_australia_online module


 * strephit.web_sources_corpus.spiders.dictionaryofarthistorians_org module


 * strephit.web_sources_corpus.spiders.dnb module


 * strephit.web_sources_corpus.spiders.dsi module


 * strephit.web_sources_corpus.spiders.english_artists module


 * strephit.web_sources_corpus.spiders.freethinkers module


 * strephit.web_sources_corpus.spiders.gameo_org module


 * strephit.web_sources_corpus.spiders.genealogics module


 * strephit.web_sources_corpus.spiders.greek_roman_bio_myth module


 * strephit.web_sources_corpus.spiders.indian_bio module


 * strephit.web_sources_corpus.spiders.irish_officers module


 * strephit.web_sources_corpus.spiders.medical_bio module


 * strephit.web_sources_corpus.spiders.men_at_the_bar module


 * strephit.web_sources_corpus.spiders.men_of_time module


 * strephit.web_sources_corpus.spiders.metal_archives_com module


 * strephit.web_sources_corpus.spiders.modern_english_bio module


 * strephit.web_sources_corpus.spiders.munksroll module


 * strephit.web_sources_corpus.spiders.museothyssen_org module


 * strephit.web_sources_corpus.spiders.musicians module


 * strephit.web_sources_corpus.spiders.national_bio module


 * strephit.web_sources_corpus.spiders.naval_bio module


 * strephit.web_sources_corpus.spiders.newulsterbiography_co_uk module


 * strephit.web_sources_corpus.spiders.nndb_com module


 * strephit.web_sources_corpus.spiders.parliament_uk module


 * strephit.web_sources_corpus.spiders.portraits_and_sketches module


 * strephit.web_sources_corpus.spiders.rkd_nl module


 * strephit.web_sources_corpus.spiders.royalsociety_org module


 * strephit.web_sources_corpus.spiders.sculpture_uk module


 * strephit.web_sources_corpus.spiders.structurae_net module


 * strephit.web_sources_corpus.spiders.vocab_getty_edu module


 * strephit.web_sources_corpus.spiders.wga_hu module


 * strephit.web_sources_corpus.spiders.who_is_who_america module


 * strephit.web_sources_corpus.spiders.who_is_who_in_china module


 * strephit.web_sources_corpus.spiders.yba_llgc_org_uk module


 * Module contents

strephit.web_sources_corpus.archive_org module
strephit.web_sources_corpus.archive_org.parse_and_save(text, separator, out_file, url)

strephit.web_sources_corpus.britishmuseum_org module
strephit.web_sources_corpus.britishmuseum_org.serialize_person(person)

strephit.web_sources_corpus.items module
class strephit.web_sources_corpus.items.WebSourcesCorpusItem(*args, **kwargs)


 * Bases: "scrapy.item.Item"

'''fields = {'bio': {}, 'death': {}, 'name': {}, 'url': {}, 'other': {}, 'birth': {}}

strephit.web_sources_corpus.pipelines module
'''class strephit.web_sources_corpus.pipelines.WebSourcesCorpusPipeline


 * Bases: "object"

process_item(item, spider)
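This section gives only the signature of `process_item`, not what `WebSourcesCorpusPipeline` does with an item. In Scrapy, a pipeline is a plain class whose `process_item(item, spider)` method returns the item (possibly modified) so that later stages can continue with it. The sketch below shows that contract with a hypothetical `StripWhitespacePipeline`; the trimming behaviour is invented for illustration, and a dict stands in for the item so the example runs without Scrapy installed.

```python
class StripWhitespacePipeline(object):
    """Hypothetical pipeline: trims string fields and passes the item on."""

    def process_item(self, item, spider):
        # Iterate over a copy of the keys because values are reassigned.
        for key in list(item.keys()):
            value = item[key]
            if isinstance(value, str):
                item[key] = value.strip()
        return item  # returning the item hands it to the next stage


pipeline = StripWhitespacePipeline()
cleaned = pipeline.process_item({"name": "  Ada Example ", "other": {}}, spider=None)
print(cleaned)
```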