
Big Data Engineer Resume


Marietta, OH

SUMMARY

  • 8+ years of experience in IT across all phases of the software development life cycle and Big Data Analytics, with hands-on experience writing MapReduce jobs on the Hadoop ecosystem, including HDFS, MapReduce, Hive, Pig, Spark, HBase, Flume, Oozie, Airflow, and Snowflake.
  • Hands-on experience in Azure Cloud Services (PaaS & IaaS), Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, Delta Lake, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Experience using Microsoft Azure, ADF, ADLS, Azure Blob Storage, and COSMOS.
  • Experience writing medium to complex SQL queries for data analysis, and in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats to uncover insights into customer usage patterns (see the sketch at the end of this summary).
  • Expert in data extraction, transformation, and loading (ETL) using tools such as SQL Server Integration Services (SSIS).
  • Hands-on experience setting up workflows with the Apache Airflow and Oozie workflow engines for managing and scheduling Hadoop jobs.
  • Hands-on experience with Unified Data Analytics on Databricks, the Databricks Workspace user interface, managing Databricks notebooks, and Delta Lake with Python and Spark SQL.
  • Extensively worked with VCS like CVS, SVN (Subversion), GIT, IBM Rational Clear Case and Harvest.
  • Expert in defining and deploying Cubes using SQL Server Analysis Services (SSAS).
  • Experience with the Requests, ReportLab, NumPy, SciPy, PyTables, cv2, python-twitter, Matplotlib, httplib2, urllib2, Beautiful Soup, and pandas (DataFrame) Python libraries during the development lifecycle.
  • Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, ResourceManager, NodeManager, NameNode, DataNode, and the MapReduce programming paradigm.
  • Experience migrating data with Sqoop between HDFS and relational database systems, and in configuring Flume to stream data into HDFS.
  • Experienced in customizing Pig Latin and Hive SQL scripts for data analysis and in developing UDFs and UDAFs to extend Hive and Pig Latin functionality.
  • Experience with the AWS platform and its features, including IAM, EC2, EBS, VPC, RDS, CloudWatch, CloudTrail, CloudFormation, AWS Config, Auto Scaling, CloudFront, S3, SQS, SNS, Lambda, and Route 53.
  • Hands-on experience developing Spark applications using Spark tools such as RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
  • Proficient in AWS Services including EC2, S3, RDS, Redshift, Glue, Athena, IAM, QuickSight.
  • Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as HBase and Cassandra.
  • Designed data models for data-intensive AWS Lambda applications that perform complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.
  • Experience implementing CI/CD pipelines using Azure DevOps (VSTS, TFS) in both cloud and on-premises environments with Git, MSBuild, Docker, and Maven, along with Jenkins plugins.
  • Proficient with container systems like Docker and container orchestration such as EC2 Container Service and Kubernetes; worked with Terraform.
  • Developed stored procedures and queries using PL/SQL; proficient in databases including Oracle, MS SQL Server, MySQL, and DB2.
  • Extensively used ETL methodology for testing and supporting data extraction, transformation, and loading processes in a corporate-wide ETL solution using Informatica (IDQ, DVO), Ab Initio, DataStage, and SSIS.
  • Experienced in NoSQL technologies such as MongoDB, CouchDB, Cassandra, and Redis, and relational databases such as Oracle, SQLite, PostgreSQL, and MySQL. Knowledgeable in continuous deployment using Heroku and Jenkins.
  • Well versed in Agile with Scrum, Waterfall, and test-driven development methodologies.
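
A minimal PySpark sketch of the Spark SQL aggregation work in Databricks described in this summary. The paths, column names, and output table are hypothetical placeholders (and unionByName with allowMissingColumns assumes Spark 3.1+), not details from any specific engagement.

```python
# Hypothetical example: combine CSV and JSON landing files, then aggregate usage per customer.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Read two of the file formats mentioned above into one DataFrame (placeholder paths).
csv_df = spark.read.option("header", True).csv("/mnt/raw/usage/*.csv")
json_df = spark.read.json("/mnt/raw/usage/*.json")
events = csv_df.unionByName(json_df, allowMissingColumns=True)

# Aggregate events per customer to surface usage patterns.
usage = (events.groupBy("customer_id")
               .agg(F.count("*").alias("event_count"),
                    F.countDistinct("session_id").alias("distinct_sessions")))

usage.write.mode("overwrite").saveAsTable("analytics.usage_summary")
```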

TECHNICAL SKILLS

Operating systems: Windows, Linux Ubuntu, UNIX

Languages: Python, SQL, PL/SQL, Scala

Databases: Oracle, MySQL, NoSQL, Apache Cassandra, MongoDB, Zenoss

IDEs/Tools: Eclipse, Toad, Sublime Text, Spyder, PyCharm, ETL

Version Control: Bitbucket, GitHub

Big Data Tools: Hadoop (Cloudera, Azure, Hortonworks), HDFS, MapReduce, HBase, Spark, Pig, Hive, Sqoop, Flume, MongoDB, Kafka, Cassandra, Oozie, Zookeeper, Impala, Solr

Deployment Tools: Heroku, Jenkins, Ansible, Redmine

PROFESSIONAL EXPERIENCE

Confidential, Marietta, OH

Big Data Engineer

Responsibilities:

  • Analyzed the Hadoop stack and different big data analytic tools, including Pig, Hive, the HBase database, and Sqoop.
  • Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from more than 20 sources with multiple file formats, including XML, JSON, CSV, and other compressed formats.
  • Involved in building the database model, APIs, and views using Python to build an interactive web-based solution.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data to and from different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool.
  • Worked with complex SQL, stored procedures, triggers, and packages in very large databases across various servers.
  • Worked on Python OpenStack APIs and used several Python libraries such as wxPython, NumPy, and Matplotlib.
  • Worked with Terraform templates to automate Azure IaaS virtual machines using Terraform modules, and deployed virtual machine scale sets in the production environment.
  • Worked on data processing, transformations, and actions in Spark using Python (PySpark).
  • Wrote templates for Azure infrastructure as code using Terraform to build staging and production environments.
  • Used Sqoop to import data into Cassandra tables from different relational databases such as Oracle and MySQL.
  • Generated workflows through Apache Airflow, and earlier Apache Oozie, to schedule the Hadoop jobs that drive large data transformations (see the DAG sketch after this list).
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Implemented data streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging.
  • Created several types of data visualizations using Python and Tableau.
  • Worked on Python OpenStack APIs and used Python scripts to update content in the database and manipulate files.
  • Worked with the OpenShift platform to manage Docker containers and Kubernetes clusters.
  • Dealt with data ambiguity and leveraged lazy evaluation in PySpark for code optimization.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
  • Created end-to-end solutions for ETL transformation jobs, including writing Informatica workflows and mappings.
  • Helped individual teams set up their repositories in Bitbucket, maintain their code, and configure jobs that make use of the CI/CD environment.
  • Created different types of indexes on different collections to get good performance in MongoDB.
  • Worked on stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
  • Involved in developing Python-based REST web services for sending and receiving data from external interfaces in JSON format, and in tracking sales and performing sales analysis using Django.
  • Provided cluster coordination services through ZooKeeper; installed and configured Hive and wrote Hive UDFs.
  • Worked on the Analytics Infrastructure team to develop a stream-filtering system on top of Apache Kafka and Storm.
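
Below is a hedged sketch of the kind of Airflow DAG used for the scheduling work referenced above. The DAG id, schedule, and spark-submit command line are illustrative assumptions (Airflow 2.x operator imports), not the project's actual workflow.

```python
# Illustrative Airflow DAG that schedules a daily Spark transformation job.
# All names and the spark-submit command line are placeholders.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"owner": "data-eng", "retries": 1, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="daily_usage_transform",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    run_transform = BashOperator(
        task_id="run_spark_transform",
        # {{ ds }} is Airflow's templated execution date.
        bash_command="spark-submit --master yarn /opt/jobs/usage_transform.py {{ ds }}",
    )
```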

Environment: Python, Hadoop, Spark, HDFS, Hive, Pig, HBase, Big Data, Apache Storm, Oozie, Sqoop, Kafka, Flume, Zookeeper, MapReduce, Cassandra, Scala, Linux, NoSQL, MySQL Workbench, Eclipse, Oracle

Confidential, CA

Big Data Developer

Responsibilities:

  • Handled importing of data from various data sources and performed data transformations using HAWQ and MapReduce.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Analyzed web log data using HiveQL and developed Hive queries on data logs to perform trend analysis of user behavior across various online modules.
  • Developed MapReduce programs for refined queries on big data and was involved in loading data from the UNIX file system to HDFS.
  • Loaded data into HDFS and extracted data from MySQL into HDFS using Sqoop.
  • Performed data comparison between SDP (Streaming Data Platform) real-time data, AWS S3 data, and Snowflake data using Databricks, Spark SQL, and Python.
  • Executed data analysis and data visualization on survey data using Tableau Desktop, and compared respondents' demographic data using univariate analysis in Python.
  • Applied write concerns to control the level of acknowledgement for MongoDB write operations and to avoid rollbacks.
  • Provided guidance to the development team working on PySpark as the ETL platform.
  • Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
  • Developed ETL solutions using Spark SQL in Azure Databricks for data extraction, transformation, and aggregation from multiple file formats and data sources to uncover insights into customer usage patterns.
  • Worked on migration of data from on-prem SQL Server to cloud databases (Azure Synapse Analytics (DW) and Azure SQL DB).
  • Developed Airflow workflows to schedule batch and real-time data movement from source to target.
  • Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks, and performance analysis, and developed scripts to migrate data from a proprietary database to PostgreSQL.
  • Responsible for validating Target data after applying different Informatica Transformations on the Source Data.
  • Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with differing schemas into Hive ORC tables (see the sketch after this list).
  • Developed merge jobs in Python to extract and load data into a MySQL database.
  • Developed PySpark programs, created DataFrames, and worked on transformations.
  • Created and modified several UNIX shell scripts according to changing project and client requirements; developed UNIX shell scripts to call Oracle PL/SQL packages and contributed to the standard framework.
  • Loaded the Delta Lake files and historical files into Hive in batch mode and automated the process using cron jobs.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse, and created DAGs to run in Airflow.
  • Worked on the NiFi data pipeline to process large sets of data and configured lookups for data validation and integrity.
  • Imported required tables from the RDBMS to HDFS using Sqoop and used PySpark RDDs for real-time streaming of data into HBase.
  • Analyzed the SQL scripts and designed solutions to implement using PySpark.
  • Gained extensive experience creating pipeline jobs, scheduling triggers, and mapping data flows using Azure Data Factory (V2), and using Key Vault to store credentials.
  • Modified the cassandra.yaml and cassandra-env.sh files to set configuration properties such as node addresses, memtable sizes, and flush times.
  • Deployed a Windows Kubernetes (K8s) cluster with Azure Container Service (ACS) from the Azure CLI, and utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy with Octopus Deploy.
  • Automated the resulting scripts and workflows using Apache Airflow and shell scripting to ensure daily execution in production.
  • Involved in creating data models for customer data using Cassandra Query Language, and performed benchmarking of the NoSQL databases Cassandra and HBase.
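
As a concrete illustration of the CSV-to-Hive-ORC loading mentioned above, here is a minimal PySpark sketch; the landing directory and target table are hypothetical, and schema inference is shown only as one possible way to handle varying input files.

```python
# Illustrative PySpark job: read landed CSV files and append them to a Hive ORC table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-to-hive-orc")
         .enableHiveSupport()           # required to write Hive-managed tables
         .getOrCreate())

raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)     # infer column types from the data
       .csv("/data/landing/orders/"))   # placeholder landing path

(raw.write
    .mode("append")
    .format("orc")
    .saveAsTable("staging.orders_orc"))  # placeholder Hive table
```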

Environment: Python, Django, Unix shell scripting, Oracle, DB2, HDFS, Kafka, Storm, Spark, ETL, Pig, Linux, HiveQL, Cassandra, MapReduce, Toad, SQL, Scala, MySQL Workbench, XML, NoSQL, Solr, HBase, Hive, Sqoop, Flume, Talend, Oozie.

Confidential, KY

Big Data Engineer

Responsibilities:

  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Developed Spark scripts in Scala per requirements using the Spark 1.5 framework.
  • Analyzed the data by running Hive queries and Pig scripts to understand user behavior.
  • Implemented Kafka and Spark Streaming pipelines to ingest real-time streaming data (see the sketch after this list).
  • Worked on data serialization formats for converting complex objects into sequences of bits using Avro, Parquet, JSON, and CSV formats.
  • Worked on creating tabular models in Azure Analysis Services to meet business reporting requirements.
  • Gained good experience working with Azure Blob and Data Lake storage and loading data into Azure Synapse Analytics (DW).
  • Used Informatica as the ETL tool to pull data from source systems and files, then cleanse, transform, and load the data into Teradata using Teradata utilities.
  • Dockerized applications by creating Docker images from Dockerfiles.
  • Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka, and Postgres running on other instances using SSL handshakes in QA and production environments.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
  • Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run in Airflow.
  • Worked on a DW based on Delta Lake tables and used the Spark connector for Power BI reporting.
  • Developed the application using Agile methodology and followed Test-Driven Development (TDD) and Scrum.
  • Gained good experience handling data manipulation using Python scripts.
  • Worked on transformations and actions on RDDs, Spark Streaming, pair RDD operations, checkpointing, and SBT.
  • Implemented a POC to migrate MapReduce jobs to Spark RDD transformations using Scala.
  • Worked on Jenkins to manage the weekly build, test, and deploy chain as a CI/CD process, using SVN/Git with a Dev/Test/Prod branching model for weekly releases.
  • Implemented Copy activities and custom Azure Data Factory pipeline activities for on-cloud ETL processing.
  • Created reports per user requirements using SQL Server Reporting Services (SSRS), which delivers enterprise and web-enabled reports.
  • Maintained current data warehouse jobs loading into SQL Database and migrated these jobs to load Snowflake.
  • Utilized Kafka and NiFi to bring real-time streaming data into one of the source systems for EPM.
  • Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python.
  • Developed shell, Perl, and Python scripts to automate and provide control flow to Pig scripts.
  • Developed Unix shell scripts to process files on a daily basis, such as renaming files, extracting dates from file names, unzipping files, and removing junk characters before loading them into the base tables.
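
A rough sketch of a Kafka-to-Spark streaming ingest like the pipeline mentioned above, written against the newer Structured Streaming API rather than the Spark 1.5 DStream API used at the time; the broker address, topic, and paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath.

```python
# Illustrative Structured Streaming job: read a Kafka topic and persist it as Parquet.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
          .option("subscribe", "events")                     # placeholder topic
          .load()
          .select(F.col("value").cast("string").alias("payload")))

query = (stream.writeStream
         .format("parquet")
         .option("path", "/data/streams/events/")
         .option("checkpointLocation", "/data/checkpoints/events/")
         .start())

query.awaitTermination()
```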

Environment: Hadoop, MapReduce, HDFS, Spark, Scala, Bash scripting, Python, Kafka, Hive, Maven, Jenkins, Pig, UNIX, Subversion, Hortonworks, IBM Tivoli.

Confidential, WI

Data Engineer

Responsibilities:

  • Converted the SQL Server Stored procedures and views to work with Snowflake.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
  • Developed a Python utility to validate HDFS tables against source tables (see the sketch after this list).
  • Designed and developed UDFs to extend functionality in both Pig and Hive.
  • Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
  • Responsible for developing Python wrapper scripts that extract a specific date range using Sqoop by passing the custom properties required for the workflow.
  • Involved in migration of data from existing RDBMS (Oracle and SQL Server) to Hadoop using Sqoop for data processing.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Implemented Copy activities and custom Azure Data Factory pipeline activities.
  • Primarily involved in data migration using SQL, Azure SQL, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
  • Installed MongoDB RPMs and tar files and prepared YAML config files.
  • Worked extensively on building NiFi data pipelines in a Docker container environment during the development phase.
  • Involved in developing Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.
  • Followed Test-Driven Development (TDD) and developed JUnit test cases for unit testing of each module developed.
  • Responsible for orchestrating CI/CD processes by responding to Git triggers, managing dependency chains, and setting up environments.
  • Created and executed SQL Server Integration Services packages to populate data from various data sources, and created packages for different data loading operations across many applications.
  • Maintained backup and restore activities for Subversion (SVN) and Jenkins.
  • Experience with centralized version control systems such as Subversion (SVN) and distributed version control systems such as Git.
  • Created SSIS packages to pull data from SQL Server and export it to Excel spreadsheets, and vice versa.
  • Loaded all data from relational DBs into Hive using Sqoop; also received four flat files from different vendors, all in different formats, e.g., text, EDI, and XML.
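
A rough sketch of the table-validation utility referenced above: a simple row-count comparison between a source RDBMS table and its Hive copy. The JDBC URL, credentials, and table names are made up, and the appropriate JDBC driver is assumed to be available to the Spark session.

```python
# Illustrative validation utility: compare row counts between a JDBC source table
# and the corresponding Hive table loaded into HDFS.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-table-validation")
         .enableHiveSupport()
         .getOrCreate())

source_count = (spark.read.format("jdbc")
                .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")  # placeholder
                .option("dbtable", "SALES.ORDERS")                      # placeholder
                .option("user", "etl_user")
                .option("password", "change_me")
                .load()
                .count())

target_count = spark.table("staging.orders").count()  # placeholder Hive table

if source_count != target_count:
    raise ValueError(f"Count mismatch: source={source_count}, target={target_count}")
print("Validation passed:", source_count, "rows")
```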

Environment: MySQL, Eclipse, PL/SQL, Apache Hadoop, HDFS, Hive, MapReduce, Cloudera, Pig, Sqoop, Kafka, Apache Cassandra, Oozie, Impala, Flume, Zookeeper.

Confidential

Hadoop Consultant

Responsibilities:

  • Developed Sqoop scripts to handle change data capture, processing incremental records between newly arrived and existing data in RDBMS tables (see the sketch after this list).
  • Loaded the aggregated data onto Oracle from the Hadoop environment using Sqoop for reporting on the dashboard.
  • Created base Hive scripts for analyzing requirements and processing data, designing the cluster to handle large amounts of data and to cross-examine data loaded through Hive and MapReduce jobs.
  • Worked closely with the DevOps team to understand, design, and develop end-to-end flow requirements, utilizing Oozie workflows to run Hadoop jobs.
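
A hedged example of the incremental, CDC-style Sqoop import described above, driven from a thin Python wrapper; the connection string, table, and column names are illustrative only, while the Sqoop flags shown (--incremental, --check-column, --last-value, --merge-key) are the standard incremental-import options.

```python
# Illustrative Python wrapper around an incremental Sqoop import (CDC-style).
# Connection details, table, and columns are placeholders.
import subprocess

def incremental_import(last_value: str) -> int:
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/sales",
        "--username", "etl_user",
        "--password-file", "/user/etl/.sqoop.pwd",
        "--table", "orders",
        "--incremental", "lastmodified",
        "--check-column", "updated_at",
        "--last-value", last_value,
        "--merge-key", "order_id",
        "--target-dir", "/data/raw/orders",
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    incremental_import("2020-01-01 00:00:00")
```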

Environment: MapReduce, HDFS, Hive, Pig, Impala, Cassandra 5.04, Spark, Scala, Solr, Java, SQL, Tableau, Zookeeper, Sqoop, Teradata, CentOS, Pentaho.
