
Senior Big Data Engineer Resume


Cincinnati, Ohio

SUMMARY

  • Big Data Engineer/Hadoop Developer with over 8 years of experience as a data engineer in designing, developing, deploying, and supporting large-scale distributed systems on on-premises Cloudera CDH/Hortonworks, AWS, and Azure.
  • Experience in implementing frameworks to import and export data between Hadoop and RDBMS.
  • Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, Amazon EMR) to fully implement and leverage new Hadoop features.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, controlling and granting database access, and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance
  • Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Replaced existing MR jobs and Hive scripts with Spark SQL & Spark data transformations for efficient data processing.
  • Experience in manipulating/analysing large datasets and finding patterns and insights within structured and unstructured data.
  • Used Informatica PowerCenter for extraction, transformation, and loading (ETL) of data from heterogeneous source systems into target databases.
  • Strong experience with ETL and/or orchestration tools (e.g. Informatica, Oozie, Airflow)
  • Experience in developing Spark applications using Spark RDD, Spark SQL, and DataFrame APIs.
  • Worked with real-time data processing and streaming techniques using Spark streaming and Kafka.
  • Experience in moving data into and out of the HDFS and Relational Database Systems (RDBMS) using Apache Sqoop.
  • Experience developing Kafka producers and Kafka consumers for streaming millions of events per second.
  • Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository.
  • Good understanding of data modelling (dimensional and relational) concepts such as Star Schema and Snowflake Schema modelling, and fact and dimension tables.
  • Database design, modeling, migration, and development experience using stored procedures, triggers, cursors, constraints, and functions; used MySQL, MS SQL Server, DB2, and Oracle.
  • Experience working with NoSQL database technologies, including Cassandra and HBase.
  • Expertise in working with the Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and developing and tuning HQL queries (see the sketch after this list).
  • Experience in writing complex SQL queries, creating reports and dashboards.
  • Experience with Software development tools such as JIRA, Play, GIT.
  • Experienced in using Agile methodologies including extreme programming, SCRUM and Test-Driven Development (TDD)
  • Determined, committed and hardworking individual with strong communication, interpersonal and organizational skills
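
As an illustration of the Hive partitioning and bucketing work noted above, the following is a minimal PySpark sketch. The database, table, and column names are illustrative only, and Spark SQL DDL stands in for the Hive CLI that would be used on a real cluster.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("hive-partition-bucket")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("CREATE DATABASE IF NOT EXISTS sales")  # illustrative database

    # Partitioned by load date and bucketed by customer_id so that joins and
    # aggregations on customer_id can take advantage of bucketing.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales.transactions (
            txn_id      STRING,
            customer_id STRING,
            amount      DOUBLE,
            load_date   STRING
        )
        USING parquet
        PARTITIONED BY (load_date)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
    """)

    # Tuned HQL-style query: the load_date filter prunes the scan to one partition.
    daily_spend = spark.sql("""
        SELECT customer_id, SUM(amount) AS daily_spend
        FROM sales.transactions
        WHERE load_date = '2021-06-01'
        GROUP BY customer_id
    """)
    daily_spend.show()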

TECHNICAL SKILLS

Big Data Eco-system: HDFS, MapReduce, Hive, Pig, Sqoop, Flume, GCP, HBase, Kafka, Oozie, Spark, Zookeeper, NiFi, Amazon Web Services, Customer 360.

Machine Learning: Decision Tree, LDA, Linear and Logistic Regression, Random Forest, Clustering: K-NN, K-Means, Neural Networks, ANN & RNN, PCA, SVM, Deep learning.

Python Libraries: NLP, pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn, Beautiful Soup.

Operating System: Linux (CentOS, Ubuntu), Windows (XP/7/8/10)

Languages: Java, Shell scripting, Pig Latin, Scala, Python, R

Databases: MySQL, Teradata, DB2, Oracle, Databricks; NoSQL: HBase, Cassandra, MongoDB

Hadoop Technologies and Distributions: Apache Hadoop, Cloudera CDH 5.13, MapR, PySpark

Application Servers: Apache Tomcat, JDBC, ODBC

BI Tools: Power BI, Tableau, Talend

PROFESSIONAL EXPERIENCE

Confidential, Cincinnati, Ohio

Senior Big Data Engineer

Responsibilities:

  • Developed automated regression scripts in Python for validation of ETL processes between multiple databases such as AWS Redshift, Oracle, and SQL Server (T-SQL).
  • Used Airflow for scheduling the Hive, Spark and MapReduce jobs.
  • Developed Spark code in Scala and Python for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Extracted, transformed, and loaded data sources to generate CSV data files using Python programming and SQL queries.
  • Extensively utilized Databricks notebooks for interactive analysis utilizing Spark APIs.
  • Responsible for the design and development of high-performance data architectures supporting data warehousing, real-time ETL, and batch big data processing.
  • Exported tables from Teradata to HDFS using Sqoop and built tables in Hive.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts.
  • Used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables, and handled structured data using Spark SQL.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
  • Developed RDDs/DataFrames in Spark and applied several transformation logics to load data from Hadoop data lakes.
  • Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
  • Developed Spark programs with Python and applied principles of functional programming to process complex structured data sets.
  • Worked with Hadoop infrastructure to store data in HDFS and used Spark/Hive SQL to migrate the underlying SQL codebase to AWS.
  • Used Python in Spark to extract data from Snowflake and upload it to Salesforce on a daily basis.
  • Worked with Hadoop ecosystem and Implemented Spark using Scala and utilized Data frames and Spark SQL API for faster processing of data.
  • Created notebooks using Databricks, Scala, and Spark, capturing data from Delta tables in Delta Lake.
  • Subscribed to Kafka topics with the Kafka consumer client and processed the events in real time using Spark.
  • Collected data using Spark Streaming from an AWS S3 bucket in near-real-time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Developed a Spark Streaming job to consume data from the Kafka topics of different source systems and push the data into S3 (see the sketch after this list).
  • Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
  • Filtered and cleaned data using Scala code and SQL queries.
  • Designed and Developed Real Time Stream Processing Application using Spark, Kafka, Scala and Hive to perform Streaming ETL and apply Machine Learning.
  • Use python to write a service which is event based using AWS Lambda to achieve real time data to One-Lake (A Data Lake solution in Cap-One Enterprise).
  • Designed Kafka producer client using Confluent Kafka and produced events into Kafka topic.
  • Responsible for gathering requirements, system analysis, design, development, testing, and deployment.
  • Developed reusable objects such as PL/SQL program units and libraries, database procedures and functions, and database triggers for use by the team, satisfying the business rules.
  • Wrote scripts in Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis, and imported and cleansed high-volume data from various sources such as DB2, Oracle, and flat files onto SQL Server.
  • Involved in relational and dimensional data modeling, creating logical and physical database designs and ER diagrams with all related entities and relationships based on the rules provided by the business manager, using Erwin r9.6.
  • Used Informatica PowerCenter for ETL extraction, transformation, and loading of data from heterogeneous source systems, and studied and reviewed the application of the Kimball data warehouse methodology as well as the SDLC across various industries to work successfully with data-handling scenarios.
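
A minimal sketch of the Kafka-to-S3 Spark Streaming pattern described above, written with Structured Streaming. The broker addresses, topic name, event schema, and S3 bucket are placeholders, and the spark-sql-kafka connector is assumed to be available on the classpath.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-s3").getOrCreate()

    # Assumed event schema; real source-system payloads would differ.
    schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_ts", TimestampType()),
    ])

    # Subscribe to the Kafka topic and parse the JSON value column.
    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder brokers
        .option("subscribe", "source_events")                # placeholder topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    # Land the parsed events in S3 as Parquet, with checkpointing for recovery.
    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3a://example-bucket/events/")
        .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
        .outputMode("append")
        .start()
    )
    query.awaitTermination()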

Environment: Hadoop, Spark, Hive, Python, PL/SQL, AWS, EC2, EMR, S3, Lambda, Auto Scaling, CloudWatch, CloudFormation, MapReduce, Azure, Databricks, Data Lake, Oracle 12c, flat files, TOAD, MS SQL Server, XML files, Kafka, MS Access, Autosys, UNIX, Erwin.

Confidential, Mountain View, CA

Big Data Engineer

Responsibilities:

  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data from and to different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including the write-back tool.
  • Worked on data that was a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
  • Wrote Databricks code and fully parameterized ADF pipelines for efficient code management.
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
  • Developed the features, scenarios, and step definitions for BDD (Behavior-Driven Development) and TDD (Test-Driven Development) using Cucumber, Gherkin, and Ruby.
  • Involved in all steps and the scope of the project's reference data approach to MDM; created a data dictionary and mapping from sources to the target in the MDM data model.
  • Experience managing Azure Data Lakes (ADLS) and Data Lake Analytics and an understanding of how to integrate with other Azure Services.
  • Transforming business problems into Big Data solutions and define Big Data strategy and Roadmap. Installing, configuring, and maintaining Data Pipelines.
  • Hands-on experience working with Databricks and Delta Lake tables.
  • Worked extensively on performance tuning of Spark jobs.
  • Developed Databricks Python notebooks to join, filter, pre-aggregate, and process files stored in Azure Data Lake Storage (see the sketch after this list).
  • Built a new CI pipeline with testing and deployment automation using Docker, Swarm, Jenkins, and Puppet; utilized continuous integration and automated deployments with Jenkins and Docker.
  • Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn.
  • Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL Activity; built an ETL that utilizes a Spark JAR to execute the business analytical model.
  • Performed data integration to ingest, transform, and integrate structured data and deliver it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
  • Worked with the data lake as a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning.
  • Developed multiple MapReduce jobs to perform data cleaning and pre-processing.
  • Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources.
  • Hands-on experience working with different file formats (Avro, Parquet, JSON, XML, CSV).
  • Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake Schemas.
  • Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization
  • Performed all necessary day-to-day GIT support for different projects, Responsible for design and maintenance of the GIT Repositories, and the access control strategies.
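
A minimal sketch of the Databricks notebook pattern described above: joining, filtering, and pre-aggregating files stored in Azure Data Lake Storage. The storage account, container, paths, and column names are assumptions, and the Delta write relies on the Delta Lake support available on Databricks.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adls-preaggregate").getOrCreate()

    # Placeholder ADLS Gen2 paths; real account, container, and folder names differ.
    orders = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/orders/")
    customers = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/customers/")

    # Join, filter, and pre-aggregate into daily totals per region.
    daily_totals = (
        orders.filter(F.col("status") == "COMPLETE")
        .join(customers, on="customer_id", how="inner")
        .groupBy("region", F.to_date("order_ts").alias("order_date"))
        .agg(
            F.sum("amount").alias("total_amount"),
            F.countDistinct("customer_id").alias("active_customers"),
        )
    )

    # Write the curated result back to the lake as Delta.
    (
        daily_totals.write.format("delta")
        .mode("overwrite")
        .save("abfss://curated@examplelake.dfs.core.windows.net/daily_totals/")
    )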

Environment: Hadoop, Kafka, Spark, Sqoop, Docker, Swarm, Azure, Azure HDInsight, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, Azure Databricks, Azure Data Storage, Azure Data Lake, Azure SQL, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper.

Confidential, Eagan, MN

Data Engineer

Responsibilities:

  • Involved in writing optimized Pig scripts along with developing and testing Pig Latin scripts.
  • Involved in transforming data from mainframe tables to HDFS and HBase tables using Sqoop.
  • Visualized the results using Tableau dashboards; the Python Seaborn library was used for data interpretation in deployment.
  • Used the REST API to access HBase data and perform analytics.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Responsible for bringing in data under HBase using the HBase shell as well as the HBase client API.
  • Experienced in handling administration activities using Cloudera Manager.
  • Created and maintained technical documentation for launching Hadoop clusters and for executing Pig scripts.
  • Worked on a POC for IoT device data with Spark.
  • Automatically scaled up the EMR instances based on the data.
  • Developed MapReduce programs to process the Avro files and obtain results by performing calculations on the data, and also performed map-side joins.
  • Imported bulk data into HBase using MapReduce programs.
  • Used Scala to store streaming data to HDFS and to implement Spark for faster processing of data.
  • Worked on creating the RDDs and DataFrames for the required input data and performed the data transformations using Spark with Python (see the sketch after this list).
  • Migrated complex MapReduce programs into in-memory Spark processing using transformations and actions.
  • Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generating visualizations using Tableau.
  • Experience working with Apache Solr for indexing and querying.
  • Created custom Solr query segments to optimize search matching.
  • Stored the time-series transformed data from the Spark engine built on top of a Hive platform to Amazon S3 and Redshift.
  • Facilitated deployment of multi-clustered environment using AWS EC2 and EMR apart from deploying Dockers for cross-functional deployment.
  • Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
  • Created ETL mappings with Talend Integration Suite to pull data from sources, apply transformations, and load data into the target database.
  • Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Flume, and stored the data in HDFS for analysis.
  • Involved in developing Spark SQL queries and DataFrames to import data from data sources, perform transformations, perform read/write operations, and save the results to an output directory in HDFS.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Designed and implemented incremental imports into Hive tables.
  • Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data.
  • Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.
  • Imported and exported data between different relational data sources such as DB2, SQL Server, and Teradata and HDFS using Sqoop.
  • Involved in data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
  • Collected data using Spark Streaming from an AWS S3 bucket in near-real-time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs that run independently based on time and data availability.
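
A minimal sketch of the RDD/DataFrame work described above, moving a MapReduce-style aggregation into in-memory Spark. The HDFS paths, delimiter, and database/table names are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.appName("mr-to-spark")
        .enableHiveSupport()
        .getOrCreate()
    )
    sc = spark.sparkContext

    # RDD route: a map/reduce-style count expressed as transformations plus an action.
    lines = sc.textFile("hdfs:///data/raw/events/*.txt")  # placeholder path
    counts = (
        lines.map(lambda line: line.split("\t"))
        .filter(lambda fields: len(fields) >= 2)
        .map(lambda fields: (fields[0], 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.saveAsTextFile("hdfs:///data/derived/event_counts_rdd")

    # DataFrame route: the same data aggregated and saved as a Hive table.
    df = spark.read.option("sep", "\t").csv("hdfs:///data/raw/events/", inferSchema=True)
    summary = df.groupBy("_c0").agg(F.count("*").alias("event_count"))
    spark.sql("CREATE DATABASE IF NOT EXISTS analytics")
    summary.write.mode("overwrite").saveAsTable("analytics.event_counts")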

Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, AWS, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java, Solr.

Confidential

Hadoop Developer

Responsibilities:

  • The custom File System plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
  • Made extensive use of Expressions, Variables, and Row Count in SSIS packages, and wrote jobs in Java for data cleaning and pre-processing.
  • Wrote MapReduce job using Pig Latin. Involved in ETL, Data Integration and Migration.
  • Created Hive tables and worked on them using HiveQL. Experienced in defining job flows.
  • Importing and exporting data into HDFS from Oracle Database and vice versa using Sqoop.
  • Created batch jobs and configuration files to create automated process using SSIS.
  • Created SSIS packages to pull data from SQL Server and exported to Excel Spreadsheets and vice versa.
  • Deploying and scheduling reports using SSRS to generate daily, weekly, monthly and quarterly reports.
  • Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs. Developed a custom File System plugin for Hadoop so it can access files on the Data Platform.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
  • Set up and benchmarked Hadoop/HBase clusters for internal use.
  • Performed data validation and cleansing of staged input records before loading into the Data Warehouse.
  • Automated the process of extracting various files, such as flat and Excel files, from sources like FTP and SFTP (Secure FTP); see the sketch after this list.
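
A minimal sketch of the automated FTP/SFTP file extraction described above. Host names, credentials, and paths are placeholders, and paramiko is an assumed choice of SFTP client rather than the tool originally used.

    import os
    from ftplib import FTP

    import paramiko  # third-party SFTP/SSH client: pip install paramiko

    LOCAL_DIR = "/data/staging"  # placeholder staging folder

    def pull_from_ftp(host, user, password, remote_file):
        """Download a single file from a plain FTP server into the staging folder."""
        with FTP(host) as ftp:
            ftp.login(user, password)
            local_path = os.path.join(LOCAL_DIR, os.path.basename(remote_file))
            with open(local_path, "wb") as fh:
                ftp.retrbinary("RETR " + remote_file, fh.write)
        return local_path

    def pull_from_sftp(host, user, password, remote_file):
        """Download a single file over SFTP (Secure FTP) into the staging folder."""
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(host, username=user, password=password)
        try:
            sftp = client.open_sftp()
            local_path = os.path.join(LOCAL_DIR, os.path.basename(remote_file))
            sftp.get(remote_file, local_path)
            return local_path
        finally:
            client.close()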

Environment: Hadoop, CDH, MapReduce, Pig, MS SQL Server, SQL Server Business Intelligence Development Studio, Hive, HBase, SSIS, Office, Excel, Flat Files, T-SQL.

Confidential

SQL Developer

Responsibilities:

  • Worked on both the back end and front end in application development, primarily on the back end.
  • Involved in providing inputs for estimate preparation for the new proposal.
  • Developed monitoring and notification tools using Python.
  • Wrote tuned SQL queries for data retrieval involving complex join conditions.
  • Developed a wrapper in Python for instantiating a multi-threaded application.
  • Queried the MySQL database from Python using the MySQL Connector/Python and MySQLdb packages to retrieve information (see the sketch after this list).
  • Implemented the database to store questions, possible answers, correct answers, and user scores, and queried it using SQL.
  • Interacted with business analysts to develop modeling techniques.
  • As a SQL Developer, worked closely with application developers to ensure proper design and implementation of database systems.
  • Read data from flat files and loaded it into the database using SQL*Loader.
  • Created and optimized diverse SQL queries to validate accuracy of data to ensure database integrity.
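
A minimal sketch of querying MySQL from Python with MySQL Connector/Python, along the lines described above. The connection details and the questions/answers/scores schema are placeholders.

    import mysql.connector  # third-party: pip install mysql-connector-python

    # Placeholder connection details for the quiz-style database described above.
    cnx = mysql.connector.connect(
        host="localhost", user="app_user", password="***", database="quiz_db"
    )
    try:
        cursor = cnx.cursor(dictionary=True)
        # Tuned retrieval query with explicit join conditions, e.g. top scores per user.
        cursor.execute(
            """
            SELECT u.user_name, q.question_text, s.score
            FROM scores s
            JOIN users u ON u.user_id = s.user_id
            JOIN questions q ON q.question_id = s.question_id
            WHERE s.score >= %s
            ORDER BY s.score DESC
            """,
            (80,),
        )
        for row in cursor.fetchall():
            print(row["user_name"], row["score"])
    finally:
        cnx.close()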

Environment: SQL, MySQL, SSIS, SQL Server, GitHub, Python, Anvil, Windows
