Big Data Spark Engineer Resume
SUMMARY
- 7+ years of professional experience in the IT industry, developing, implementing, and maintaining web-based applications using Java and the Big Data ecosystem on Windows and Linux environments.
- Around 4 years of experience with Hadoop/Big Data technologies for the storage, querying, processing, and analysis of data.
- Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Knowledge of installing, configuring, and using Hadoop ecosystem components such as MapReduce, HDFS, HBase, Oozie, Hive, Sqoop, Pig, ZooKeeper, and Flume.
- Experience in managing and reviewing Hadoop log files.
- Experience analyzing data using HiveQL, Pig Latin, HBase, and custom MapReduce programs in Java.
- Experience importing and exporting data with Sqoop between HDFS and relational database systems.
- Extended Hive and Pig core functionality by writing custom UDFs.
- Implemented a POC to migrate MapReduce jobs to Spark RDD transformations using Scala (a PySpark sketch of this pattern follows this summary).
- Developed Apache Spark jobs using Scala in a test environment for faster data processing and used Spark SQL for querying.
- Experienced with Spark Core, Spark RDDs, pair RDDs, and Spark deployment architectures.
- Experienced in performing real-time analytics on NoSQL databases such as HBase and Cassandra.
- Worked on AWS EC2, EMR, and S3 to create clusters and manage data in S3.
- Good working knowledge of Impala, Storm, and Kafka.
- Experienced with dimensional modeling, data migration, data cleansing, data profiling, and ETL processes for data warehouses.
- Working knowledge of Amazon Elastic Compute Cloud (EC2) for computational tasks and Simple Storage Service (S3) for storage.
- Ran Apache Hadoop, CDH, and MapR distributions, including Elastic MapReduce (EMR) on EC2.
- Expertise in developing Pig Latin scripts and Hive Query Language.
- Developed and maintained web applications using the Apache Tomcat web server.
- Experience with source control repositories such as SVN, CVS, and GitHub.
- Good experience with the Software Development Life Cycle (SDLC).
- Prepared technical reports by collecting, analyzing, and summarizing information and trends.
- Contributed to team efforts as needed, including strategic and business planning within the various departments and programs of the client group.
- Knowledge of Google Cloud Platform.
- Used MS Word and Excel for project documentation.
- Experience migrating on-premises workloads to Windows Azure for cloud disaster recovery using Azure Recovery Services vaults and Azure backups.
- Experience writing subqueries, stored procedures, triggers, cursors, and functions on MySQL and PostgreSQL databases.
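Illustrative example (not taken from the original projects): a minimal PySpark sketch of the MapReduce-to-RDD migration pattern referenced above. The input/output paths and application name are placeholders, and the original POC was written in Scala.

    # Minimal sketch (hypothetical paths/names): a classic word-count MapReduce job
    # re-expressed as Spark RDD transformations. PySpark is shown for illustration;
    # the original POC used Scala.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mr-to-rdd-poc").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("hdfs:///data/input/logs")           # map-phase input
    counts = (lines.flatMap(lambda line: line.split())       # mapper: split into words
                   .map(lambda word: (word, 1))              # mapper: emit (key, 1)
                   .reduceByKey(lambda a, b: a + b))         # reducer: sum counts per key

    counts.saveAsTextFile("hdfs:///data/output/word_counts")
    spark.stop()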
TECHNICAL SKILLS
Big Data Ecosystems: Hadoop, HDFS, YARN, MapReduce, Hive, Pig, Sqoop, Spark, HBase, ZooKeeper, Oozie, Flume, Kafka, Azure.
Programming Languages: SQL, Scala, Python and HQL
NoSQL Databases: HBase, MongoDB, Cassandra
Databases: SQL Server, MySQL, Oracle 8i/9i/10g
Cloud Ecosystem: Amazon Web Services (S3, EC2), Azure
Hadoop Distributions: Cloudera, MapR, Hortonworks
Java Experience: Java, J2EE, Servlets, JSP, HTML, JavaScript, CSS, Eclipse
Operating Systems: Microsoft Windows, Linux, UNIX
Development Tools: IntelliJ, Eclipse
Build Tools: Maven, SBT
Version Control Tools: GitHub, SVN
PROFESSIONAL EXPERIENCE
Confidential
Big Data Spark Engineer
Responsibilities:
- Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Zookeeper and Spark.
- Analyzed the SQL scripts and designed the solution to implement using PySpark.
- Designed and implemented a MapReduce-based, large-scale parallel relation-learning system.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Involved in implementing designs across the key phases of the Software Development Life Cycle (SDLC), including development, testing, implementation, and maintenance support.
- Installed and configured a multi-node, fully distributed Hadoop cluster.
- Developed and delivered quality services on time and on budget; solutions developed by the team used Java, XML, HTTP, SOAP, Hadoop, Pig, and other web technologies.
- Involved in end-to-end data processing, including ingestion, processing, quality checks, and splitting.
- Imported data into HDFS from various SQL databases and files using Sqoop, and from streaming systems into the big data lake using Storm.
- Involved in scripting (Python and shell) to provision and spin up virtualized Hadoop clusters.
- Worked with NoSQL databases such as HBase to create tables and store data. Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Wrote Pig scripts to store data in HBase.
- Created Hive tables, dynamic partitions, and buckets for sampling, and worked on them using HiveQL.
- Exported the analyzed data to Teradata using Sqoop for visualization and to generate reports for the BI team.
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Extracted data from RDBMSs through Sqoop, placed it in HDFS, and processed it.
- Used Spark Streaming to collect this data from Kafka in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in a NoSQL store (HBase).
- Involved in Installing Hadoop Ecosystem components.
- Worked on loading structured and semi-structured data into HDFS using Sqoop.
- Involved in copying large data from Amazon S3 buckets to HDFS.
- Used big data analytics and processing tools (Hive, Spark Core, Spark SQL) for batch processing of large datasets on the Hadoop cluster.
- Implemented several highly distributed, scalable, large applications using Cloudera Hadoop.
- Migrated streaming and static RDBMS data into the Hadoop cluster from dynamically generated files using Flume and Sqoop.
- Worked with Linux systems and RDBMS databases on a regular basis to ingest data using Sqoop.
- Captured data and imported it into HDFS using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.
- Implemented Spark SQL queries, Hive queries and performed transformations on data frames.
- Used Sqoop, FTP, APIs, and SQS notifications on S3 PUT events to pull data into HDFS.
- Performed data Aggregation operations using Spark SQL queries.
- Implemented Hive Partitioning and bucketing for data analytics.
- Used Maven as the build tool for the project.
- Configured AWS Lambda to trigger the Spark job on every S3 PUT request (see the sketch after this list).
- Used GitHub as code repository and version control system.
- Involved in working with Sqoop to export the data from Hive to S3 buckets.
- Involved in developing a Big Data time series project that predicts the prices of spot instances using AWS components.
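Illustrative example of the S3-triggered job submission mentioned above: a hedged sketch of an AWS Lambda handler that submits a spark-submit step to an existing EMR cluster when an object lands in the bucket. The cluster ID, code bucket, and script path are hypothetical placeholders, not the production configuration.

    # Hypothetical sketch: Lambda handler reacting to an S3 PUT event and submitting
    # a Spark step to an existing EMR cluster. Cluster ID and script paths are placeholders.
    import boto3

    emr = boto3.client("emr")

    def handler(event, context):
        # S3 event notifications carry the bucket and object key that triggered the function.
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        response = emr.add_job_flow_steps(
            JobFlowId="j-XXXXXXXXXXXXX",          # placeholder EMR cluster id
            Steps=[{
                "Name": f"process {key}",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",  # standard way to run spark-submit on EMR
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://my-code-bucket/jobs/process_landing_file.py",  # placeholder script
                        f"s3://{bucket}/{key}",
                    ],
                },
            }],
        )
        return response["StepIds"]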
Environment: HDFS, Apache Spark, Apache Hive, Python, Oozie, Apache Kafka, Apache Sqoop, Agile Methodology, AWS, Amazon S3, Putty, WinSCP.
Confidential - Albany, NY
Data Engineer
Responsibilities:
- Involved in end-to-end data processing, including ingestion, processing, quality checks, and splitting.
- Imported data into HDFS from MySQL databases and files using Sqoop, and from streaming systems into the big data lake using Storm.
- Involved in scripting (Python and shell) to provision and spin up virtualized Hadoop clusters.
- Worked with NoSQL databases to create tables and store data. Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Wrote Pig scripts to store data in HBase.
- Created Hive tables, dynamic partitions, and buckets for sampling, and worked on them using HiveQL.
- Developed code to create XML files and flat files from data retrieved from databases and XML sources.
- Extracted data from RDBMSs through Sqoop, placed it in HDFS, and processed it.
- Migrated streaming and static RDBMS data into the Hadoop cluster from dynamically generated files using Flume and Sqoop.
- Worked with Linux systems and RDBMS databases on a regular basis to ingest data using Sqoop.
- Captured data and imported it into HDFS using Flume and Kafka for semi-structured data and Sqoop for existing relational databases.
- Identified and ingested source data from different systems into HDFS using Sqoop and Flume, creating HBase tables to store variable data formats for data analytics.
- Mapped data to HBase tables and implemented SQL queries to retrieve it.
- Streamed events from HBase to Solr using the HBase Indexer.
- Collected streaming data from Kafka in near real time using Spark Streaming (a sketch follows this list).
- Performed the necessary transformations and aggregations on the fly to build the common learner data model and persisted the data in a NoSQL store (HBase).
- Loaded data into the cluster from dynamically generated files using Flume and from RDBMSs using Sqoop.
- Involved in writing Java APIs for interacting with HBase.
- Involved in writing Flume and Hive scripts to extract, transform, and load data into the database.
- Participated in development/implementation of Cloudera Hadoop environment.
- Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
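Illustrative example of the Kafka streaming flow described above: a minimal Spark Structured Streaming sketch that reads events from Kafka, aggregates them on the fly, and hands each micro-batch to a writer. The broker address, topic name, schema, and stand-in write function are assumptions; an actual HBase write would go through a separate HBase-Spark connector.

    # Hypothetical sketch: consume learner events from Kafka with Spark Structured
    # Streaming, aggregate per user, and hand each micro-batch to a writer.
    # Broker address, topic name, and the HBase write are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, count, window

    spark = SparkSession.builder.appName("learner-stream").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
              .option("subscribe", "learner-events")              # placeholder topic
              .load())

    # Kafka keys/values arrive as bytes; a real job would parse JSON/Avro payloads here.
    parsed = events.selectExpr("CAST(key AS STRING) AS user_id",
                               "CAST(value AS STRING) AS payload",
                               "timestamp")

    # Rolling activity counts per user over 5-minute windows.
    activity = (parsed
                .groupBy(window(col("timestamp"), "5 minutes"), col("user_id"))
                .agg(count("*").alias("events")))

    def write_batch(batch_df, batch_id):
        # Placeholder writer: a real pipeline would persist to HBase via a connector
        # (e.g. the HBase-Spark module); console output stands in for that here.
        batch_df.show(truncate=False)

    query = (activity.writeStream
             .outputMode("update")
             .foreachBatch(write_batch)
             .start())
    query.awaitTermination()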
Environment: Cloudera, Hadoop, HDFS, HBase, MapReduce, Hive, Hive UDFs, Pig, Sqoop, ZooKeeper, Spark, RDBMS, Kafka, Teradata, Java, XML, HTTP, SOAP, and Flume.
Confidential, Mooresville, NC
Data Analyst
Responsibilities:
- Experience working on projects involving machine learning, big data, data visualization, R and Python development, Unix, and SQL.
- Performed exploratory data analysis using NumPy, Matplotlib, and pandas.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.
- Experience analyzing data with the help of Python libraries including Pandas, NumPy, SciPy and Matplotlib.
- Configured AWS Identity and Access Management (IAM) Groups and Users for improved login authentication.
- Conducted systems design, feasibility, and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).
- Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear, concise specifications and queries.
- Prepared high-level analysis reports with Excel and Tableau, and provided feedback on data quality, including identification of billing patterns and outliers.
- Identified and documented data quality limitations that affected internal and external data analysts. Wrote standard SQL queries to perform data validation, created Excel summary reports (pivot tables and charts), and gathered analytical data to develop functional requirements using data modeling and ETL tools.
- Read data from sources such as CSV files, Excel, HTML pages, and SQL databases, performed data analysis, and wrote results back to CSV, Excel, or a database.
- Experience using lambda functions with filter(), map(), and reduce() on pandas DataFrames to perform various operations.
- Used the pandas API for analyzing time series and created a regression test framework for new code (a pandas sketch follows this list).
- Developed and handled business logic through backend Python code.
- Created templates for page rendering and Django views for the business logic.
- Used the Django REST framework and integrated new and existing API endpoints.
- Utilized PyUnit for unit testing of the application.
- Performed data analysis using Google APIs and created visualizations such as pie charts and waterfall charts displayed in the web application.
- Extensive knowledge of loading data into charts using Python code.
- Passed data to Highcharts to create interactive JavaScript charts for the web application.
- Extensive knowledge of Python libraries such as os, pickle, NumPy, and SciPy.
- Used Bitbucket for version control and team coordination.
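Illustrative example of the pandas analysis pattern described above (CSV ingestion, lambda-style filtering and mapping, and pivot-table summaries). The file and column names are hypothetical.

    # Hypothetical sketch: read a CSV extract, clean and filter it with lambda-style
    # operations, build a pivot-table summary, and write an Excel report.
    # File and column names are placeholders.
    import pandas as pd

    billing = pd.read_csv("billing_extract.csv", parse_dates=["invoice_date"])

    # Normalize a text column and keep only positive charges.
    billing["region"] = billing["region"].map(lambda r: str(r).strip().upper())
    billing = billing[billing["amount"].map(lambda a: a > 0)]

    # Pivot-table summary: total and average amount per region and month.
    billing["month"] = billing["invoice_date"].dt.to_period("M")
    summary = pd.pivot_table(
        billing,
        index="region",
        columns="month",
        values="amount",
        aggfunc=["sum", "mean"],
    )

    summary.to_excel("billing_summary.xlsx")  # Excel summary report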
Environment: Python, PyQuery, HTML5, CSS3, Apache Spark, Django, SQL, UNIX, Linux, Windows, Oracle, NoSQL, PostgreSQL, AWS, and Python libraries such as PySpark and NumPy.